[ https://issues.apache.org/jira/browse/TINKERPOP-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205114#comment-16205114 ]
ASF GitHub Bot commented on TINKERPOP-1786:
-------------------------------------------

Github user pluradj commented on a diff in the pull request:

https://github.com/apache/tinkerpop/pull/721#discussion_r144655107

--- Diff: docs/src/recipes/olap-spark-yarn.asciidoc ---
@@ -0,0 +1,153 @@

////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////
[[olap-spark-yarn]]
OLAP traversals with Spark on Yarn
----------------------------------

TinkerPop's combination of http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]
and http://tinkerpop.apache.org/docs/current/reference/#_properties_files[HadoopGraph] allows for running
distributed, analytical graph queries (OLAP) on a computer cluster. The
http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[reference documentation] covers the cases
where Spark runs locally or where the cluster is managed by a Spark server. However, many users can only run OLAP jobs
via the http://hadoop.apache.org/[Hadoop 2.x] Resource Manager (Yarn), which requires `SparkGraphComputer` to be
configured differently. This recipe describes that configuration.

Approach
~~~~~~~~

Most configuration problems of TinkerPop with Spark on Yarn stem from three reasons:

1. `SparkGraphComputer` creates its own `SparkContext`, so it does not get any configuration settings from the usual `spark-submit` command.
2. The TinkerPop Spark plugin did not include Spark on Yarn runtime dependencies until version 3.2.7/3.3.1.
3. Resolving reason 2 by adding the cluster's `spark-assembly` jar to the classpath creates a host of version
conflicts, because Spark 1.x dependency versions have remained frozen since 2014.

The current recipe follows a minimalist approach in which no dependencies are added beyond those
included in the TinkerPop binary distribution. The Hadoop cluster's Spark installation is completely ignored. This
approach minimizes the chance of dependency version conflicts.

Prerequisites
~~~~~~~~~~~~~
This recipe is suitable for both a real external Hadoop cluster and a local pseudo-cluster. While the recipe is maintained
against the vanilla Hadoop pseudo-cluster, it has been reported to work on real clusters with Hadoop distributions
from various vendors.

If you want to try the recipe on a local Hadoop pseudo-cluster, the easiest way to install
one is to look at the install script at https://github.com/apache/tinkerpop/blob/x.y.z/docker/hadoop/install.sh
and the `start hadoop` section of https://github.com/apache/tinkerpop/blob/x.y.z/docker/scripts/build.sh.

This recipe assumes that you installed the Gremlin Console with the
http://tinkerpop.apache.org/docs/x.y.z/reference/#spark-plugin[Spark plugin] (the
http://tinkerpop.apache.org/docs/x.y.z/reference/#hadoop-plugin[Hadoop plugin] is optional). Your Hadoop cluster
may have been configured to use file compression, e.g. LZO compression. If so, you need to copy the relevant
jar (e.g. `hadoop-lzo-*.jar`) to the Gremlin Console's `ext/spark-gremlin/lib` folder.
To start the Gremlin Console in the right environment, create a shell script (e.g. `bin/spark-yarn.sh`) with the
contents below. Of course, actual values for `GREMLIN_HOME`, `HADOOP_HOME` and `HADOOP_CONF_DIR` need to be adapted to
your particular environment.

[source]
----
#!/bin/bash
# Variables to be adapted to the actual environment
GREMLIN_HOME=/home/yourdir/lib/apache-tinkerpop-gremlin-console-x.y.z-standalone
export HADOOP_HOME=/usr/local/lib/hadoop-2.7.2
export HADOOP_CONF_DIR=/usr/local/lib/hadoop-2.7.2/etc/hadoop

# Have TinkerPop find the Hadoop cluster configs and Hadoop native libraries
export CLASSPATH=$HADOOP_CONF_DIR
export JAVA_OPTIONS="-Djava.library.path=$HADOOP_HOME/lib/native:$HADOOP_HOME/lib/native/Linux-amd64-64"

# Start gremlin-console without getting the HADOOP_GREMLIN_LIBS warning
cd $GREMLIN_HOME
[ ! -e empty ] && mkdir empty
export HADOOP_GREMLIN_LIBS=$GREMLIN_HOME/empty
bin/gremlin.sh
----

Running the job
~~~~~~~~~~~~~~~

You can now run a Gremlin OLAP query with Spark on Yarn:

[source]
----
$ hdfs dfs -put data/tinkerpop-modern.kryo .
$ . bin/spark-yarn.sh
----

[gremlin-groovy]
----
hadoop = System.getenv('HADOOP_HOME')
hadoopConfDir = System.getenv('HADOOP_CONF_DIR')
archive = 'spark-gremlin.zip'
archivePath = "/tmp/$archive"
['bash', '-c', "rm $archivePath 2>/dev/null; cd ext/spark-gremlin/lib && zip $archivePath *.jar"].execute()
conf = new PropertiesConfiguration('conf/hadoop/hadoop-gryo.properties')
conf.setProperty('spark.master', 'yarn-client')
conf.setProperty('spark.yarn.dist.archives', "$archivePath")
conf.setProperty('spark.yarn.appMasterEnv.CLASSPATH', "./$archive/*:$hadoopConfDir")
conf.setProperty('spark.executor.extraClassPath', "./$archive/*:$hadoopConfDir")
conf.setProperty('spark.driver.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
conf.setProperty('spark.executor.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
conf.setProperty('gremlin.spark.persistContext', 'true')
graph = GraphFactory.open(conf)
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().group().by(values('name')).by(both().count())
----

If you run into exceptions, the best way to see what is going on is to look at the Yarn Resource Manager UI
(e.g. http://rm.your.domain:8088/cluster) to find the `applicationId` and get the logs using

--- End diff --

could also use `yarn application -list` for a command line approach

> Recipe and missing manifest items for Spark on Yarn
> ---------------------------------------------------
>
>         Key: TINKERPOP-1786
>         URL: https://issues.apache.org/jira/browse/TINKERPOP-1786
>     Project: TinkerPop
>  Issue Type: Improvement
>  Components: hadoop
> Affects Versions: 3.3.0, 3.1.8, 3.2.6
> Environment: gremlin-console
>    Reporter: Marc de Lignie
>    Priority: Minor
>     Fix For: 3.2.7, 3.3.1
>
>
> Thorough documentation for running OLAP queries on Spark on Yarn has been
> missing, keeping some users from getting the benefits of this nice feature of
> the TinkerPop stack and resulting in a significant number of questions on the
> gremlin users list.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
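The command-line approach mentioned in the review comment can be sketched as a small shell helper. The helper name, the grep/awk pattern, and the canned sample line below are illustrative assumptions; on a real cluster you would pipe live `yarn application -list` output into the helper and then fetch logs with `yarn logs -applicationId`:

```shell
#!/bin/sh
# Sketch: pick an applicationId out of `yarn application -list` style output,
# then fetch its logs. Column layout of real output may differ per Hadoop version.

# Print the id of the last listed application in RUNNING state.
latest_running_app_id() {
  awk '$1 ~ /^application_/ && $0 ~ /RUNNING/ { id = $1 } END { print id }'
}

# Canned sample line standing in for live `yarn application -list` output:
sample='application_1507000000000_0001  gremlin  SPARK  user  default  RUNNING'
latest_running_app_id <<EOF
$sample
EOF

# Against a live cluster one would run:
#   appId=$(yarn application -list | latest_running_app_id)
#   yarn logs -applicationId "$appId"
```

This avoids a round-trip through the Resource Manager UI when you only need the `applicationId` and the aggregated logs.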