[
https://issues.apache.org/jira/browse/TINKERPOP-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205114#comment-16205114
]
ASF GitHub Bot commented on TINKERPOP-1786:
-------------------------------------------
Github user pluradj commented on a diff in the pull request:
https://github.com/apache/tinkerpop/pull/721#discussion_r144655107
--- Diff: docs/src/recipes/olap-spark-yarn.asciidoc ---
@@ -0,0 +1,153 @@
+////
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+////
+[[olap-spark-yarn]]
+OLAP traversals with Spark on Yarn
+----------------------------------
+
+TinkerPop's combination of
+http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]
+and
+http://tinkerpop.apache.org/docs/current/reference/#_properties_files[HadoopGraph] allows for running
+distributed, analytical graph queries (OLAP) on a computer cluster. The
+http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[reference documentation] covers the cases
+where Spark runs locally or where the cluster is managed by a Spark server. However, many users can only run OLAP jobs
+via the http://hadoop.apache.org/[Hadoop 2.x] Resource Manager (Yarn), which requires `SparkGraphComputer` to be
+configured differently. This recipe describes the required configuration.
+
+Approach
+~~~~~~~~
+
+Most configuration problems of TinkerPop with Spark on Yarn have three causes:
+
+1. `SparkGraphComputer` creates its own `SparkContext`, so it does not pick up any configs from the usual `spark-submit` command.
+2. The TinkerPop Spark plugin did not include Spark on Yarn runtime dependencies until versions 3.2.7/3.3.1.
+3. Resolving cause 2 by adding the cluster's `spark-assembly` jar to the classpath creates a host of version
+conflicts, because Spark 1.x dependency versions have remained frozen since 2014.
+
+The current recipe follows a minimalist approach in which no dependencies are added beyond those
+included in the TinkerPop binary distribution. The Hadoop cluster's Spark installation is completely ignored. This
+approach minimizes the chance of dependency version conflicts.
+
+Prerequisites
+~~~~~~~~~~~~~
+This recipe is suitable for both a real external Hadoop cluster and a local pseudo-cluster. While the recipe is maintained
+for the vanilla Hadoop pseudo-cluster, it has been reported to work on real clusters with Hadoop distributions
+from various vendors.
+
+If you want to try the recipe on a local Hadoop pseudo-cluster, the easiest way to install
+it is to look at the install script at https://github.com/apache/tinkerpop/blob/x.y.z/docker/hadoop/install.sh
+and the `start hadoop` section of https://github.com/apache/tinkerpop/blob/x.y.z/docker/scripts/build.sh.
+
+This recipe assumes that you installed the gremlin console with the
+http://tinkerpop.apache.org/docs/x.y.z/reference/#spark-plugin[spark plugin] (the
+http://tinkerpop.apache.org/docs/x.y.z/reference/#hadoop-plugin[hadoop plugin] is optional). Your Hadoop cluster
+may have been configured to use file compression, e.g. lzo compression. If so, you need to copy the relevant
+jar (e.g. `hadoop-lzo-*.jar`) to the gremlin console's `ext/spark-gremlin/lib` folder.
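+
+As a sketch, copying the jar could look as follows (the source path is hypothetical; adapt it to
+wherever your vendor's distribution keeps the jar):
+
+[source]
+----
+cp /usr/local/lib/hadoop-2.7.2/lib/hadoop-lzo-*.jar $GREMLIN_HOME/ext/spark-gremlin/lib/
+----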
+
+To start the gremlin console in the right environment, create a shell script (e.g. `bin/spark-yarn.sh`) with the
+contents below. Of course, the actual values for `GREMLIN_HOME`, `HADOOP_HOME` and `HADOOP_CONF_DIR` need to be adapted to
+your particular environment.
+
+[source]
+----
+#!/bin/bash
+# Variables to be adapted to the actual environment
+GREMLIN_HOME=/home/yourdir/lib/apache-tinkerpop-gremlin-console-x.y.z-standalone
+export HADOOP_HOME=/usr/local/lib/hadoop-2.7.2
+export HADOOP_CONF_DIR=/usr/local/lib/hadoop-2.7.2/etc/hadoop
+
+# Have TinkerPop find the hadoop cluster configs and hadoop native libraries
+export CLASSPATH=$HADOOP_CONF_DIR
+export JAVA_OPTIONS="-Djava.library.path=$HADOOP_HOME/lib/native:$HADOOP_HOME/lib/native/Linux-amd64-64"
+
+# Start gremlin-console without getting the HADOOP_GREMLIN_LIBS warning
+cd $GREMLIN_HOME
+[ ! -e empty ] && mkdir empty
+export HADOOP_GREMLIN_LIBS=$GREMLIN_HOME/empty
+bin/gremlin.sh
+----
+
+Running the job
+~~~~~~~~~~~~~~~
+
+You can now run a gremlin OLAP query with Spark on Yarn:
+
+[source]
+----
+$ hdfs dfs -put data/tinkerpop-modern.kryo .
+$ . bin/spark-yarn.sh
+----
+
+[gremlin-groovy]
+----
+hadoop = System.getenv('HADOOP_HOME')
+hadoopConfDir = System.getenv('HADOOP_CONF_DIR')
+archive = 'spark-gremlin.zip'
+archivePath = "/tmp/$archive"
+['bash', '-c', "rm $archivePath 2>/dev/null; cd ext/spark-gremlin/lib && zip $archivePath *.jar"].execute()
+conf = new PropertiesConfiguration('conf/hadoop/hadoop-gryo.properties')
+conf.setProperty('spark.master', 'yarn-client')
+conf.setProperty('spark.yarn.dist.archives', "$archivePath")
+conf.setProperty('spark.yarn.appMasterEnv.CLASSPATH', "./$archive/*:$hadoopConfDir")
+conf.setProperty('spark.executor.extraClassPath', "./$archive/*:$hadoopConfDir")
+conf.setProperty('spark.driver.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
+conf.setProperty('spark.executor.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
+conf.setProperty('gremlin.spark.persistContext', 'true')
+graph = GraphFactory.open(conf)
+g = graph.traversal().withComputer(SparkGraphComputer)
+g.V().group().by(values('name')).by(both().count())
+----
+
+If you run into exceptions, the best way to see what is going on is to look into the Yarn Resource Manager UI
+(e.g. http://rm.your.domain:8088/cluster) to find the `applicationId` and get the logs using
--- End diff --
could also use `yarn application -list` for a command line approach
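That suggestion could be sketched as below; the `applicationId` shown is a hypothetical placeholder, and the guard assumes `HADOOP_HOME` is exported as in the recipe's `bin/spark-yarn.sh` script, so the Hadoop `yarn` client (not the JS package manager of the same name) is what gets invoked:

```shell
# Command-line alternative to browsing the Yarn Resource Manager UI.
if [ -n "$HADOOP_HOME" ] && command -v yarn >/dev/null 2>&1; then
  # List applications (all states) to find the applicationId of the Spark job
  yarn application -list -appStates ALL
  # Fetch the aggregated container logs for it (id is a hypothetical placeholder)
  yarn logs -applicationId application_1500000000000_0001
else
  echo "Hadoop yarn client not found; run these commands on a cluster node"
fi
```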
> Recipe and missing manifest items for Spark on Yarn
> ---------------------------------------------------
>
> Key: TINKERPOP-1786
> URL: https://issues.apache.org/jira/browse/TINKERPOP-1786
> Project: TinkerPop
> Issue Type: Improvement
> Components: hadoop
> Affects Versions: 3.3.0, 3.1.8, 3.2.6
> Environment: gremlin-console
> Reporter: Marc de Lignie
> Priority: Minor
> Fix For: 3.2.7, 3.3.1
>
>
> Thorough documentation for running OLAP queries on Spark on Yarn has been
> missing, keeping some users from getting the benefits of this nice feature of
> the TinkerPop stack and resulting in a significant number of questions on the
> gremlin users list.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)