Github user pluradj commented on a diff in the pull request:

    https://github.com/apache/tinkerpop/pull/721#discussion_r144719670

--- Diff: docs/src/recipes/olap-spark-yarn.asciidoc ---
@@ -0,0 +1,153 @@

////
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
////

[[olap-spark-yarn]]
OLAP traversals with Spark on YARN
----------------------------------

TinkerPop's combination of http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]
and http://tinkerpop.apache.org/docs/current/reference/#_properties_files[HadoopGraph] allows for running
distributed, analytical graph queries (OLAP) on a computer cluster. The
http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[reference documentation] covers the cases
where Spark runs locally or where the cluster is managed by a Spark server. However, many users can only run OLAP jobs
via the http://hadoop.apache.org/[Hadoop 2.x] Resource Manager (YARN), which requires `SparkGraphComputer` to be
configured differently. This recipe describes that configuration.

Approach
~~~~~~~~

Most configuration problems of TinkerPop with Spark on YARN stem from three causes:

1. `SparkGraphComputer` creates its own `SparkContext`, so it does not pick up any configuration from the usual `spark-submit` command.
2. The TinkerPop Spark plugin did not include Spark on YARN runtime dependencies until version 3.2.7/3.3.1.
3. Resolving cause 2 by adding the cluster's `spark-assembly` jar to the classpath creates a host of version
conflicts, because Spark 1.x dependency versions have remained frozen since 2014.

This recipe follows a minimalist approach in which no dependencies are added to those already included in the
TinkerPop binary distribution, and the Hadoop cluster's own Spark installation is ignored entirely. This
minimizes the chance of dependency version conflicts.

Prerequisites
~~~~~~~~~~~~~
This recipe is suitable for both a real external Hadoop cluster and a local pseudo-cluster. While the recipe is
maintained against the vanilla Hadoop pseudo-cluster, it has been reported to work on real clusters with Hadoop
distributions from various vendors.

If you want to try the recipe on a local Hadoop pseudo-cluster, the easiest way to install one is to look at the
install script at https://github.com/apache/tinkerpop/blob/x.y.z/docker/hadoop/install.sh
and the `start hadoop` section of https://github.com/apache/tinkerpop/blob/x.y.z/docker/scripts/build.sh.

This recipe assumes that you installed the Gremlin Console with the
http://tinkerpop.apache.org/docs/x.y.z/reference/#spark-plugin[spark plugin] (the
http://tinkerpop.apache.org/docs/x.y.z/reference/#hadoop-plugin[hadoop plugin] is optional). Your Hadoop cluster
may have been configured to use file compression, e.g. LZO compression. If so, you need to copy the relevant
jar (e.g. `hadoop-lzo-*.jar`) to the Gremlin Console's `ext/spark-gremlin/lib` folder.
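For example (a sketch only; the jar's source location and version number are assumptions to be adapted to your
installation):

[source]
----
# Hypothetical jar location and version; adapt both to your environment
cp /usr/local/lib/hadoop-2.7.2/share/hadoop/common/lib/hadoop-lzo-0.4.20.jar \
   /home/yourdir/lib/apache-tinkerpop-gremlin-console-x.y.z-standalone/ext/spark-gremlin/lib/
----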
To start the Gremlin Console in the right environment, create a shell script (e.g. `bin/spark-yarn.sh`) with the
contents below. Of course, the actual values for `GREMLIN_HOME`, `HADOOP_HOME` and `HADOOP_CONF_DIR` need to be
adapted to your particular environment.

[source]
----
#!/bin/bash
# Variables to be adapted to the actual environment
GREMLIN_HOME=/home/yourdir/lib/apache-tinkerpop-gremlin-console-x.y.z-standalone
export HADOOP_HOME=/usr/local/lib/hadoop-2.7.2
export HADOOP_CONF_DIR=/usr/local/lib/hadoop-2.7.2/etc/hadoop

# Have TinkerPop find the hadoop cluster configs and hadoop native libraries
export CLASSPATH=$HADOOP_CONF_DIR
export JAVA_OPTIONS="-Djava.library.path=$HADOOP_HOME/lib/native:$HADOOP_HOME/lib/native/Linux-amd64-64"

# Start gremlin-console without getting the HADOOP_GREMLIN_LIBS warning
cd $GREMLIN_HOME
[ ! -e empty ] && mkdir empty
export HADOOP_GREMLIN_LIBS=$GREMLIN_HOME/empty
bin/gremlin.sh
----

Running the job
~~~~~~~~~~~~~~~

You can now run a Gremlin OLAP query with Spark on YARN:

[source]
----
$ hdfs dfs -put data/tinkerpop-modern.kryo .
$ . bin/spark-yarn.sh
----

[gremlin-groovy]
----
hadoop = System.getenv('HADOOP_HOME')
hadoopConfDir = System.getenv('HADOOP_CONF_DIR')
archive = 'spark-gremlin.zip'
archivePath = "/tmp/$archive"
// Zip the Spark plugin's jars so YARN can distribute them to its containers
['bash', '-c', "rm $archivePath 2>/dev/null; cd ext/spark-gremlin/lib && zip $archivePath *.jar"].execute()
conf = new PropertiesConfiguration('conf/hadoop/hadoop-gryo.properties')
conf.setProperty('spark.master', 'yarn-client')
// Ship the archive to the containers and put its jars on their classpaths
conf.setProperty('spark.yarn.dist.archives', "$archivePath")
conf.setProperty('spark.yarn.appMasterEnv.CLASSPATH', "./$archive/*:$hadoopConfDir")
conf.setProperty('spark.executor.extraClassPath', "./$archive/*:$hadoopConfDir")
// Hadoop native libraries for the Spark driver and executors
conf.setProperty('spark.driver.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
conf.setProperty('spark.executor.extraLibraryPath', "$hadoop/lib/native:$hadoop/lib/native/Linux-amd64-64")
// Keep the SparkContext alive between queries
conf.setProperty('gremlin.spark.persistContext', 'true')
graph = GraphFactory.open(conf)
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().group().by(values('name')).by(both().count())
----

If you run into exceptions, the best way to see what is going on is to look into the YARN Resource Manager UI
(e.g. http://rm.your.domain:8088/cluster) to find the `applicationId`, and then get the logs from the command
shell with `yarn logs -applicationId application_1498627870374_0008`.

Explanation
~~~~~~~~~~~

This recipe does not require running the `bin/hadoop/init-tp-spark.sh` script described in the
http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[reference documentation] and is thus also
valid for cluster users without the access permissions to do so.
Rather, it exploits the `spark.yarn.dist.archives` property, which points to an archive with jars on the local file
system that is loaded into the various YARN containers. As a result, the `spark-gremlin.zip` archive becomes
available as a directory named `spark-gremlin.zip` inside each YARN container. The `spark.executor.extraClassPath`
and `spark.yarn.appMasterEnv.CLASSPATH` properties point to the files inside this archive; this is why they contain
the `./spark-gremlin.zip/*` item. Just because a Spark executor has the archive loaded into its container does not
mean it knows how to access the jars inside it.
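For reference, the settings applied in the console session above correspond to the following properties (a sketch
only, using the example paths from this recipe; adapt all paths to your environment if you move them into a
properties file, as suggested in the last section below):

[source]
----
# Sketch only; all paths below are this recipe's example values, not canonical ones
spark.master=yarn-client
spark.yarn.dist.archives=/tmp/spark-gremlin.zip
spark.yarn.appMasterEnv.CLASSPATH=./spark-gremlin.zip/*:/usr/local/lib/hadoop-2.7.2/etc/hadoop
spark.executor.extraClassPath=./spark-gremlin.zip/*:/usr/local/lib/hadoop-2.7.2/etc/hadoop
spark.driver.extraLibraryPath=/usr/local/lib/hadoop-2.7.2/lib/native:/usr/local/lib/hadoop-2.7.2/lib/native/Linux-amd64-64
spark.executor.extraLibraryPath=/usr/local/lib/hadoop-2.7.2/lib/native:/usr/local/lib/hadoop-2.7.2/lib/native/Linux-amd64-64
gremlin.spark.persistContext=true
----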
Also, the `HADOOP_GREMLIN_LIBS` mechanism is not used, because it cannot work for Spark on YARN as implemented
(jars added to the `SparkContext` are not available to the YARN application master).

The `gremlin.spark.persistContext` property is explained in the reference documentation of
http://tinkerpop.apache.org/docs/current/reference/#sparkgraphcomputer[SparkGraphComputer]: it helps in getting
follow-up OLAP queries answered faster, because you skip the overhead of requesting resources from YARN again.

Additional configuration options
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This recipe does most of the graph configuration in the Gremlin Console, so that environment variables can be used
and the chance of configuration mistakes is minimal. Once you have your setup working, it is probably easier to
make a copy of the `conf/hadoop/hadoop-gryo.properties` file and put the property values specific to your
environment there (along the lines of the properties sketch above). This is also the right moment to take a look
at the `spark-defaults.conf` file of your cluster, in particular the settings for the Spark History Server, which
allows you to access the logs of finished jobs via the YARN Resource Manager UI.
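As an illustration, history logging is typically enabled with settings like the following in `spark-defaults.conf`
(a sketch; the HDFS log directory and the history server address are assumptions for your environment):

[source]
----
# Sketch only; the log directory and history server address are assumptions
spark.eventLog.enabled            true
spark.eventLog.dir                hdfs:///user/spark/applicationHistory
spark.history.fs.logDirectory     hdfs:///user/spark/applicationHistory
spark.yarn.historyServer.address  history.your.domain:18080
----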
--- End diff --

link to the Spark documentation where it discusses how to configure the
[Spark History Service](https://spark.apache.org/docs/latest/monitoring.html)

---