GitHub user okram opened a pull request:
https://github.com/apache/incubator-tinkerpop/pull/129
TINKERPOP3-925: Use persisted SparkContext to persist an RDD across Spark
jobs.
https://issues.apache.org/jira/browse/TINKERPOP3-925
This is implemented and it's badass. There are now `PersistedOutputRDD` and
`PersistedInputRDD`, where the RDD's name in the `SparkContext` is the
`outputLocation` (or `inputLocation`, respectively). Tada! Now, if you have a chained
GraphComputer job, you don't need to write the RDD to disk (e.g. HDFS); you can
have the `SparkContext` persist it across jobs. This work naturally extends the
constructs we already have, and thanks go to @RussellSpitzer for implementing
persistent Spark contexts for us. I updated the various docs.
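To make the chaining concrete, here is a minimal sketch in Java of what a two-job chain might look like. The property keys (`gremlin.spark.graphOutputRDD`, `gremlin.spark.graphInputRDD`, `gremlin.spark.persistContext`), the RDD name `myGraphRDD`, and the builder call shapes are assumptions based on the 3.1.0-SNAPSHOT docs linked below, not verbatim from this patch:

```java
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.Configuration;
import org.apache.tinkerpop.gremlin.process.computer.ranking.pagerank.PageRankVertexProgram;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

public class PersistedRDDChainSketch {
    public static void main(final String[] args) throws Exception {
        final Configuration conf = new BaseConfiguration();
        conf.setProperty("gremlin.graph", "org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph");
        conf.setProperty("spark.master", "local[4]");
        // Assumed key: keep the SparkContext alive so the RDD survives across jobs.
        conf.setProperty("gremlin.spark.persistContext", true);
        // Job 1: read the graph from disk, but write the resulting graphRDD to the persisted context.
        conf.setProperty("gremlin.hadoop.graphInputFormat", "org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat");
        conf.setProperty("gremlin.hadoop.inputLocation", "tinkerpop-modern.kryo");
        conf.setProperty("gremlin.spark.graphOutputRDD", "org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD");
        conf.setProperty("gremlin.hadoop.outputLocation", "myGraphRDD"); // the RDD's name in the SparkContext
        final Graph graph = GraphFactory.open(conf);
        graph.compute(SparkGraphComputer.class)
             .program(PageRankVertexProgram.build().create(graph))
             .submit().get();
        // Job 2: read the graphRDD straight back out of the persisted SparkContext -- no HDFS round trip.
        conf.setProperty("gremlin.spark.graphInputRDD", "org.apache.tinkerpop.gremlin.spark.structure.io.PersistedInputRDD");
        conf.setProperty("gremlin.hadoop.inputLocation", "myGraphRDD");
        final Graph chained = GraphFactory.open(conf);
        chained.compute(SparkGraphComputer.class)
               .program(PageRankVertexProgram.build().create(chained))
               .submit().get();
    }
}
```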
* NOTE: I also renamed GraphComputer.config() to .configure() in this push.
That was another ticket that was recently closed, but I decided that configure()
is a better name given the naming convention of the other GraphComputer methods
(see the sketch just below).
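For reference, the rename looks like this at a call site. This is a hypothetical one-liner; the key/value passed to configure() are illustrative, not from the patch:

```java
import org.apache.tinkerpop.gremlin.process.computer.GraphComputer;
import org.apache.tinkerpop.gremlin.structure.Graph;

public final class ConfigureRenameSketch {
    // Before this push the method was named config(); after it, configure().
    public static GraphComputer configureComputer(final Graph graph) {
        return graph.compute().configure("spark.master", "local[4]");
    }
}
```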
I ran `mvn clean install` and the full integration tests, and I built and
published the docs.
http://tinkerpop.incubator.apache.org/docs/3.1.0-SNAPSHOT/#sparkgraphcomputer
VOTE +1.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP3-925
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-tinkerpop/pull/129.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #129
----
commit 82bbc59e676a49ccafe54567735b320df84d60f7
Author: Marko A. Rodriguez <[email protected]>
Date: 2015-10-27T21:03:55Z
added SparkHelper to grab RDDs from the SparkContext.getPersistedRDDs(). A
simple test case proves it works. A more involved test case using
BulkLoaderVertexProgram is needed.
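For context on what such a helper might look like, here is a minimal sketch. The method name findPersistedRDD and the match-by-name logic are hypothetical; Spark's actual accessor is SparkContext.getPersistentRDDs(), and the commit's getPersistedRDDs() spelling presumably refers to a TinkerPop-side helper:

```java
import org.apache.spark.SparkContext;
import org.apache.spark.rdd.RDD;
import scala.collection.JavaConversions;

public final class SparkHelperSketch {
    // Hypothetical lookup of a persisted RDD by name among the RDDs the
    // SparkContext has cached (SparkContext.getPersistentRDDs() is the real API).
    public static RDD<?> findPersistedRDD(final SparkContext sc, final String name) {
        for (final RDD<?> rdd : JavaConversions.asJavaIterable(sc.getPersistentRDDs().values())) {
            if (name.equals(rdd.name()))
                return rdd;
        }
        return null;
    }
}
```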
commit 3d2d6a69086166ebd34ee10ade656c0e61a1ac0c
Author: Marko A. Rodriguez <[email protected]>
Date: 2015-10-27T22:02:42Z
Added a test case that verifies a PageRankVertexProgram-to-BulkLoaderVertexProgram
load into Spark without touching HDFS. Need to do the GraphComputer.config()
ticket to make this all pretty.
commit 16b50052eb27ebb365341dc3b6e90608b07a71e6
Author: Marko A. Rodriguez <[email protected]>
Date: 2015-10-29T23:29:02Z
Merged master.
commit 528ba027a098bc722211767b11b7dc010fb2cba1
Author: Marko A. Rodriguez <[email protected]>
Date: 2015-10-30T17:48:42Z
This is a masterpiece. PersistedXXXRDD is now a Spark RDD class where the
inputLocation (outputLocation) is the name of the RDD. No HDFS is used between
jobs, as the graphRDD is stored in the Spark server using a persisted context.
Added test cases, and renamed GraphComputer.config() to configure() to be
consistent with the naming conventions of GraphComputer methods. Also made it a
default method, as most implementations won't need it and there is no point in
requiring a random `return this`. Updated docs accordingly.
----