GitHub user okram opened a pull request:

    https://github.com/apache/incubator-tinkerpop/pull/129

    TINKERPOP3-925: Use persisted SparkContext to persist an RDD across Spark 
jobs.

    https://issues.apache.org/jira/browse/TINKERPOP3-925
    
    This is implemented and it's bad ass. There are now `PersistedOutputRDD` and 
`PersistedInputRDD`, where the name of the RDD for the `SparkContext` is 
`outputLocation` and `inputLocation`, respectively. Tada! Now, if you have a chained 
GraphComputer job, you don't need to write the RDD to disk (e.g. HDFS); you can 
have `SparkContext` persist it across jobs. This work naturally extends the 
constructs we already have. Thanks to @RussellSpitzer for implementing 
persistent Spark contexts for us. I updated the various docs.
    
    * NOTE: I also renamed GraphComputer.config() to .configure() in this push. 
This came out of another ticket that was recently closed; I decided that 
configure() is a better name given the naming convention of the other 
GraphComputer methods.
    
    I ran `mvn clean install`, full integration tests, and built and published 
the docs.
      
http://tinkerpop.incubator.apache.org/docs/3.1.0-SNAPSHOT/#sparkgraphcomputer
    
    VOTE +1.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP3-925

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-tinkerpop/pull/129.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #129
    
----
commit 82bbc59e676a49ccafe54567735b320df84d60f7
Author: Marko A. Rodriguez <[email protected]>
Date:   2015-10-27T21:03:55Z

    added SparkHelper to grab RDDs from the SparkContext.getPersistedRDDs(). A 
simple test case proves it works. A more involved test case using 
BulkLoaderVertexProgram is needed.

commit 3d2d6a69086166ebd34ee10ade656c0e61a1ac0c
Author: Marko A. Rodriguez <[email protected]>
Date:   2015-10-27T22:02:42Z

    Added a test case that verifies a PageRankVertexProgram to 
BulkLoaderVertexProgram chain loads into Spark without touching HDFS. Still need 
to do the GraphComputer.config() ticket to make this all pretty.

commit 16b50052eb27ebb365341dc3b6e90608b07a71e6
Author: Marko A. Rodriguez <[email protected]>
Date:   2015-10-29T23:29:02Z

    merged master/.

commit 528ba027a098bc722211767b11b7dc010fb2cba1
Author: Marko A. Rodriguez <[email protected]>
Date:   2015-10-30T17:48:42Z

    This is a masterpiece here. PersistedXXXRDD is now a Spark RDD class where 
the inputLocation (outputLocation) is the name of the RDD. No HDFS is used 
between jobs as the graphRDD is stored on the Spark server using a persisted 
context. Added test cases, and renamed GraphComputer.config() to configure() to 
be consistent with the naming conventions of the other GraphComputer methods. 
Also made it a default method, as most implementations won't need it and there 
is no point in requiring a boilerplate `return this`. Updated docs accordingly.

----

