Like Jackson Pollock, I just broke it wide open… In TINKERPOP-1288, we now have the concept of a "NativeInterceptor." This interface is currently tied to SparkGraphComputer, but I think I can generalize it to work for any GraphComputer provider. (Also, we should probably call it VertexProgramInterceptor…)
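To make the bypass idea concrete, here is a minimal sketch in plain Java. The names and signatures below are illustrative stand-ins, not the actual TINKERPOP-1288 interfaces — the point is just the shape: the GraphComputer asks each registered interceptor whether it can handle the submitted computation and, on a hit, skips vertex-program execution entirely.

```java
// Hypothetical sketch of the interceptor idea (names/signatures are
// illustrative, NOT the actual TINKERPOP-1288 API).
import java.util.List;
import java.util.Optional;

public class InterceptorSketch {

    // Stand-in for the provider's native graph handle (e.g. Spark's InputRDD).
    interface GraphSource {
        long vertexCount();
    }

    // Analogue of NativeInterceptor/VertexProgramInterceptor: produces the
    // result directly from the native graph representation.
    interface VertexProgramInterceptor {
        boolean intercepts(String programDescription);
        long apply(GraphSource graph);
    }

    // Analogue of VertexCountInterceptor: answers g.V().count() natively.
    static final VertexProgramInterceptor VERTEX_COUNT = new VertexProgramInterceptor() {
        public boolean intercepts(String d) { return d.equals("g.V().count()"); }
        public long apply(GraphSource g) { return g.vertexCount(); }
    };

    // The GraphComputer's submit path: try interceptors first; otherwise fall
    // back to full BSP vertex-program iteration (elided in this sketch).
    static long submit(String program, GraphSource graph,
                       List<VertexProgramInterceptor> interceptors) {
        Optional<VertexProgramInterceptor> hit = interceptors.stream()
                .filter(i -> i.intercepts(program)).findFirst();
        if (hit.isPresent())
            return hit.get().apply(graph); // bypass: zero BSP iterations
        throw new UnsupportedOperationException("full VertexProgram execution elided");
    }

    public static void main(String[] args) {
        GraphSource graph = () -> 42L; // toy graph with 42 vertices
        System.out.println(submit("g.V().count()", graph, List.of(VERTEX_COUNT)));
    }
}
```

The key design point is that an interceptor never touches VertexProgram machinery at all: it goes straight from the native representation to the result.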
https://github.com/apache/incubator-tinkerpop/blob/7c103c8b0bf218c5eb6ec83ccfe5d416fd671e3d/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/NativeInterceptor.java

A NativeInterceptor bypasses the execution of a VertexProgram and instead does what it needs directly with the Graph and Memory (i.e. the ComputerResult).

https://github.com/apache/incubator-tinkerpop/blob/7c103c8b0bf218c5eb6ec83ccfe5d416fd671e3d/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/SparkGraphComputer.java#L245-L252

As an example, I created VertexCountInterceptor, which simply does inputRDD.count(). Classy.

https://github.com/apache/incubator-tinkerpop/blob/7c103c8b0bf218c5eb6ec83ccfe5d416fd671e3d/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/optimization/interceptors/VertexCountInterceptor.java

Drum roll…

	- Native Spark via SparkContext.newAPIHadoopRDD().count() on Friendster takes 2.6 minutes.
	- Without SparkPartitionAwareStrategy, counting Friendster takes 4.5 minutes.
	- With SparkPartitionAwareStrategy, counting Friendster takes 4.0 minutes.
	- With both SparkPartitionAwareStrategy and SparkInterceptorStrategy, counting Friendster takes 2.4 minutes.

And that, my friends, is how the dishes get done. Rip off shirt, flick off camera, and jump kick.

Marko.

http://markorodriguez.com

On May 3, 2016, at 3:04 PM, Marko Rodriguez <okramma...@gmail.com> wrote:

> Hello,
>
> I was working with Russell Spitzer and Jeremy Hanna today and we noted that
> native Spark takes 2.6 minutes to "g.V().count()" while SparkGraphComputer
> takes 4.5 minutes. It's understandable that SparkGraphComputer will be slower
> for such simple traversals given all the machinery it has in place to support
> arbitrary graph traversals. However, why not make it faster?
>
> ...enter -- GraphComputer Provider-Specific TraversalStrategies.
>
> With the release of TinkerPop 3.2.0, TraversalStrategies.GlobalCache can have
> TraversalStrategies registered that are associated not only with a Graph, but
> also with a GraphComputer. The first such GraphComputer strategy was just
> created in TINKERPOP-1288, called SparkPartitionAwareStrategy
> [https://issues.apache.org/jira/browse/TINKERPOP-1288].
>
> https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/optimization/SparkPartitionAwareStrategy.java
>
> What does it do?
> 	- If there is no message pass, then there is no need to partition the
> RDD across the cluster, as that is a big shuffle and not worth the time and
> space.
>
> How does it work?
> 	- It analyzes the traversal for VertexSteps that move beyond the
> StarVertex (i.e. a message pass). If no such steps exist, then a
> SparkGraphComputer-specific configuration is set to skip partitioning.
>
> You can see how it's registered -- just like Graph-provider strategies.
>
> https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/SparkGraphComputer.java#L85-L87
>
> Is it better?
> 	- Native Spark via SparkContext.newAPIHadoopRDD().count() takes 2.6
> minutes to count Friendster.
> 	- Without SparkPartitionAwareStrategy, counting Friendster takes 4.5
> minutes.
> 	- With SparkPartitionAwareStrategy, counting Friendster takes 4.0
> minutes.
>
> *** Not crazy faster, but it's definitely faster. And given that applying
> strategies to OLAP traversals costs basically nothing (as opposed to OLTP,
> where every microsecond counts), why not save 30 seconds! :)
>
> So this is a simple use case that makes all non-traversal computations more
> efficient. However, we can imagine more useful strategies to write, such as
> using native Spark for counting instead of SparkGraphComputer.
> That is, once
> the InputRDD is loaded, a bypass can be used to simply do "inputRDD.count()"
> and generate the Iterator<Traverser<E>> result, completely skipping the
> semantics and infrastructure of SparkGraphComputer. I still need to think a
> bit on the best model for this, but I already know that TinkerGraphComputer
> and SparkGraphComputer will become blazing fast for such simple operations
> with GraphComputer provider-specific strategies!
>
> Finally, you can see the types of traversals that SparkPartitionAwareStrategy
> applies to in its test case:
>
> https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/test/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/optimization/SparkPartitionAwareStrategyTest.java#L86-L101
>
> Thoughts?
> Marko.
>
> http://markorodriguez.com
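P.S. The message-pass analysis the quoted message describes can be sketched in a few lines of plain Java. This is a hypothetical model, not SparkPartitionAwareStrategy's actual code — the real strategy walks TinkerPop Traversal/VertexStep objects, whereas here steps are toy records — but it shows the decision being made:

```java
// Hypothetical sketch of SparkPartitionAwareStrategy's check (the real
// strategy inspects TinkerPop Traversal/VertexStep objects; steps are
// modeled as simple records here for illustration).
import java.util.List;

public class PartitionCheckSketch {

    // Toy stand-in for a traversal step: 'leavesStarVertex' marks steps that
    // walk to adjacent vertices, i.e. require a message pass in OLAP.
    record Step(String name, boolean leavesStarVertex) {}

    // If no step moves beyond the star vertex, the partitioning shuffle can
    // be skipped -- the essence of the strategy.
    static boolean skipPartitioning(List<Step> traversal) {
        return traversal.stream().noneMatch(Step::leavesStarVertex);
    }

    public static void main(String[] args) {
        // g.V().count(): no adjacency walk, so no partitioning is needed.
        List<Step> count = List.of(
                new Step("GraphStep", false),
                new Step("CountGlobalStep", false));
        // g.V().out().count(): out() leaves the star vertex -> must partition.
        List<Step> outCount = List.of(
                new Step("GraphStep", false),
                new Step("VertexStep(OUT)", true),
                new Step("CountGlobalStep", false));
        System.out.println(skipPartitioning(count));    // true
        System.out.println(skipPartitioning(outCount)); // false
    }
}
```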