Awesome, and here is another way to do the dishes....
Sent from my iPhone

> On May 3, 2016, at 6:25 PM, Marko Rodriguez <okramma...@gmail.com> wrote:
>
> Like Jackson Pollock, I just broke it wide open…
>
> In TINKERPOP-1288, we now have the concept of a "NativeInterceptor." This
> interface is tied to SparkGraphComputer, but I think I can generalize it to
> work for any GraphComputer provider. (Also, I'll probably call it
> VertexProgramInterceptor…)
>
> https://github.com/apache/incubator-tinkerpop/blob/7c103c8b0bf218c5eb6ec83ccfe5d416fd671e3d/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/NativeInterceptor.java
>
> A NativeInterceptor bypasses the execution of a VertexProgram and instead
> does what it needs to do with the Graph and Memory (i.e. the ComputerResult).
>
> https://github.com/apache/incubator-tinkerpop/blob/7c103c8b0bf218c5eb6ec83ccfe5d416fd671e3d/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/SparkGraphComputer.java#L245-L252
>
> As an example, I created VertexCountInterceptor, which simply does
> inputRDD.count(). Classy.
>
> https://github.com/apache/incubator-tinkerpop/blob/7c103c8b0bf218c5eb6ec83ccfe5d416fd671e3d/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/optimization/interceptors/VertexCountInterceptor.java
>
> Drum roll…
>
>   - Native Spark via SparkContext.newAPIHadoopRDD().count() on Friendster takes 2.6 minutes.
>   - Without SparkPartitionAwareStrategy, counting Friendster takes 4.5 minutes.
>   - With SparkPartitionAwareStrategy, counting Friendster takes 4.0 minutes.
>   - With both SparkPartitionAwareStrategy and SparkInterceptorStrategy, counting Friendster takes 2.4 minutes.
>
> And that, my friends, is how the dishes get done.
>
> Rip off shirt, flick off camera, and jump kick,
> Marko.
>
> http://markorodriguez.com
>
>> On May 3, 2016, at 3:04 PM, Marko Rodriguez <okramma...@gmail.com> wrote:
>>
>> Hello,
>>
>> I was working with Russell Spitzer and Jeremy Hanna today, and we noted that
>> native Spark takes 2.6 minutes to "g.V().count()" while SparkGraphComputer
>> takes 4.5 minutes. It's understandable that SparkGraphComputer will be slower
>> for such simple traversals given all the machinery it has in place to
>> support arbitrary graph traversals. However, why not make it faster?
>>
>> ...enter -- GraphComputer provider-specific TraversalStrategies.
>>
>> With the release of TinkerPop 3.2.0, TraversalStrategies.GlobalCache can
>> have TraversalStrategies registered that are associated with not only a
>> Graph, but also a GraphComputer. The first such GraphComputer strategy,
>> SparkPartitionAwareStrategy, was just created in TINKERPOP-1288
>> [https://issues.apache.org/jira/browse/TINKERPOP-1288].
>>
>> https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/optimization/SparkPartitionAwareStrategy.java
>>
>> What does it do?
>>   - If there is no message pass, then there is no need to partition the
>>     RDD across the cluster, as that is a big shuffle and not worth the time
>>     and space.
>> How does it work?
>>   - It analyzes the traversal for VertexSteps that move beyond the
>>     StarVertex (i.e. a message pass). If no such steps exist, then a
>>     SparkGraphComputer-specific configuration is set to skip partitioning.
>>
>> You can see how it's registered -- just like Graph-provider strategies.
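For anyone following along: the registration amounts to a one-time static hook keyed by the GraphComputer class rather than by a Graph class. Here is a rough sketch of its shape, not the committed code -- the exact lines are at the link right below this; the instance() factory and the GraphComputer.class defaults are assumptions on my part:

    import org.apache.tinkerpop.gremlin.process.computer.GraphComputer;
    import org.apache.tinkerpop.gremlin.process.traversal.TraversalStrategies;
    import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
    import org.apache.tinkerpop.gremlin.spark.process.computer.traversal.optimization.SparkPartitionAwareStrategy;

    public final class RegisterSparkStrategiesSketch {
        public static void main(final String[] args) {
            // Take the stock GraphComputer strategies, add the Spark-specific one,
            // and register the result against SparkGraphComputer.class. Any traversal
            // submitted to a SparkGraphComputer then has SparkPartitionAwareStrategy
            // applied automatically -- exactly like Graph-provider strategies.
            TraversalStrategies.GlobalCache.registerStrategies(SparkGraphComputer.class,
                    TraversalStrategies.GlobalCache.getStrategies(GraphComputer.class).clone()
                            .addStrategies(SparkPartitionAwareStrategy.instance()));
        }
    }

In the real class this lives in a static initializer, so it runs once when the class is loaded.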
>> https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/SparkGraphComputer.java#L85-L87
>>
>> Is it better?
>>   - Native Spark via SparkContext.newAPIHadoopRDD().count() takes 2.6
>>     minutes to count Friendster.
>>   - Without SparkPartitionAwareStrategy, counting Friendster takes 4.5
>>     minutes.
>>   - With SparkPartitionAwareStrategy, counting Friendster takes 4.0
>>     minutes.
>>   *** Not crazy faster, but it's definitely faster. And given that
>>   applying strategies to OLAP traversals costs basically nothing (as opposed
>>   to OLTP, where every microsecond counts), why not save 30 seconds! :)
>>
>> So this is a simple use case that makes all non-traversal computations more
>> efficient. However, we can imagine more useful strategies to write, such as
>> using native Spark for counting instead of SparkGraphComputer. That is,
>> once the InputRDD is loaded, a bypass can be used to simply do
>> "inputRDD.count()" and generate the Iterator<Traverser<E>> result, thereby
>> completely skipping all the semantics and infrastructure of
>> SparkGraphComputer. I still need to think a bit on the best model for this,
>> but already I know that TinkerGraphComputer and SparkGraphComputer will
>> become blazing fast for such simple operations with GraphComputer
>> provider-specific strategies!
>>
>> Finally, you can see the types of traversals that
>> SparkPartitionAwareStrategy applies to in its test case:
>>
>> https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/test/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/optimization/SparkPartitionAwareStrategyTest.java#L86-L101
>>
>> Thoughts?
>> Marko.
>>
>> http://markorodriguez.com
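Circling back to the inputRDD.count() bypass idea: stripped of the SparkGraphComputer machinery, the trick is to notice that the traversal is just a count, ask Spark directly, and hand the answer back through Memory. A minimal sketch of that idea -- the class, method signature, and memory key here are hypothetical, not the committed NativeInterceptor/VertexCountInterceptor API (see the links earlier in the thread for the real thing):

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable;
    import org.apache.tinkerpop.gremlin.process.computer.Memory;

    public final class VertexCountBypassSketch {
        // Hypothetical hook: called instead of running the vertex program's BSP
        // iterations once a strategy has detected a pure "g.V().count()" traversal.
        public void apply(final JavaPairRDD<Object, VertexWritable> inputRDD, final Memory memory) {
            final long vertexCount = inputRDD.count();  // one native Spark job, no message passing
            memory.set("vertexCount", vertexCount);     // illustrative key; the committed interceptor
                                                        // hands the count back via Memory/ComputerResult
        }
    }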