Hello, I was working with Russell Spitzer and Jeremy Hanna today and we noted that native Spark takes 2.6 minutes to "g.V().count()" while SparkGraphComputer takes 4.5 minutes. It's understandable that SparkGraphComputer will be slower for such simple traversals given all the machinery it has in place to support arbitrary graph traversals. However, why not make it faster?
...enter -- GraphComputer Provider-Specific TraversalStrategies. With the release of TinkerPop 3.2.0, TraversalStrategies.GlobalCache can have TraversalStrategies registered that are associated not only with a Graph, but also with a GraphComputer. The first such GraphComputer strategy was just created in TINKERPOP-1288 and is called SparkPartitionAwareStrategy.

https://issues.apache.org/jira/browse/TINKERPOP-1288
https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/optimization/SparkPartitionAwareStrategy.java

What does it do?
 - If there is no message pass, then there is no need to partition the RDD across the cluster, as that is a big shuffle and not worth the time and space.

How does it work?
 - It analyzes the traversal for VertexSteps that move beyond the StarVertex (i.e. a message pass). If no such steps exist, then a SparkGraphComputer-specific configuration is set to skip partitioning.

You can see how it's registered -- just like Graph-provider strategies.

https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/SparkGraphComputer.java#L85-L87

Is it better?
 - Native Spark via SparkContext.newAPIHadoopRDD().count() takes 2.6 minutes to count Friendster.
 - Without SparkPartitionAwareStrategy, counting Friendster takes 4.5 minutes.
 - With SparkPartitionAwareStrategy, counting Friendster takes 4.0 minutes.

*** Not crazy faster, but it's definitely faster. And given that applying strategies to OLAP traversals costs basically nothing (as opposed to OLTP, where every microsecond counts), why not save 30 seconds! :)

So this is a simple use case that makes all non-traversal computations more efficient. However, we can imagine more useful strategies to write, such as -- using native Spark for counting instead of SparkGraphComputer.
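To make the "How does it work?" analysis concrete, here is a minimal, self-contained sketch of the idea. It models a traversal as a list of step names rather than real Step objects, and the step labels, the PartitionAwareSketch class, the configure method, and the "skip.partitioner" key are all illustrative assumptions -- the actual strategy inspects VertexStep instances in the compiled traversal and sets a SparkGraphComputer-specific configuration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of SparkPartitionAwareStrategy's analysis.
// A traversal is modeled as a list of step names; any step that reaches
// beyond the local star vertex counts as a message pass that would require
// the RDD to be partitioned across the cluster.
public class PartitionAwareSketch {

    // Step names that move beyond the star vertex (an assumption for
    // illustration; the real strategy looks for VertexStep instances).
    private static final List<String> MESSAGE_PASS_STEPS =
            List.of("out", "in", "both", "outE", "inE", "bothE");

    // Build the (hypothetical) configuration: if no message pass is
    // detected, flag that the big partitioning shuffle can be skipped.
    public static Map<String, Object> configure(List<String> steps) {
        boolean messagePass = steps.stream().anyMatch(MESSAGE_PASS_STEPS::contains);
        Map<String, Object> conf = new HashMap<>();
        conf.put("skip.partitioner", !messagePass);
        return conf;
    }

    public static void main(String[] args) {
        // g.V().count() -- no message pass, so partitioning can be skipped
        System.out.println(configure(List.of("V", "count")));
        // g.V().out().count() -- a message pass, so partitioning is required
        System.out.println(configure(List.of("V", "out", "count")));
    }
}
```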
That is, once the InputRDD is loaded, a bypass can be used to simply call "inputRDD.count()" and generate the Iterator<Traverser<E>> result, completely skipping all the semantics and infrastructure of SparkGraphComputer. I still need to think a bit about the best model for this, but I already know that TinkerGraphComputer and SparkGraphComputer will become blazing fast for such simple operations with GraphComputer provider-specific strategies!

Finally, you can see the types of traversals that SparkPartitionAwareStrategy applies to in its test case:

https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/test/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/optimization/SparkPartitionAwareStrategyTest.java#L86-L101

Thoughts?,
Marko.

http://markorodriguez.com
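P.S. A rough, self-contained sketch of what the count-bypass idea might look like. Everything here is hypothetical -- the CountBypassSketch class, the step-name model of a traversal, and the LongSupplier standing in for inputRDD.count() are illustrative assumptions, not proposed API; the real design would hook into the strategy machinery described above.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical sketch of the proposed bypass: when the traversal is exactly
// g.V().count(), skip the graph computer entirely and delegate to a native
// count on the loaded input (standing in for inputRDD.count()).
public class CountBypassSketch {

    // Decide whether the traversal (modeled as step names) is a pure vertex
    // count that needs no vertex program, no message passing, no reduction.
    public static boolean isPureVertexCount(List<String> steps) {
        return steps.equals(List.of("V", "count"));
    }

    // Produce the traversal result: the bypass path returns a single result
    // holding the native count; anything else would fall through to the full
    // SparkGraphComputer execution (not modeled in this sketch).
    public static Iterator<Long> execute(List<String> steps, LongSupplier nativeCount) {
        if (isPureVertexCount(steps))
            return List.of(nativeCount.getAsLong()).iterator();
        throw new UnsupportedOperationException("full OLAP execution not modeled");
    }

    public static void main(String[] args) {
        // The supplier stands in for inputRDD.count() on the loaded graph.
        System.out.println(execute(List.of("V", "count"), () -> 1000L).next());
    }
}
```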