Hello,

I was working with Russell Spitzer and Jeremy Hanna today and we noted that 
native Spark takes 2.6 minutes to "g.V().count()" while SparkGraphComputer 
takes 4.5 minutes. It's understandable that SparkGraphComputer will be slower 
for such simple traversals given all the machinery it has in place to support 
arbitrary graph traversals. However, why not make it faster?

...enter -- GraphComputer Provider-Specific TraversalStrategies.

With the release of TinkerPop 3.2.0, TraversalStrategies.GlobalCache can have 
TraversalStrategies registered that are associated with not only a Graph, but 
also a GraphComputer. The first such GraphComputer strategy was just created in 
TINKERPOP-1288 called SparkPartitionAwareStrategy 
[https://issues.apache.org/jira/browse/TINKERPOP-1288].

        
https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/optimization/SparkPartitionAwareStrategy.java

What does it do?
        - If there is no message pass, then there is no need to partition the 
RDD across the cluster as that is a big shuffle and not worth the time and 
space.
How does it work?
        - It analyzes the traversal for VertexSteps that move beyond the 
StarVertex (i.e. a message pass). If no such steps exist, then a 
SparkGraphComputer-specific configuration is set to skip partitioning.
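The analysis above can be sketched in a few lines of plain Java. This is a simplified model of the idea, not the real TinkerPop API -- the class and method names below are illustrative stand-ins (the real strategy inspects VertexStep instances in the compiled traversal rather than step labels):

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the SparkPartitionAwareStrategy analysis: scan the traversal
// for a step that leaves the StarVertex (i.e. requires a message pass).
// If none exists, SparkGraphComputer can be told to skip partitioning.
public class PartitionAwareSketch {

    // Hypothetical stand-in for a compiled traversal: just its step labels.
    static boolean needsMessagePass(List<String> steps) {
        // out()/in()/both() traverse beyond the local star graph, so they
        // require vertices to exchange messages across partitions.
        return steps.stream().anyMatch(s ->
                s.equals("out") || s.equals("in") || s.equals("both"));
    }

    public static void main(String[] args) {
        List<String> countTraversal = Arrays.asList("V", "count");   // g.V().count()
        List<String> neighborTraversal = Arrays.asList("V", "out");  // g.V().out()

        // g.V().count() never leaves the star vertex: skip the shuffle.
        System.out.println(needsMessagePass(countTraversal));    // false
        System.out.println(needsMessagePass(neighborTraversal)); // true
    }
}
```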

You can see how it's registered -- just like Graph-provider strategies.
        
https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/SparkGraphComputer.java#L85-L87
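In spirit, the registration is a class-keyed cache lookup. Here is a toy model of that mechanism (the names are illustrative, not the actual TraversalStrategies.GlobalCache API): strategies are registered once against a provider class -- now a GraphComputer class, not only a Graph class -- and fetched when a traversal is compiled for that computer.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of a class-keyed strategy cache, in the spirit of
// TraversalStrategies.GlobalCache. Names here are stand-ins.
public class GlobalCacheSketch {

    interface TraversalStrategy { String name(); }

    static class SparkGraphComputer { }  // stand-in for the real class

    // Cache keyed by the provider class (e.g. SparkGraphComputer.class).
    static final Map<Class<?>, TraversalStrategy> CACHE = new HashMap<>();

    static void registerStrategy(Class<?> providerClass, TraversalStrategy s) {
        CACHE.put(providerClass, s);
    }

    public static void main(String[] args) {
        // In the real code this happens once, in a static initializer
        // of SparkGraphComputer.
        registerStrategy(SparkGraphComputer.class,
                () -> "SparkPartitionAwareStrategy");

        // At traversal-compilation time, the computer's strategies are
        // looked up by its class.
        System.out.println(CACHE.get(SparkGraphComputer.class).name());
    }
}
```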

Is it better?
        - Native Spark via SparkContext.newAPIHadoopRDD().count() takes 2.6 
minutes to count Friendster.
        - Without SparkPartitionAwareStrategy, counting Friendster takes 4.5 
minutes.
        - With SparkPartitionAwareStrategy, counting Friendster takes 4.0 
minutes.
        *** Not crazy faster, but it's definitely faster. And given that 
applying strategies to OLAP traversals costs basically nothing (as opposed to 
OLTP, where every microsecond counts), why not save 30 seconds! :) 

So this is a simple use case that makes all non-traversal computations more 
efficient. However, we can imagine more useful strategies to write, such as 
using native Spark for counting instead of SparkGraphComputer. That is, once 
the InputRDD is loaded, a bypass can be used to simply do "inputRDD.count()" 
and generate the resulting Iterator<Traverser<E>>, completely skipping the 
semantics and infrastructure of SparkGraphComputer. I still need to think a bit 
on the best model for this, but already I know that TinkerGraphComputer and 
SparkGraphComputer will become blazing fast for such simple operations with 
GraphComputer provider-specific strategies!
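A rough sketch of what such a count bypass could look like -- again, the names are hypothetical and a local list stands in for the InputRDD, since the real design is still open:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch of the proposed bypass: when the traversal is exactly
// g.V().count(), skip the vertex-program machinery and count the loaded
// input directly, yielding the single-element Iterator the full
// computation would have produced.
public class CountBypassSketch {

    // Stand-in for the loaded InputRDD: here, just a local list of vertices.
    // In Spark this would be inputRDD.count().
    static Iterator<Long> bypassCount(List<String> inputRdd) {
        long count = inputRdd.size();
        return Collections.singletonList(count).iterator();
    }

    public static void main(String[] args) {
        List<String> vertices = Arrays.asList("v1", "v2", "v3");
        Iterator<Long> result = bypassCount(vertices);
        System.out.println(result.next()); // 3
    }
}
```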

Finally, you can see the types of traversals that SparkPartitionAwareStrategy 
applies to in its test case:
        
https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/test/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/optimization/SparkPartitionAwareStrategyTest.java#L86-L101

Thoughts?
Marko.

http://markorodriguez.com
