Hello, I was working with Russell Spitzer and Jeremy Hanna today and we noted that native Spark takes 2.6 minutes to "g.V().count()" while SparkGraphComputer takes 4.5 minutes. It's understandable that SparkGraphComputer will be slower for such simple traversals given all the machinery it has in place to support arbitrary graph traversals. However, why not make it faster?
...enter -- GraphComputer Provider-Specific TraversalStrategies. With the release of TinkerPop 3.2.0, TraversalStrategies.GlobalCache can have TraversalStrategies registered that are associated not only with a Graph, but also with a GraphComputer. The first such GraphComputer strategy was just created in TINKERPOP-1288 and is called SparkPartitionAwareStrategy.

https://issues.apache.org/jira/browse/TINKERPOP-1288
https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/optimization/SparkPartitionAwareStrategy.java

What does it do?
 - If there is no message pass, then there is no need to partition the RDD across the cluster, as that is a big shuffle and not worth the time and space.

How does it work?
 - It analyzes the traversal for VertexSteps that move beyond the StarVertex (i.e. a message pass). If no such steps exist, then a SparkGraphComputer-specific configuration is set to skip partitioning.

You can see how it's registered -- just like Graph-provider strategies.

https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/SparkGraphComputer.java#L85-L87

Is it better?
 - Native Spark via SparkContext.newAPIHadoopRDD().count() takes 2.6 minutes to count Friendster.
 - Without SparkPartitionAwareStrategy, counting Friendster takes 4.5 minutes.
 - With SparkPartitionAwareStrategy, counting Friendster takes 4.0 minutes.

*** Not crazy faster, but it's definitely faster. And given that applying strategies to OLAP traversals costs basically nothing (as opposed to OLTP, where every microsecond counts), why not save 30 seconds! :)

So this is a simple use case that makes all non-traversal computations more efficient. However, we can imagine more useful strategies to write, such as -- using native Spark for counting instead of SparkGraphComputer.
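To make the "How does it work?" analysis concrete, here is a minimal, self-contained sketch of the idea. It models a traversal as a list of step names rather than real Step objects, and the step labels, the PartitionAwareSketch class, the configure method, and the "skip.partitioner" key are all illustrative assumptions -- the actual strategy inspects VertexStep instances in the compiled traversal and sets a SparkGraphComputer-specific configuration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of SparkPartitionAwareStrategy's analysis.
// A traversal is modeled as a list of step names; any step that reaches
// beyond the local star vertex counts as a message pass that would require
// the RDD to be partitioned across the cluster.
public class PartitionAwareSketch {

    // Step names that move beyond the star vertex (an assumption for
    // illustration; the real strategy looks for VertexStep instances).
    private static final List<String> MESSAGE_PASS_STEPS =
            List.of("out", "in", "both", "outE", "inE", "bothE");

    // Build the (hypothetical) configuration: if no message pass is
    // detected, flag that the big partitioning shuffle can be skipped.
    public static Map<String, Object> configure(List<String> steps) {
        boolean messagePass = steps.stream().anyMatch(MESSAGE_PASS_STEPS::contains);
        Map<String, Object> conf = new HashMap<>();
        conf.put("skip.partitioner", !messagePass);
        return conf;
    }

    public static void main(String[] args) {
        // g.V().count() -- no message pass, so partitioning can be skipped
        System.out.println(configure(List.of("V", "count")));
        // g.V().out().count() -- a message pass, so partitioning is required
        System.out.println(configure(List.of("V", "out", "count")));
    }
}
```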
That is, once the InputRDD is loaded, a bypass can be used to simply call "inputRDD.count()" and generate the Iterator<Traverser<E>> result, completely skipping all the semantics and infrastructure of SparkGraphComputer. I still need to think a bit about the best model for this, but I already know that TinkerGraphComputer and SparkGraphComputer will become blazing fast for such simple operations with GraphComputer provider-specific strategies!

Finally, you can see the types of traversals that SparkPartitionAwareStrategy applies to in its test case:

https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/test/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/optimization/SparkPartitionAwareStrategyTest.java#L86-L101

Thoughts?,
Marko.

http://markorodriguez.com
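P.S. A rough, self-contained sketch of what the count-bypass idea might look like. Everything here is hypothetical -- the CountBypassSketch class, the step-name model of a traversal, and the LongSupplier standing in for inputRDD.count() are illustrative assumptions, not proposed API; the real design would hook into the strategy machinery described above.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical sketch of the proposed bypass: when the traversal is exactly
// g.V().count(), skip the graph computer entirely and delegate to a native
// count on the loaded input (standing in for inputRDD.count()).
public class CountBypassSketch {

    // Decide whether the traversal (modeled as step names) is a pure vertex
    // count that needs no vertex program, no message passing, no reduction.
    public static boolean isPureVertexCount(List<String> steps) {
        return steps.equals(List.of("V", "count"));
    }

    // Produce the traversal result: the bypass path returns a single result
    // holding the native count; anything else would fall through to the full
    // SparkGraphComputer execution (not modeled in this sketch).
    public static Iterator<Long> execute(List<String> steps, LongSupplier nativeCount) {
        if (isPureVertexCount(steps))
            return List.of(nativeCount.getAsLong()).iterator();
        throw new UnsupportedOperationException("full OLAP execution not modeled");
    }

    public static void main(String[] args) {
        // The supplier stands in for inputRDD.count() on the loaded graph.
        System.out.println(execute(List.of("V", "count"), () -> 1000L).next());
    }
}
```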