Awesome, 
And here is another way to do the dishes....


Sent from my iPhone

> On May 3, 2016, at 6:25 PM, Marko Rodriguez <okramma...@gmail.com> wrote:
> 
> Like Jackson Pollock, I just broke it wide open…..
> 
> In TINKERPOP-1288, we now have the concept of a "NativeInterceptor." This 
> interface is tied to SparkGraphComputer, but I think I can generalize it to 
> work for any GraphComputer provider. (Also, probably call it 
> VertexProgramInterceptor…)
> 
>       
> https://github.com/apache/incubator-tinkerpop/blob/7c103c8b0bf218c5eb6ec83ccfe5d416fd671e3d/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/NativeInterceptor.java
> 
> A NativeInterceptor bypasses the execution of a VertexProgram and instead 
> computes what it needs directly from the Graph and Memory (i.e. the ComputerResult).
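> 
> For concreteness, here is a rough sketch of the shape such an interceptor 
> takes (the method name, generics, and signature are illustrative, not the 
> committed API -- the links above and below point at the real code):
> 
>     import org.apache.spark.api.java.JavaPairRDD;
>     import org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable;
>     import org.apache.tinkerpop.gremlin.process.computer.Memory;
> 
>     // Sketch: given the already-loaded graph RDD and the computer's Memory,
>     // produce the result directly instead of iterating a VertexProgram.
>     public interface NativeInterceptor {
>         void intercept(final JavaPairRDD<Object, VertexWritable> graphRDD, final Memory memory);
>     }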
> 
>       
> https://github.com/apache/incubator-tinkerpop/blob/7c103c8b0bf218c5eb6ec83ccfe5d416fd671e3d/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/SparkGraphComputer.java#L245-L252
> 
> As an example, I created VertexCountInterceptor, which does inputRDD.count(). 
> Classy.
> 
>       
> https://github.com/apache/incubator-tinkerpop/blob/7c103c8b0bf218c5eb6ec83ccfe5d416fd671e3d/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/optimization/interceptors/VertexCountInterceptor.java
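> 
> In spirit, that interceptor boils down to something like the following (a 
> minimal sketch implementing the interface sketched above -- the Memory key 
> and exact plumbing are assumptions; the linked class is the real code):
> 
>     import org.apache.spark.api.java.JavaPairRDD;
>     import org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable;
>     import org.apache.tinkerpop.gremlin.process.computer.Memory;
> 
>     // Sketch: skip the VertexProgram entirely and let Spark count the loaded vertices.
>     public class VertexCountInterceptorSketch implements NativeInterceptor {
>         @Override
>         public void intercept(final JavaPairRDD<Object, VertexWritable> graphRDD, final Memory memory) {
>             final long vertexCount = graphRDD.count();   // one native Spark job, no message passing
>             memory.set("count", vertexCount);            // "count" is an illustrative key, not the real one
>         }
>     }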
> 
> Drum roll…..
> 
>       - Native Spark via SparkContext.newAPIHadoopRDD(...).count() on Friendster takes 2.6 minutes.
>       - Without SparkPartitionAwareStrategy, counting Friendster takes 4.5 minutes.
>       - With SparkPartitionAwareStrategy, counting Friendster takes 4.0 minutes.
>       - With both SparkPartitionAwareStrategy and SparkInterceptorStrategy, counting Friendster takes 2.4 minutes.
> 
> And that, my friends, is how the dishes get done.
> 
> Rip off shirt, flick off camera, and jump kick,
> Marko.
> 
> http://markorodriguez.com
> 
>> On May 3, 2016, at 3:04 PM, Marko Rodriguez <okramma...@gmail.com> wrote:
>> 
>> Hello,
>> 
>> I was working with Russell Spitzer and Jeremy Hanna today and we noted that 
>> native Spark takes 2.6 minutes to "g.V().count()" while SparkGraphComputer 
>> takes 4.5 minutes. It's understandable that SparkGraphComputer will be slower 
>> for such simple traversals given all the machinery it has in place to 
>> support arbitrary graph traversals. However, why not make it faster?
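>> 
>> To make the comparison concrete, the two measurements roughly correspond to 
>> the following (a sketch only -- the input format, paths, and configuration 
>> here are assumptions, not the actual benchmark harness):
>> 
>>     import org.apache.hadoop.conf.Configuration;
>>     import org.apache.hadoop.io.NullWritable;
>>     import org.apache.spark.SparkConf;
>>     import org.apache.spark.api.java.JavaSparkContext;
>>     import org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable;
>>     import org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat;
>>     import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
>>     import org.apache.tinkerpop.gremlin.structure.Graph;
>>     import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;
>> 
>>     public class CountComparisonSketch {
>>         public static void main(final String[] args) {
>>             // (1) Native Spark: read the vertex splits and count them (the 2.6 minute number).
>>             final Configuration hadoopConf = new Configuration();
>>             hadoopConf.set("mapreduce.input.fileinputformat.inputdir", args[0]);   // graph data location
>>             final JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("native-count"));   // master supplied via spark-submit
>>             final long nativeCount = sc.newAPIHadoopRDD(hadoopConf,
>>                     GryoInputFormat.class, NullWritable.class, VertexWritable.class).count();
>>             sc.stop();   // release the context before SparkGraphComputer creates its own
>> 
>>             // (2) SparkGraphComputer: the same count through the full OLAP machinery (the 4.5 minute number).
>>             final Graph graph = GraphFactory.open(args[1]);   // e.g. a hadoop-gryo.properties file
>>             final long olapCount = graph.traversal().withComputer(SparkGraphComputer.class).V().count().next();
>>             System.out.println(nativeCount + " vs. " + olapCount);
>>         }
>>     }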
>> 
>> ...enter -- GraphComputer Provider-Specific TraversalStrategies.
>> 
>> With the release of TinkerPop 3.2.0, TraversalStrategies.GlobalCache can 
>> have TraversalStrategies registered that are associated not only with a 
>> Graph, but also with a GraphComputer. The first such GraphComputer strategy, 
>> SparkPartitionAwareStrategy, was just created in TINKERPOP-1288 
>> [https://issues.apache.org/jira/browse/TINKERPOP-1288].
>> 
>>      
>> https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/optimization/SparkPartitionAwareStrategy.java
>> 
>> What does it do?
>>      - If there is no message pass, then there is no need to partition the 
>> RDD across the cluster as that is a big shuffle and not worth the time and 
>> space.
>> How does it work?
>>      - It analyzes the traversal for VertexSteps that move beyond the 
>> StarVertex (i.e. a message pass). If no such steps exist, then a 
>> SparkGraphComputer-specific configuration is set to skip partitioning.
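>> 
>> In other words, the core check is roughly the following (a condensed sketch; 
>> the linked class is the real implementation and also handles setting the 
>> configuration):
>> 
>>     import org.apache.tinkerpop.gremlin.process.traversal.Traversal;
>>     import org.apache.tinkerpop.gremlin.process.traversal.step.map.VertexStep;
>>     import org.apache.tinkerpop.gremlin.process.traversal.util.TraversalHelper;
>> 
>>     public final class PartitionCheckSketch {
>>         // true if some VertexStep hops to an adjacent vertex, i.e. the traversal
>>         // leaves the StarVertex and therefore requires a message pass (and a partition).
>>         public static boolean requiresMessagePass(final Traversal.Admin<?, ?> traversal) {
>>             return TraversalHelper.getStepsOfAssignableClassRecursively(VertexStep.class, traversal)
>>                     .stream()
>>                     .anyMatch(VertexStep::returnsVertex);
>>         }
>>     }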
>> 
>> You can see how it's registered -- just like Graph-provider strategies.
>>      
>> https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/SparkGraphComputer.java#L85-L87
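>> 
>> Roughly, the registration looks like this (a sketch of the static 
>> initializer; the helper names follow the usual strategy conventions and the 
>> exact lines are at the link above):
>> 
>>     import org.apache.tinkerpop.gremlin.process.computer.GraphComputer;
>>     import org.apache.tinkerpop.gremlin.process.traversal.TraversalStrategies;
>>     import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
>>     import org.apache.tinkerpop.gremlin.spark.process.computer.traversal.optimization.SparkPartitionAwareStrategy;
>> 
>>     public final class RegistrationSketch {
>>         static {
>>             // associate the strategy with the GraphComputer class rather than a Graph class
>>             TraversalStrategies.GlobalCache.registerStrategies(SparkGraphComputer.class,
>>                     TraversalStrategies.GlobalCache.getStrategies(GraphComputer.class).clone()
>>                             .addStrategies(SparkPartitionAwareStrategy.instance()));
>>         }
>>     }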
>> 
>> Is it better?
>>      - Native Spark via SparkContext.newAPIHadoopRDD(...).count() takes 2.6 minutes to count Friendster.
>>      - Without SparkPartitionAwareStrategy, counting Friendster takes 4.5 minutes.
>>      - With SparkPartitionAwareStrategy, counting Friendster takes 4.0 minutes.
>>      *** Not crazy faster, but it's definitely faster. And given that applying strategies to OLAP traversals costs basically nothing (as opposed to OLTP, where every microsecond counts), why not save 30 seconds! :)
>> 
>> So this is a simple use case that makes all non-traversal computations more 
>> efficient. However, we can imagine writing more useful strategies, such as 
>> using native Spark for counting instead of SparkGraphComputer. That is, 
>> once the InputRDD is loaded, a bypass can be used to simply do 
>> "inputRDD.count()" and generate the Iterator<Traverser<E>> directly, 
>> completely skipping the semantics and infrastructure of 
>> SparkGraphComputer. I still need to think a bit about the best model for this, 
>> but I already know that TinkerGraphComputer and SparkGraphComputer will 
>> become blazing fast for such simple operations with GraphComputer 
>> provider-specific strategies!
>> 
>> Finally, you can see the types of traversals that 
>> SparkPartitionAwareStrategy applies to in its test case:
>>      
>> https://github.com/apache/incubator-tinkerpop/blob/TINKERPOP-1288/spark-gremlin/src/test/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/optimization/SparkPartitionAwareStrategyTest.java#L86-L101
>> 
>> Thoughts?
>> Marko.
>> 
>> http://markorodriguez.com
> 
