Re: GraphActors as a new distributed computing framework in TinkerPop

HadoopMarc Wed, 14 Dec 2016 12:10:44 -0800

Hi Marko,

This is pretty neat indeed! It will enhance the productivity of gremlin 
query writers when queries can be copied from the fora without worries 
about the kind of backend used and whether it is an analytical (OLAP) or a 
transactional query (OLTP). I have only one idea: do traversal API users 
still really have to know whether they use a GraphComputer or GraphActors? 
In other words, can the withEngine options not just be some illuminating 
token constants for users that just want to have the traversal() returned 
(LOCAL, LOCAL_DISTRIBUTED, DISTRIBUTED)?  Of course, the more extended API 
will be useful for a minority of power users that want to optimize an 
ActorProgram for a specific use case.


Cheers,    Marc

Op woensdag 14 december 2016 18:46:44 UTC+1 schreef Marko A. Rodriguez:
>
> Hello,
>
> For the last week I’ve been working on “distributed OLTP.” Gremlin has a 
> really nice architecture in that a traverser can be shipped around a 
> cluster and reattached to its respective element (vertex/edge/etc.) and 
> step (traversal) at the remote location and continue to compute. Thus, we 
> can have step-by-step query routing.
>
> https://issues.apache.org/jira/browse/TINKERPOP-1564
>
> With that, I’ve created GraphActors which is similar to GraphComputer. 
> However, there are some fundamental distinctions:
>
> 1. GraphActors assumes the boundary of computation is a Partition.
> - GraphComputer assumes the boundary of computation is vertex and its 
> incident edges and properties.
> 2. GraphActors assumes asynchronous computation with barriers at Barrier 
> steps.
> - GraphComputer assumes (sorta) synchronous computation with a barrier 
> when all traversers have left their local vertex.
> 3. GraphActors is traverser-centric and partition-bound.
> - GraphComputer is vertex-centric and vertex-bound.
>
> In gremlin-core/ I’ve created a new set of interfaces off of process/.
>
>
> https://github.com/apache/tinkerpop/tree/TINKERPOP-1564/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/actor
> GraphActors <=> GraphComputer
> MasterActor <=> setup()/terminate().
> WorkerActor <=> execute()
> ActorProgram <=> VertexProgram
>
> The parallel between GraphComputer and GraphActors are strong. In short, a 
> (hardcore) user can create an ActorProgram and submit it to a GraphActors. 
> The ActorProgram will effect a distributed, asynchronous, partition-bound 
> message passing algorithm and return a Future<Result>. There is one 
> ActorProgram in particular the executes a Gremlin traversal.
>
>
> https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/actor/traversal/TraversalActorProgram.java
>   master actor program: 
> https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/actor/traversal/TraversalMasterProgram.java
>   worker actors program: 
> https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/actor/traversal/TraversalWorkerProgram.java
>
> Pretty simple. Besides some problems I’m having with serialization stuff 
> in GroupStep, the ProcessSuite passes.
>
> Now, its up to a provider to implement the GraphActors interfaces. Welcome 
> akka-gremlin/.
>
>
> https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/akka-gremlin/src/main/java/org/apache/tinkerpop/gremlin/akka/process/actor/AkkaGraphActors.java
>   master actor: 
> https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/akka-gremlin/src/main/java/org/apache/tinkerpop/gremlin/akka/process/actor/MasterActor.java
>   worker actor: 
> https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/akka-gremlin/src/main/java/org/apache/tinkerpop/gremlin/akka/process/actor/WorkerActor.java
> mailbox system: 
> https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/akka-gremlin/src/main/java/org/apache/tinkerpop/gremlin/akka/process/actor/ActorMailbox.java
>
> Dead simple!
>
> What we should have done with TinkerPop from the start is include the 
> notion of a Partition. For this branch, I’ve added two concepts Partition 
> and Partitioner.
>
>
> https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/structure/Partitioner.java
>
> https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/structure/Partition.java
>
> Why is this cool? This is cool because if GraphActors system knows the 
> partitions of the underlying Graph data, then it can immediately process 
> the Graph data in a distributed manner. No need to write a custom 
> “InputFormat.” We should have done this from the start because then 
> GraphComputer could do the same. For instance, spark-gremlin/ can run over 
> TinkerGraph as it doesn’t care about TinkerGraph, it cares about Partition 
> “input splits.” By adding this layer of information, ANY Graph can work 
> with ANY GraphComputer or GraphActors. I have yet to create 
> PartitionInputRDD and PartitionInputFormat, but that will be next and at 
> that point, GraphComputers are agnostic to the underlying implementation.
>
> So there you have it. Your thoughts on the matter would be most 
> appreciated.
>
> Things still left to do:
>
> * We need a concept of a traversal engine. That is, something like:
> - g.withEngine(SparkGraphComputer.class) // it knows its a GraphComputer 
> so thats the engine.
> - g.withEngine(AkkaGraphActors.class) // it knows its a GraphActors so 
> thats the engine.
> - g.withEngine(Iterator.class) // this means, just iterate the traversal 
> locally :).
> * GraphComputer semantics are a restricted version of GraphActors 
> semantics.
> - GraphActors becomes GraphComputer when the Partitions are defined by 
> vertices.
> - I think I can unify the two and thus, we could have SparkGraphActors.
>
>
>
> Here is some fun playing:
>
>
>          \,,,/
>          (o o)
> -----oOOo-(3)-oOOo-----
> plugin activated: tinkerpop.server
> plugin activated: tinkerpop.utilities
> plugin activated: tinkerpop.tinkergraph
> gremlin> :install org.apache.tinkerpop akka-gremlin 3.3.0-SNAPSHOT
> ==>Loaded: [org.apache.tinkerpop, akka-gremlin, 3.3.0-SNAPSHOT]
> gremlin> graph = TinkerFactory.createModern()
> ==>tinkergraph[vertices:6 edges:6]
> gremlin> graph.partitioner()
> ==>partitioner[globalpartitioner:1]
> gremlin> partitioner = new HashPartitioner(graph.partitioner(),3)          
> *// 
> lets create 3 logical partitions over TinkerGraph*
> ==>partitioner[hashpartitioner:3]
> gremlin> g = graph.traversal().withStrategies(new 
> ActorProgramStrategy(AkkaGraphActors,partitioner)) *// in the future 
> withEngine() will be used*
> ==>graphtraversalsource[tinkergraph[vertices:6 edges:6], actors]
> gremlin> g.V().repeat(out()).times(2).values('name')
> ==>lop
> ==>ripple
> gremlin> g.V().repeat(both()).times(2).groupCount().by(out().in().count()) 
>  *// beyond the star graph!*
> ==>[0:13,3:3,4:7,5:7]
> gremlin> g.V().match(                                   *// distributed 
> pattern matching*
> ......1>  __.as('a').out('created').as('b'),
> ......2>  __.as('b').has('name', 'lop'),
> ......3>  __.as('b').in('created').as('c'),
> ......4>  __.as('c').has('age', 29)).
> ......5>   select('a','c').by('name')
> ==>[a:marko,c:marko]
> ==>[a:josh,c:marko]
> ==>[a:peter,c:marko]
> gremlin>
>
>
> Now imagine this executing over various providers:
>
> 1. *A sharded graph database* (Titan/DSEGraph/OrientDB): the traversers 
> move between machines so they are always data local processing.
> 2. *A replicated graph database* (Neo4j): logical partitions are created 
> so that each machine is responsible for a subgraph of the full graph (load 
> balancing and parallization).
> 3. *A single machine graph database* (TinkerGraph): logical partitions 
> are created so that each core of the machine is responsible for a subgraph 
> (parallization).
>
> Pretty neat, eh?
>
> Ideas are more than welcome,
> Marko.
>
> http://markorodriguez.com
>
>
>
>

Re: GraphActors as a new distributed computing framework in TinkerPop

Reply via email to