[ 
https://issues.apache.org/jira/browse/TINKERPOP-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725548#comment-15725548
 ] 

Marko A. Rodriguez commented on TINKERPOP-1564:
-----------------------------------------------

Here are some snippets from some emails with a colleague around the 
"motivations for distributed OLTP."

---

1. Subgraph computing vs. vertex-centric computing. 
  * There is no need to pass messages if you are not going to leave the 
partition/machine.
  * We have lots of joins() in Gremlin OLAP that aren’t needed if the boundary 
were the partition, not the vertex.

2. Remove the local star graph constraint.
  * We currently have “two ways” of doing things in Gremlin. 
  * Kuppitz’s reply to users will typically be of the form “in OLTP your query 
is” … “in OLAP your query is.”

3. Bulk Synchronous Processing vs. Asynchronous Processing.
  * Currently every iteration in GraphComputer is constrained by the slowest 
vertex. 
  * With Gremlin, we know when to barrier (Barrier steps) so we have the luxury 
of ASP semantics.
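As a toy cost model (not TinkerPop code — the worker timings are invented), the difference is that a BSP superstep is gated by the slowest participant, whereas under ASP only explicit Barrier steps impose that wait:

```java
import java.util.List;

public class BspCostSketch {

    // BSP: every superstep waits for the slowest worker before messages flow.
    static long bspTotalMillis(List<List<Long>> superstepWorkerMillis) {
        return superstepWorkerMillis.stream()
                .mapToLong(step -> step.stream().mapToLong(Long::longValue).max().orElse(0L))
                .sum();
    }

    // ASP: workers proceed independently; only the final barrier gates
    // completion, so cost is the slowest worker's own total, not a per-step max.
    static long aspTotalMillis(List<List<Long>> superstepWorkerMillis) {
        int workers = superstepWorkerMillis.get(0).size();
        long worst = 0L;
        for (int w = 0; w < workers; w++) {
            long total = 0L;
            for (List<Long> step : superstepWorkerMillis) total += step.get(w);
            worst = Math.max(worst, total);
        }
        return worst;
    }

    public static void main(String[] args) {
        // Two iterations, three workers; the straggler alternates between workers.
        List<List<Long>> steps = List.of(List.of(10L, 100L, 10L), List.of(100L, 10L, 10L));
        System.out.println(bspTotalMillis(steps)); // 200: gated twice by a straggler
        System.out.println(aspTotalMillis(steps)); // 110: each worker runs its own path
    }
}
```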

4. Migrating the entire graph into an RDD/SequenceFile when you might not even 
need the data.
  * If we think of an “OLTP continuum” (not just {{g.V().out().count()}}), you 
will only touch a subset of your graph.
  * GraphFilters helps to alleviate this in GraphComputer, but it’s not “what 
exact vertex, edge, property do you need?” It’s always, “what type of vertices, 
edges, properties do you need?”
  * No more serialization issues.

5. Query routing.
  * We currently don’t have query routing capabilities. The proposal provides 
step-by-step query routing. Always “data local” processing.
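As a minimal sketch of “data local” routing — where the placement function (a modulo hash here) is an assumption for illustration, not the proposal’s actual algorithm — a PartitionStep-like router would group outgoing traversers by the machine that owns their target vertex:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class RoutingSketch {

    // Hypothetical placement function: a real PartitionStep would ask the graph
    // provider where each vertex's data lives; a hash ring is just one option.
    static int partitionFor(long vertexId, int machineCount) {
        return Math.floorMod(Long.hashCode(vertexId), machineCount);
    }

    // Group outgoing "traversers" (here: bare vertex ids) by destination machine
    // so each batch can be sent to the sibling traversal holding that partition.
    static Map<Integer, List<Long>> route(List<Long> adjacentIds, int machineCount) {
        return adjacentIds.stream()
                .collect(Collectors.groupingBy(id -> partitionFor(id, machineCount)));
    }

    public static void main(String[] args) {
        // Suppose v[1].out('knows') produced vertices 2, 5, 6 on a 4-machine cluster.
        System.out.println(route(List.of(2L, 5L, 6L), 4));
    }
}
```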

6. Leverage vendor optimizations.
  * OLAP is all about StarGraph. We don’t have vertex-centric indices, sort 
orders, etc.
  * The same vendor strategies are leveraged in distributed OLTP.
  * At the limit, no more “{{if(TraversalHelper.onGraphComputer()) do one thing 
else do another}}” (when designing strategies).

7. Memory issues due to eager evaluation.
  * If you can’t bulk all your messages, then you can have massive message 
pools.
  * This is why path analysis in Gremlin OLAP is nasty. It is also nasty in 
OLTP, but you have the benefit of lazy computations/barriers and thus, don’t 
suffer the same memory problems.
  * With ASP, traversers can stall waiting to process until memory is freed up 
and thus, we get lazy evaluation in this model.
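A minimal sketch of why bulking tames the message pool — and why path history defeats it. The string “keys” standing in for traverser state are an invented simplification: without history the key is just the current location, so traversers collapse into one counter; with history every key is unique:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BulkingSketch {

    // Bulk traversers by their full state key: identical traversers merge
    // into a single entry carrying a count instead of many duplicate messages.
    static Map<String, Long> bulk(List<String> traverserKeys) {
        Map<String, Long> bulks = new HashMap<>();
        for (String key : traverserKeys) bulks.merge(key, 1L, Long::sum);
        return bulks;
    }

    public static void main(String[] args) {
        // Three traversers all sitting at v[3]: without history they bulk to one entry.
        System.out.println(bulk(List.of("v[3]", "v[3]", "v[3]")));
        // With labeled-path history each key is unique -- nothing bulks.
        System.out.println(bulk(List.of("v[3]:a=v[1]", "v[3]:a=v[2]", "v[3]:a=v[4]")));
    }
}
```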

----

So, traversals of the form {{g.V().out().count()}} are very conducive to linear 
scan data processing like we see with Spark/Hadoop/etc. You touch everything — 
and/or you know what you will touch on each iteration. There are still other 
issues, but this is where I think Gremlin OLAP is better than distributed 
Gremlin OLTP (in theory thus far).

Now, suppose

{code}
g.V().match(
  __.as('a').out().as('b'),
  __.as('b').out().as('c'),
  __.as('c').out().as('d')).
  select('a','b','c','d')
{code}

Distributed OLTP is best for such traversals because:

1. match() uses the “cost based algorithm” Matthias developed in his Ph.D. 
thesis. Though w/ a few introspection tweaks given what we know about how the 
Gremlin VM behaves.
  * The traversers can get “out of sync” because, at a particular iteration, 
pattern A might be used for one traverser while pattern B is used for another.
    - This means any pre-determination of a subgraph to load on the next 
iteration in a Spark environment will load the data for both pattern A and B.
    - But “out of sync” is a good thing in that your match() is learning and 
sorting the patterns dynamically.

2. These types of traversals are filter heavy, where you are reducing your 
search space constantly trying to find binding patterns (as opposed to 
g.V().out().count()).
  * With Gremlin OLAP, you will still be processing at every vertex even if 
that vertex is no longer part of the reduced subgraph.
  * With Gremlin OLTP, you are “traverser centric” (not “vertex centric”) and 
thus, you are always only touching data being accessed.
    - As that search space gets smaller and smaller, you basically have 
isolated OLTP queries that just get merged into a single result set at the end.
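A rough illustration of the “touch count” gap for such filter-heavy queries — the vertex counts and per-hop active-set sizes below are made-up numbers, not measurements:

```java
import java.util.List;

public class TouchCountSketch {

    // Vertex-centric OLAP: every BSP iteration evaluates every vertex,
    // whether or not any traverser is still alive there.
    static long vertexCentricTouches(long vertexCount, int iterations) {
        return vertexCount * iterations;
    }

    // Traverser-centric OLTP: each hop only touches the vertices currently
    // holding traversers, so the cost shrinks with the search space.
    static long traverserCentricTouches(List<Long> activeVerticesPerHop) {
        return activeVerticesPerHop.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // A 1M-vertex graph and a 4-hop match() whose binding set collapses quickly.
        System.out.println(vertexCentricTouches(1_000_000L, 4)); // 4000000
        System.out.println(traverserCentricTouches(
                List.of(1_000_000L, 50_000L, 2_000L, 40L)));     // 1052040
    }
}
```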

3. match() uses LABELED_PATHs which are hard to bulk. Not as crazy as PATH, but 
still, it’s never a one-to-one bulk from vertex to traverser location.
  * This is because you need to have path history in match() — what bound to 
“a” previously? what bound to “b”? etc. etc. Traverser history/memory is the 
dev null of bulking. No way around it.
  * PathRetractionStrategy is great at pruning paths, but still it depends on 
how many variables you want in your select().
  * Gremlin OLAP may fall flat on its face with memory issues if you have lots 
of variables (a.out.b, b.in.c, c.blah.d, d.out.e, etc.).
  * Gremlin OLTP is able to use lazy barriers and never have such memory issues 
(at the cost of time as you are “smearing” your bulks through time).
        
Point being, Gremlin OLAP is very “heavy handed.” This is great for 
{{g.V().out().count()}}. This is not so great for what may look like a 
“{{g.V()}}” query but then gets whittled down to only a few subgraphs across the 
cluster — exactly the type of behaviors you get with {{match()}}.


> Distributed OLTP Traversals and the Introduction of Partition Concept
> ---------------------------------------------------------------------
>
>                 Key: TINKERPOP-1564
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1564
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: driver, process, server
>    Affects Versions: 3.2.3
>            Reporter: Marko A. Rodriguez
>
> This proposal unifies OLTP and OLAP into a single framework that removes the 
> need for OLAP {{GraphComputer}} by introducing distributed, data local 
> processing to OLTP. In essence, this is a proposal for a step-by-step query 
> routing framework within {{Traversal}}. This proposal can work across 
> machines in a cluster, threads on a machine, or hierarchically across both 
> machines and threads. The example presented will discuss distribution across 
> machines in a cluster as it’s the most complicated scenario.
> Currently, an OLTP traversal executes at a particular machine (or thread) and 
> pulls vertex/edge/etc. data to it accordingly in order to solve the 
> traversal. In OLAP, the traversal is cloned and distributed to all machines 
> in the cluster and traversals communicate with one another by sending 
> {{Traversers}} (i.e. messages) between themselves ensuring data local 
> processing. Given recent advancements in GremlinServer and 
> {{RemoteTraversal}}, it is possible to add traverser routing to OLTP and 
> thus, effect the computational paradigm of Gremlin OLAP in Gremlin OLTP with 
> some added benefits not possible in Gremlin OLAP.
> Assume a 4 machine cluster and the following traversal:
> {code}
> g.V(1).out('knows').has('age',gt(20)).out('likes').values('name')
> {code}
> Every time there is a "walk" (adjacency), it is possible that the 
> {{Traverser}} is no longer accessing data local to the current machine. In 
> order to do data local query routing, every adjacency would feed into a 
> {{PartitionStep}}. The traversal above would be cloned (via {{Bytecode}} 
> distribution) across the cluster where "sibling" {{PartitionSteps}} would 
> have network access to one another using the same protocol of 
> {{RemoteConnection}} though called {{PartitionConnection}}. Thus, given the 4 
> node cluster example, the above traversal would be overlaid as below. Note 
> that {{partition()}} would not be a new step in the language, but simply 
> provided here to show where {{PartitionStrategy}} would insert 
> {{PartitionSteps}} into the traversal.
> {code}
> g.V(1).out('knows').partition().has('age',gt(20)).out('likes').partition().values('name').partition()
>                         |                                          |                          ^
>     __.out('knows').partition().has('age',gt(20)).out('likes').partition().values('name').partition()
>                         |                                          |                          |
>     __.out('knows').partition().has('age',gt(20)).out('likes').partition().values('name').partition()
>                         |                                          |                          |
>     __.out('knows').partition().has('age',gt(20)).out('likes').partition().values('name').partition()
> {code}
> The top traversal is called the "master traversal" and the other three 
> "worker traversals." Note that this is identical to current Gremlin OLAP. 
> Now, the master traversal would be the traversal that is {{.next()}}'d for 
> results. So, when the "master traversal" is {{next()}}'d, {{g.V(1)}} will 
> fetch {{v[1]}} and then its outgoing knows-adjacencies. These adjacent 
> "reference vertices" would be fed into the first {{partition()}} and a "routing 
> algorithm" would determine where in the cluster the particular vertex's data 
> is. Thus, {{partition()}} ({{PartitionStep}}) serves as a router, pushing 
> {{Traversers}} local to the data. Finally, note that the final 
> {{PartitionSteps}} can only feed back to the "master traversal" for ultimate 
> aggregation and return to the user. 
> TinkerPop currently has all the structures in place to make this possible:
>       1. Encapsulation of computational metadata via {{Traverser}}.
>       2. The ability to detach {{Traversers}} and migrate/serialize them via 
> {{Traverser.detach()}} and {{Traverser.attach()}}.
>       3. The concept of {{ReferenceElement}} so the traverser only carries 
> with it enough information to re-attach at the remote site.
>       4. {{Bytecode}} and the ability to send {{Traversals}} across the 
> cluster.
>       5. GremlinServer and {{Client}}/{{Cluster}} messaging protocol.
> What does {{PartitionStep}} look like? *Please see comments below*
> Here are the benefits of this model:
> * Gremlin OLTP is Gremlin OLAP. The semantics of Gremlin OLAP are exactly 
> what is proposed here but with the added benefit that message passing happens 
> at the partition/subgraph level, not the star vertex level.
> * There is no need for {{SparkGraphComputer}} as GremlinServer now plays the 
> role of SparkServer. The added benefit, no pulling data from the graph 
> database and re-representing it in an RDD or {{SequenceFile}}.
> * No longer are "local children traversals" the boundary for "OLAP." Local 
> children can be processed beyond the star graph, though this would require 
> pulling data from a remote machine when necessary. However, given a good graph 
> partitioning algorithm, local children will most likely NOT leave the 
> subgraph partition and thus, will remain a local computation.
> * Failover is already built into the architecture. If a {{PartitionStep}} can 
> not be accessed, but the machine's data is still available (perhaps via 
> replication), then data will simply be pulled over the wire instead of 
> traversers routed to the "dead node." 
> * The infrastructure for side-effects and reducing barrier steps already 
> implemented for Gremlin OLAP would automatically work for distributed Gremlin 
> OLTP.
> * If the entire graph is hot in-memory across the cluster, then distributed 
> in-memory graph computing is possible. Again, no more linear-scans over 
> partitions like with Giraph/Spark/etc. ({{GraphComputer}}).
> * If transactions are worked out, then distributed OLTP Gremlin provides 
> mutation capabilities (something currently not implemented for 
> {{GraphComputer}}). That is, {{addV}}, {{addE}}, {{drop}}, etc. just work. 
> **Caveat: transactions in this environment across GremlinServer seem 
> difficult.**
> So that’s that. This could very well be the future of Gremlin OLAP. The 
> disjoint between OLAP and OLTP would go away, the codebase would be 
> simplified, and the computational gains in terms of performance and 
> expressivity would be great. This is a big deal idea.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
