Marko A. Rodriguez created TINKERPOP-1250:
---------------------------------------------

             Summary: Support Subgraph-Centric GraphComputer
                 Key: TINKERPOP-1250
                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1250
             Project: TinkerPop
          Issue Type: New Feature
          Components: process
    Affects Versions: 3.2.0-incubating
            Reporter: Marko A. Rodriguez


Right now, {{GraphComputer}} and {{VertexPrograms}} are "vertex-centric." That 
is, the boundary of an atomic unit of computation is a single vertex, its 
properties, and its incident edges. We should support "subgraph-centric" 
computing where each worker's partition is loaded in memory (RAM) as a 
connected subgraph. For those vertices that are not in its partition, a special 
"reference vertex" is use to reference it. What this would mean is that when a 
{{Traverser}} is processing, it can continue to evaluate as long as its within 
the subgraph partition. The moment it references a vertex not in the partition 
(a "reference vertex"), it serializes itself as a message.

This would greatly increase the speed of Gremlin OLAP at the cost of requiring 
a large amount of memory to store all the worker partition subgraphs in RAM. 
There might even be a way to have a hybrid-model where some of the partition is 
held in RAM and the other is (even though still in the same partition) is 
stored as "star vertices."

How would this be added to {{GraphComputer}} in a backwards compatible way?

{code}
GraphComputer.supportsVertexCentricComputing)() // currently true for all 
implementations
GraphComputer.supportsSubgraphCentricComputing()
{code}

A {{VertexProgram}} could then have the following method:

{code}
boolean VertexProgram.withinCentricity(final M message)
{code}

If the message is NOT within "centricity", then it is serialized and 
distributed, else it continues to execute.

I haven't thought through all the API and implementation considerations. Though 
it would be good to make this backwards compatible and, moving forward, able to 
support "edge-centric computing" and thus, have a very memory limited OLAP 
system.

How to think of the different models:

1. Vertex-centric: medium speed, medium expressivity, medium memory costs.
2. Subgraph-centric: high speed, high expressivity, high memory costs.
3. Edge-centric: high speed, low expressivity, low memory costs.

What is "expressivity"? Well, subgraph-centric computations can have "local 
traversals" that move beyond the "star graph." Edge-centric would not support 
any `by()`-modulators as the computation is bound to a single edge.  Thus, low 
expressivity. It really depends on how you represent the edge-centric edge 
list. Do you have the vertices on each side of an edge duplicated with all 
their properties? This wold still be low memory costs, but you could get more 
expressivity.

Anywho, in general it would be nice if the underlying execution engine can 
handle the three common distributed graph computing paradigms. Subgraph-centric 
seems the easiest to support at this point in time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to