[ https://issues.apache.org/jira/browse/TINKERPOP-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228815#comment-15228815 ]
pieter martin commented on TINKERPOP-1250: ------------------------------------------ Is the idea to use the current {{PartitionStrategy}} to define the partition boundaries? > Support Subgraph-Centric GraphComputer > -------------------------------------- > > Key: TINKERPOP-1250 > URL: https://issues.apache.org/jira/browse/TINKERPOP-1250 > Project: TinkerPop > Issue Type: Improvement > Components: process > Affects Versions: 3.2.0-incubating > Reporter: Marko A. Rodriguez > Labels: breaking > > Right now, {{GraphComputer}} and {{VertexPrograms}} are "vertex-centric." > That is, the boundary of an atomic unit of computation is a single vertex, > its properties, and its incident edges. We should support "subgraph-centric" > computing where each worker's partition is loaded in memory (RAM) as a > connected subgraph. For those vertices that are not in its partition, a > special "reference vertex" is use to reference it. What this would mean is > that when a {{Traverser}} is processing, it can continue to evaluate as long > as its within the subgraph partition. The moment it references a vertex not > in the partition (a "reference vertex"), it serializes itself as a message. > This would greatly increase the speed of Gremlin OLAP at the cost of > requiring a large amount of memory to store all the worker partition > subgraphs in RAM. There might even be a way to have a hybrid-model where some > of the partition is held in RAM and the other is (even though still in the > same partition) is stored as "star vertices." > How would this be added to {{GraphComputer}} in a backwards compatible way? > {code} > GraphComputer.supportsVertexCentricComputing)() // currently true for all > implementations > GraphComputer.supportsSubgraphCentricComputing() > {code} > A {{VertexProgram}} could then have the following method: > {code} > boolean VertexProgram.withinCentricity(final M message) > {code} > If the message is NOT within "centricity", then it is serialized and > distributed, else it continues to execute. > I haven't thought through all the API and implementation considerations. > Though it would be good to make this backwards compatible and, moving > forward, able to support "edge-centric computing" and thus, have a very > memory limited OLAP system. > How to think of the different models: > 1. Vertex-centric: medium speed, medium expressivity, medium memory costs. > 2. Subgraph-centric: high speed, high expressivity, high memory costs. > 3. Edge-centric: high speed, low expressivity, low memory costs. > What is "expressivity"? Well, subgraph-centric computations can have "local > traversals" that move beyond the "star graph." Edge-centric would not support > any `by()`-modulators as the computation is bound to a single edge. Thus, > low expressivity. It really depends on how you represent the edge-centric > edge list. Do you have the vertices on each side of an edge duplicated with > all their properties? This wold still be low memory costs, but you could get > more expressivity. > Anywho, in general it would be nice if the underlying execution engine can > handle the three common distributed graph computing paradigms. > Subgraph-centric seems the easiest to support at this point in time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)