[DISCUSS] Getting GraphComputer Memory straight for TinkerPop 3.3.0.

Marko Rodriguez Tue, 24 May 2016 12:36:13 -0700

Hello,

For TinkerPop 3.3.0, I think we should clean some things up in GraphComputer. 
This was prompted by Kuppitz using program() in complex ways and running into 
awkwardness that I believe we should fix. In particular:


        https://issues.apache.org/jira/browse/TINKERPOP-1309
        https://issues.apache.org/jira/browse/TINKERPOP-1306

How do we do this?

        1. Configuration and Memory should always play together to ensure
           that job chaining works.
                * memory = new ConfigurationMemory(configuration)
                * memory.store(configuration)
        2. In Hadoop, Memory should be persisted as such:
                * hdfs.ls("output")
                    ==>graph
                    ==>memory
                * This then perfectly reflects the ComputerResult return,
                  which is basically a Pair<Graph,Memory>.
                * This means, while we are breaking things, ~g as the
                  directory should go away; it should just be graph (this is
                  historic from when g was graph).
                * We should provide
                  ComputerResult result = Storage.result("output").
        3. We should deprecate the MapReduce API and simply use Memory.
                * TraversalVertexProgram (arguably the most complex
                  VertexProgram) no longer uses the MapReduce API; all
                  reductions are via Memory.
                * Everything will then just be Graph (distributed/workers)
                  or Memory (local/master).
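
To make point 1 concrete, here is a rough sketch of the round trip I have in 
mind. It is not TinkerPop code: the class name, the plain Map standing in for 
Configuration, and the "memory." key prefix are all assumptions made only for 
illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the proposed ConfigurationMemory; not TinkerPop API.
final class ConfigurationMemorySketch {
    // Assumed prefix marking which configuration properties belong to Memory.
    static final String PREFIX = "memory.";

    private final Map<String, Object> values = new HashMap<>();

    // Mirrors: memory = new ConfigurationMemory(configuration)
    ConfigurationMemorySketch(Map<String, Object> configuration) {
        for (Map.Entry<String, Object> e : configuration.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                values.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
    }

    Object get(String key) {
        return values.get(key);
    }

    void set(String key, Object value) {
        values.put(key, value);
    }

    // Mirrors: memory.store(configuration)
    void store(Map<String, Object> configuration) {
        for (Map.Entry<String, Object> e : values.entrySet()) {
            configuration.put(PREFIX + e.getKey(), e.getValue());
        }
    }
}
```

The first job in a chain would call store() on the configuration handed to the 
second job, and the second job would rebuild its view of the previous Memory 
with the constructor.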

This will help us to clean up a bunch of ambiguity in the API. Everything in 
OLAP is just about Graph, Memory, and VertexProgram. VertexPrograms are able to 
access previous Memory representations via Configuration (for OLAP chains). 
This makes sense since VertexProgram.load() takes two arguments --- Graph and 
Configuration, where Memory is a subset of the properties in Configuration. The 
MapReduce API was added to allow people to post-process their graph after a 
VertexProgram had executed. However, Memory is more powerful -- it can be 
modulated at each iteration and it can broadcast results throughout the 
cluster. The drawback is that Memory can only be used for data that can be 
stored on a single machine -- counters, reductions, etc. MapReduce stored 
results in a distributed manner, but its drawback was that there was no way to 
broadcast results, and thus it was only useful after a computation, not during. 
I think we should make a hard break and get rid of MapReduce.
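
As a toy illustration of "reduce at each iteration, then broadcast", consider 
the sketch below. It is plain Java, not the GraphComputer API; the class, the 
register/add/completeIteration methods, and the two-iteration example are 
assumptions chosen to show the shape of the idea.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Toy model of master-side Memory: workers add values during an iteration,
// the master reduces them, and the reduced value is visible to all workers
// in the next iteration.
final class ReducingMemorySketch {
    private final Map<String, BinaryOperator<Long>> reducers = new HashMap<>();
    private Map<String, Long> previous = new HashMap<>();      // broadcast view
    private final Map<String, Long> current = new HashMap<>(); // being reduced

    void register(String key, BinaryOperator<Long> reducer) {
        reducers.put(key, reducer);
    }

    // Called by workers; the key must have been registered first.
    void add(String key, long value) {
        current.merge(key, value, reducers.get(key));
    }

    // Called by workers; returns the value reduced in the previous iteration.
    Long get(String key) {
        return previous.get(key);
    }

    // Called by the master between iterations: publish, then reset.
    void completeIteration() {
        previous = new HashMap<>(current);
        current.clear();
    }

    // Iteration 1 finds the global maximum; iteration 2 uses the broadcast
    // maximum to count how many vertex values equal it.
    static long countOfMax(List<List<Long>> partitions) {
        ReducingMemorySketch memory = new ReducingMemorySketch();
        memory.register("max", Long::max);
        memory.register("count", Long::sum);

        for (List<Long> partition : partitions) {
            for (long value : partition) memory.add("max", value);
        }
        memory.completeIteration();

        Long max = memory.get("max");
        if (max == null) return 0L; // no vertices at all
        for (List<Long> partition : partitions) {
            for (long value : partition) {
                if (value == max) memory.add("count", 1L);
            }
        }
        memory.completeIteration();

        Long count = memory.get("count");
        return count == null ? 0L : count;
    }
}
```

The second iteration depends on a value computed in the first, which is 
exactly what a post-processing MapReduce job cannot give you.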

Finally, while VertexPrograms have nothing to do with Traversals, Traversals 
are fundamental to TinkerPop. With that, I think we should have helpers like:

        traversal = ConfigurationTraversal.load(configuration)
        ConfigurationTraversal.store(configuration, traversal)
        ConfigurationTraversal.synchronizeSideEffectsAndMemory(traversal, 
memory)

This way Traversals are populated through Configurations (as they currently 
are), but we make it easy for people to get the traversal, configure the memory 
that will be used for traversal sideEffects, etc.
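
A very rough sketch of what such helpers could look like follows. Everything 
in it is hypothetical: ToyTraversal is a stand-in for a real Traversal, the 
configuration key is made up, Java serialization plus Base64 is just one way to 
get an object into a string property, and the rule that Memory wins over an 
existing sideEffect is a guess at the semantics, not a decision.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; a plain Map stands in for Configuration and Memory.
final class ConfigurationTraversalSketch {
    static final String KEY = "sketch.traversal"; // assumed property name

    // Stand-in for a Traversal: step names plus named sideEffects.
    static final class ToyTraversal implements Serializable {
        private static final long serialVersionUID = 1L;
        final ArrayList<String> steps = new ArrayList<>();
        final HashMap<String, Object> sideEffects = new HashMap<>();
    }

    // Serialize the traversal into a single string property.
    static void store(Map<String, Object> configuration, ToyTraversal traversal) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(traversal);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        configuration.put(KEY, Base64.getEncoder().encodeToString(bytes.toByteArray()));
    }

    // Rebuild the traversal from the property written by store().
    static ToyTraversal load(Map<String, Object> configuration) {
        byte[] bytes = Base64.getDecoder().decode((String) configuration.get(KEY));
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (ToyTraversal) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Assumed rule: a value already in memory overwrites the sideEffect;
    // otherwise the sideEffect seeds memory.
    static void synchronizeSideEffectsAndMemory(ToyTraversal traversal, Map<String, Object> memory) {
        for (Map.Entry<String, Object> e : traversal.sideEffects.entrySet()) {
            if (memory.containsKey(e.getKey())) {
                e.setValue(memory.get(e.getKey()));
            } else {
                memory.put(e.getKey(), e.getValue());
            }
        }
    }
}
```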

In short, with some cleanup and thought, we should be able to make it easier 
for people to write complex VertexProgram chains without a lot of nasty 
serialization/configuration boilerplate.

Thoughts?
Marko.
        
http://markorodriguez.com
