Hello,
For TinkerPop 3.3.0, I think we should clean some things up in GraphComputer.
This came out of Kuppitz using program() in complex ways and running into
awkward spots that I believe we should fix. In particular:
https://issues.apache.org/jira/browse/TINKERPOP-1309
https://issues.apache.org/jira/browse/TINKERPOP-1306
How do we do this?
1. Configuration and Memory should always play together to ensure that
job chaining works.
* memory = new ConfigurationMemory(configuration)
* memory.store(configuration)
2. In Hadoop, Memory should be persisted as such:
* hdfs.ls("output")
==>graph
==>memory
* This then exactly mirrors what ComputerResult returns, which
is basically a Pair<Graph,Memory>.
* Since we are breaking things anyway, ~g as the directory name
should go away; it should just be graph. (~g is a historical leftover from
when g was the graph.)
* We should provide ComputerResult result = Storage.result("output").
3. We should deprecate the MapReduce API and simply use Memory.
* TraversalVertexProgram (arguably the most complex
VertexProgram) no longer uses the MapReduce API; all reductions go through Memory.
* Everything will then just be Graph (distributed/workers) or
Memory (local/master).
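To make point 1 concrete, here is a minimal sketch of the Configuration/Memory round trip. It is only an illustration under assumptions: a plain java.util.Map stands in for the Configuration so that it runs without TinkerPop on the classpath, and ConfigurationMemory, its store() method, and the "memory." key prefix are hypothetical names, not existing API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a Memory whose contents can be written to, and rebuilt
// from, a configuration. A Map<String, Object> stands in for the Configuration.
class ConfigurationMemory {
    // Assumed key prefix that separates memory entries from other settings.
    static final String PREFIX = "memory.";

    private final Map<String, Object> values = new HashMap<>();

    ConfigurationMemory() {
    }

    // Rebuild the memory of a previous job from its configuration.
    ConfigurationMemory(Map<String, Object> configuration) {
        for (Map.Entry<String, Object> e : configuration.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                values.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
    }

    void set(String key, Object value) {
        values.put(key, value);
    }

    Object get(String key) {
        return values.get(key);
    }

    // Write every memory entry into the configuration handed to the next job.
    void store(Map<String, Object> configuration) {
        for (Map.Entry<String, Object> e : values.entrySet()) {
            configuration.put(PREFIX + e.getKey(), e.getValue());
        }
    }
}

class JobChainingSketch {
    public static void main(String[] args) {
        // Job 1 finishes with some memory.
        ConfigurationMemory first = new ConfigurationMemory();
        first.set("iterations", 30);

        // The memory is stored next to other (unrelated, made-up) settings.
        Map<String, Object> configuration = new HashMap<>();
        configuration.put("graph.location", "output/graph");
        first.store(configuration);

        // Job 2 rebuilds the memory from the same configuration.
        ConfigurationMemory second = new ConfigurationMemory(configuration);
        System.out.println(second.get("iterations")); // prints 30
    }
}
```

The point of the prefix is that the next job in the chain sees only the memory entries and not the rest of the configuration.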
This will help us to clean up a bunch of ambiguity in the API. Everything in
OLAP is just about Graph, Memory, and VertexProgram. VertexPrograms are able to
access previous Memory representations via Configuration (for OLAP chains).
This makes sense since VertexProgram.load() takes two arguments, Graph and
Configuration, where Memory is a subset of the properties in Configuration. The
MapReduce API was added to allow people to post-process their graph after a
VertexProgram had executed. However, Memory is more powerful: it can be
modulated at each iteration and it can broadcast results throughout the
cluster. Its drawback is that Memory can only be used for data that fits on a
single machine (counters, reductions, etc.). MapReduce, by contrast, stored
results in a distributed manner, but it had no way to broadcast them and was
thus only useful after a computation, not during one. I think we should make a
hard break and get rid of MapReduce.
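To illustrate why a reduction in Memory is useful during a computation and not only after it, here is a minimal sketch. ReducingMemory and its register/add/get methods are stand-ins made up for this example, not the TinkerPop Memory API, and the workers are simulated with plain loops over an array of vertex degrees.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical stand-in for Memory: each key has a reducer that merges the
// values workers add during an iteration.
class ReducingMemory {
    private final Map<String, BinaryOperator<Long>> reducers = new HashMap<>();
    private final Map<String, Long> values = new HashMap<>();

    void register(String key, BinaryOperator<Long> reducer, long initial) {
        reducers.put(key, reducer);
        values.put(key, initial);
    }

    // Called by workers; the reducer folds the new value into the current one.
    void add(String key, long value) {
        values.merge(key, value, reducers.get(key));
    }

    // Readable by every worker in the next iteration (the "broadcast").
    long get(String key) {
        return values.get(key);
    }
}

class MemoryReductionSketch {
    // Counts the vertices that have the maximum degree, in two iterations.
    static long countVerticesWithMaxDegree(int[] degrees) {
        ReducingMemory memory = new ReducingMemory();
        memory.register("maxDegree", Math::max, 0L);
        memory.register("count", Long::sum, 0L);

        // Iteration 1: every vertex contributes its degree; memory keeps the maximum.
        for (int degree : degrees) {
            memory.add("maxDegree", degree);
        }

        // Iteration 2: every vertex reads the reduced value and feeds a counter.
        long max = memory.get("maxDegree");
        for (int degree : degrees) {
            if (degree == max) {
                memory.add("count", 1L);
            }
        }
        return memory.get("count");
    }

    public static void main(String[] args) {
        System.out.println(countVerticesWithMaxDegree(new int[] {1, 3, 2, 3})); // prints 2
    }
}
```

The second iteration depends on a value reduced in the first, which is exactly what a post-processing MapReduce job could not offer.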
Finally, while VertexPrograms have nothing to do with Traversals, Traversals
are fundamental to TinkerPop. With that, I think we should have helpers like:
traversal = ConfigurationTraversal.load(configuration)
ConfigurationTraversal.store(configuration, traversal)
ConfigurationTraversal.synchronizeSideEffectsAndMemory(traversal, memory)
This way Traversals are populated through Configurations (as they currently
are), but we make it easy for people to get the traversal, configure the memory
that will be used for traversal sideEffects, etc.
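As a rough sketch of the serialization boilerplate such a helper would hide, the following covers only load and store. Everything in it is an assumption for illustration: a Map stands in for the Configuration, any Serializable object stands in for the Traversal, the "traversal" key is made up, and Java serialization with Base64 is just one possible encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed helper. A Map stands in for the
// Configuration and any Serializable object stands in for the Traversal.
class ConfigurationTraversalSketch {
    // Assumed configuration key; the real key name is not decided here.
    static final String KEY = "traversal";

    // Serialize the traversal and store it as a Base64 string under KEY.
    static void store(Map<String, Object> configuration, Serializable traversal) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(traversal);
            }
            configuration.put(KEY, Base64.getEncoder().encodeToString(bytes.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the Base64 string under KEY and deserialize it back into an object.
    static Object load(Map<String, Object> configuration) {
        byte[] bytes = Base64.getDecoder().decode((String) configuration.get(KEY));
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A list of step names plays the role of a traversal here.
        ArrayList<String> traversal = new ArrayList<>(Arrays.asList("V", "out", "count"));
        Map<String, Object> configuration = new HashMap<>();
        store(configuration, traversal);
        System.out.println(load(configuration)); // prints [V, out, count]
    }
}
```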
In short, with some cleanup and thought, we should be able to make it easier
for people to write complex VertexProgram chains without a lot of nasty
serialization/configuration boilerplate.
Thoughts?
Marko.
http://markorodriguez.com