Hi,

>> We should deprecate the MapReduce API and simply use Memory.
> 
> You mentioned that these changes were for TinkerPop 3.3.x. We've done a
> really good job imo of controlling breaking changes. For the eventual 3.3.x
> line (don't know when we want to consider starting that), I think we should
> make that our opportunity to remove a lot of the stuff we deprecated over
> the post-GA releases. I suppose that should be a separate thread of
> discussion, but I mention it because if that ends up being our intent and
> we have further intent to deprecate things like the MapReduce API, we
> should look to deprecate those things in the 3.2.x/3.1.x lines now so that
> we can go into 3.3.x without any new @deprecated stuff (we would just have
> our breaking changes start there with no dead code lying around). Should we
> shape our 3.3.x strategy around that approach?

We will always have @Deprecations coming in. I don't think we will be able to 
say "TinkerPop 3.3.x is clean and clear of all deprecations." Next, the 
MapReduce API is pretty core, and to just deprecate it now, mid-major line, 
is, I believe, a bit extreme. For 3.3.x, I want to be able to show (in the docs) 
how to do the MapReduce stuff via Memory and explain such things more in-depth.
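
As a taste of what that documentation might look like, here is a minimal
sketch against the existing 3.2.x Memory/MemoryComputeKey API of a reduction
done purely via Memory, where one might previously have bolted on a MapReduce.
The EdgeCountVertexProgram name and the "example.edgeCount" key are made up
for illustration:

    import java.util.Collections;
    import java.util.Set;

    import org.apache.tinkerpop.gremlin.process.computer.GraphComputer;
    import org.apache.tinkerpop.gremlin.process.computer.Memory;
    import org.apache.tinkerpop.gremlin.process.computer.MemoryComputeKey;
    import org.apache.tinkerpop.gremlin.process.computer.MessageScope;
    import org.apache.tinkerpop.gremlin.process.computer.Messenger;
    import org.apache.tinkerpop.gremlin.process.computer.util.StaticVertexProgram;
    import org.apache.tinkerpop.gremlin.process.traversal.Operator;
    import org.apache.tinkerpop.gremlin.structure.Direction;
    import org.apache.tinkerpop.gremlin.structure.Vertex;
    import org.apache.tinkerpop.gremlin.util.iterator.IteratorUtils;

    // counts the graph's out-edges as a pure Memory reduction (no MapReduce)
    public class EdgeCountVertexProgram extends StaticVertexProgram<Double> {

        private static final String EDGE_COUNT = "example.edgeCount";

        @Override
        public void setup(final Memory memory) {
            memory.set(EDGE_COUNT, 0L);  // the master seeds the reduction
        }

        @Override
        public void execute(final Vertex vertex, final Messenger<Double> messenger,
                            final Memory memory) {
            // each worker contributes a partial count; Operator.sum reduces them
            memory.add(EDGE_COUNT, IteratorUtils.count(vertex.edges(Direction.OUT)));
        }

        @Override
        public boolean terminate(final Memory memory) {
            return true;  // one iteration; the total simply sits in the final Memory
        }

        @Override
        public Set<MemoryComputeKey> getMemoryComputeKeys() {
            // not broadcast, not transient: a reduction kept in the ComputerResult Memory
            return Collections.singleton(
                    MemoryComputeKey.of(EDGE_COUNT, Operator.sum, false, false));
        }

        @Override
        public Set<MessageScope> getMessageScopes(final Memory memory) {
            return Collections.emptySet();  // a pure reduction needs no messaging
        }

        @Override
        public GraphComputer.ResultGraph getPreferredResultGraph() {
            return GraphComputer.ResultGraph.ORIGINAL;
        }

        @Override
        public GraphComputer.Persist getPreferredPersist() {
            return GraphComputer.Persist.NOTHING;
        }
    }

The total is then read straight off the Memory half of the ComputerResult,
e.g. graph.compute().program(new EdgeCountVertexProgram()).submit().get()
.memory().get("example.edgeCount").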

Marko.

http://markorodriguez.com



> 
> 
> On Tue, May 24, 2016 at 3:35 PM, Marko Rodriguez <[email protected]>
> wrote:
> 
>> Hello,
>> 
>> For TinkerPop 3.3.0, I think we should clean some things up in
>> GraphComputer. This desire started with Kuppitz using program() in complex
>> ways and running into awkwardnesses that I believe we should fix. In
>> particular:
>> 
>>        https://issues.apache.org/jira/browse/TINKERPOP-1309
>>        https://issues.apache.org/jira/browse/TINKERPOP-1306
>> 
>> How do we do this?
>> 
>>        1. Configuration and Memory should always play together to ensure
>> that job chaining works.
>>                * memory = new ConfigurationMemory(configuration)
>>                * memory.store(configuration)
>>        2. In Hadoop, Memory should be persisted as such:
>>                * hdfs.ls("output")
>>                    ==>graph
>>                    ==>memory
>>                * This then perfectly reflects the ComputerResult return
>> which is basically a Pair<Graph,Memory>
>>                * This means that, while we are breaking things, ~g as the
>> directory name should go away; it should just be graph (this is historic
>> from when g was used for graph).
>>                * We should provide ComputerResult result =
>> Storage.result("output").
>>        3. We should deprecate the MapReduce API and simply use Memory.
>>                * TraversalVertexProgram (arguably the most complex
>> VertexProgram) no longer uses the MapReduce API; all reductions are via
>> Memory.
>>                * Everything will then just be Graph (distributed/workers)
>> or Memory (local/master).
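
To make items 1 and 2 concrete, a rough sketch of how a chained job might
look under this proposal. ConfigurationMemory, Memory.store(Configuration),
and Storage.result() are the proposed, not-yet-existing pieces; ComputerResult
and graph.compute() are the existing API:

    // job 1: run a VertexProgram; ComputerResult is essentially Pair<Graph,Memory>
    ComputerResult result1 = graph.compute().program(programA).submit().get();

    // proposed: fold job 1's Memory into the Configuration that job 2 is loaded with
    Configuration configuration = new BaseConfiguration();
    result1.memory().store(configuration);                     // proposed Memory.store()
    Memory previous = new ConfigurationMemory(configuration);  // proposed wrapper for job 2

    // proposed: on Hadoop the same Graph/Memory pair is what lands on disk,
    //   hdfs.ls("output") ==> graph, memory
    // and can be rehydrated later without re-running the job:
    ComputerResult result2 = Storage.result("output");         // proposed helper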
>> 
>> This will help us to clean up a bunch of ambiguity in the API. Everything
>> in OLAP is just about Graph, Memory, and VertexProgram. VertexPrograms are
>> able to access previous Memory representations via Configuration (for OLAP
>> chains). This makes sense since VertexProgram.load() takes two arguments
>> --- Graph and Configuration, where Memory is a subset of the properties in
>> Configuration. The MapReduce API was added to allow people to post-process
>> their graph after a VertexProgram had executed. However, Memory is more
>> powerful -- it can be modulated at each iteration and it can broadcast
>> results throughout the cluster. The drawback: Memory can only be used for
>> data that can be stored on a single machine -- counters, reductions, etc.
>> While MapReduce stores results in a distributed manner, the drawback is
>> that there is no way to broadcast results and thus it is only useful after
>> a computation, not during. I think we should make a hard break and get rid
>> of MapReduce.
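
For a concrete picture of "modulated at each iteration," a sketch of the
vote-to-halt idiom, excerpted from a hypothetical VertexProgram (the key name
is illustrative; the calls are the existing 3.2.x Memory/MemoryComputeKey
API, imports as in the sketch earlier in this thread):

    private static final String VOTE_TO_HALT = "example.voteToHalt";

    @Override
    public Set<MemoryComputeKey> getMemoryComputeKeys() {
        // Operator.and reduces every worker's vote each iteration;
        // isBroadcast=true would additionally let workers read the value in
        // execute(); isTransient=true drops it from the final Memory
        return Collections.singleton(
                MemoryComputeKey.of(VOTE_TO_HALT, Operator.and, false, true));
    }

    @Override
    public void setup(final Memory memory) {
        memory.set(VOTE_TO_HALT, true);
    }

    @Override
    public void execute(final Vertex vertex, final Messenger<Double> messenger,
                        final Memory memory) {
        // each worker votes to halt when it has no more incoming messages
        memory.add(VOTE_TO_HALT, !messenger.receiveMessages().hasNext());
    }

    @Override
    public boolean terminate(final Memory memory) {
        if (memory.<Boolean>get(VOTE_TO_HALT))
            return true;
        memory.set(VOTE_TO_HALT, true);  // master resets the vote for the next iteration
        return false;
    }

The decision happens during the computation, once per iteration, which a
post-hoc MapReduce simply cannot express.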
>> 
>> Finally, while VertexPrograms have nothing to do with Traversals,
>> Traversals are fundamental to TinkerPop. With that, I think we should have
>> helpers like:
>> 
>>        traversal = ConfigurationTraversal.load(configuration)
>>        ConfigurationTraversal.store(configuration, traversal)
>>        ConfigurationTraversal.synchronizeSideEffectsAndMemory(traversal,
>> memory)
>> 
>> This way Traversals are populated through Configurations (as they
>> currently are), but we make it easy for people to get the traversal,
>> configure the memory that will be used for traversal sideEffects, etc.
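
A rough sketch of how a traversal-driven VertexProgram might then wire this
up (ConfigurationTraversal and its three methods are only the proposal above
and do not exist yet; the overridden methods are the existing VertexProgram
contract):

    private Traversal.Admin<?, ?> traversal;

    @Override
    public void loadState(final Graph graph, final Configuration configuration) {
        this.traversal = ConfigurationTraversal.load(configuration);           // proposed
    }

    @Override
    public void storeState(final Configuration configuration) {
        ConfigurationTraversal.store(configuration, this.traversal);           // proposed
    }

    @Override
    public void setup(final Memory memory) {
        // proposed: sideEffects declared on the traversal get backed by Memory
        // keys so that reductions flow through Memory rather than MapReduce
        ConfigurationTraversal.synchronizeSideEffectsAndMemory(this.traversal, memory);
    }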
>> 
>> In short, with some cleanup and thought, we should be able to make it
>> easier for people to write complex VertexProgram chains without a lot of
>> nasty serialization/configuration boilerplate.
>> 
>> Thoughts?
>> Marko.
>> 
>> http://markorodriguez.com
>> 
>> 
