[ https://issues.apache.org/jira/browse/TINKERPOP-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169370#comment-15169370 ]
ASF GitHub Bot commented on TINKERPOP-1166: ------------------------------------------- GitHub user okram opened a pull request: https://github.com/apache/incubator-tinkerpop/pull/243 TINKERPOP-1166, TINKERPOP-1164, TINKERPOP-1057, TINKERPOP-1162 https://issues.apache.org/jira/browse/TINKERPOP-1166 https://issues.apache.org/jira/browse/TINKERPOP-1164 https://issues.apache.org/jira/browse/TINKERPOP-1057 https://issues.apache.org/jira/browse/TINKERPOP-1162 This PR has all these tickets combined because in order to solve any I needed to make sure all these ticket's respective solutions played well with one another. In summary, GraphComputer `Memory` has come to the center stage as a replacement for `MapReduce`. The benefit is that `Memory` can compute/reduce in parallel with a VertexProgram's execution run. This has the following benefits: * Reductions happen within the vertex program and thus, no subsequent rescan of the full graph needed to generate sideEffect. For example -- `g.V().count()` is one pass through the graph, not two. * Because reductions happen within the vertex program, barrier steps are no longer the cut-off point for an OLAP traversal. An OLAP traversal can now have multiple reducing barrier steps within it (e.g. `max()`, `groupCount()`, `min()`, `group()`, etc.). Its all one job. * Vertex compute keys can be marked transient and are automagically removed from the resultant graph. This is good for removing "scratch data." (e.g. PageRankVertexProgram.EDGE_COUNTS). * Memory compute keys can be marked transient and are automagically remove from the result memory. This is good for removing "scratch data." (e.g. VOTE_TO_HALT). * Memory compute keys can be declared to NOT broadcast and thus, no be sent to the workers on each iteration. Workers can still send data, but just can not read data. Finally, one of the major side-effect benefits of this work is that numerous traversals that were considered "illegal" by `ComputerVerificationStrategy` are no longer illegal. The only types of traversals that are illegal in OLAP are those that have `by()`-modulators that go beyond the local star graph or are path-based (`select()` and `path()`) and go beyond the element id in their `by()`-modulations. This work creates breaking changes for both users (trivial) and providers (intense). However, for providers, its only those providers that have their own custom `GraphComputer` implementation. If they use `SparkGraphComputer` or `GiraphGraphComputer`, no work is required of them. CHANGELOG ``` * Added `MemoryComputeKey` for specification of `Memory` keys in `VertexProgram`. (*breaking*) * Added `VertexComputeKey` for specification of vertex compute properties in `VertexProgram`. (*breaking*) * Added `and`, `or`, and `addAll` to `Operator`. * `Memory` API changed to support setting and adding values for reduction. (*breaking*) * `Memory` keys can be marked as broadcast and only those values are sent to workers on each iterator. * `Memory` keys can be marked transient and thus deleted at the end of the OLAP job. * Vertex compute keys can be marked transient and thus deleted at the end of the OLAP job. * `VertexProgram` API changed to support `MemoryComputeKey` and `VertexComputeKey`. (*breaking*) * `TraversalVertexProgram` able to execute OLAP and OLTP traversal sections dynamically within the same job. * Removed `FinalGet` interface as all post processing of reductions should be handled by the reducing step explicitly. (*breaking*) * Greatly simplified all `ReducingBarrierStep` implementations as they no longer require `MapReduce` in OLAP. * Nearly all steps in OLAP that used `MapReduce` now use `Memory` to do their reductions which expands the list of legal traversals. * `GroupStep` simplified with `GroupHelper.GroupMap` no longer being needed. Related to the removal of `FinalGet`. * OLAP side-effects that are no longer generated by `MapReduce` are simply stored in `ComputerResult.Memory` w/ no disk persistence needed. (*breaking*) ``` UPGRADE-PROVIDERS ``` GraphComputer updates to semantics and API ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Providers that have a custom `GraphComputer` implementation will have a lot to handle. Note that if the graph system simply uses `SparkGraphComputer` or `GiraphGraphComputer` provided by TinkerPop, then no updates are required. `Memory` updates: * Any `BinaryOperator` can be used for reduction and is made explicit in the `MemoryComputeKey`. * `MemoryComputeKeys` can be marked transient and must be removed from the resultant `ComputerResult.memory()`. * `MemoryComputeKeys` can be specified to not broadcast and thus, must not be available to workers to read in `VertexProgram.execute()`. * The `Memory` API has been changed. No more `incr()`, `and()`, etc. Now its just `set()` (setup/terminate) and `add()` (execute). See TINKERPOP-1166:https://issues.apache.org/jira/browse/TINKERPOP-1166 Operational semantic test cases have been added to `GraphComputerTest` to ensure that all the above behaviors are implemented correctly. Providers that have a custom `ReducingBarrierStep` implementation will need to adjust their implementation slightly to accommodate a new API that reflects the `Memory` updates above. This should be a simple change. Note that `FinalGet` no longer exists and such post-reduction processing is handled by the reducing step. See TINKERPOP-1164:https://issues.apache.org/jira/browse/TINKERPOP-1164 ``` *UPGRADE-USERS* ``` VertexProgram and MemoryComputeKey and VertexComputeKey ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Users that have custom `VertexProgram` implementations will need to change their implementations to support the new `VertexComputeKey` and `MemoryComputeKey` classes. In the `VertexPrograms` provided by TinkerPop, these changes were trivial, taking less than 5 minutes to make the updates. * `VertexProgram.getVertexComputeKeys()` returns a `Set<VertexComputeKey>`. No longer a `Set<String>`. Use `VertexComputeKey.of(String key,boolean transient)` to generate a `VertexComputeKey`. Transient keys were not supported in the past, so to make the implementation semantically equivalent, the boolean transient should be false. * `VertexProgram.getMemoryComputeKeys()` returns a `Set<MemoryComputeKey>`. No longer a `Set<String>`. Use `MemoryComputeKey.of(String key, BinaryOperator reducer, boolean broadcast, boolean transient)` to generate a `MemoryComputeKey`. Broadcasting and transients were not supported in the past so to make the implementation semantically equivalent, to boolean broadcast should be true and the boolean transient should be false. See TINKERPOP-1162:https://issues.apache.org/jira/browse/TINKERPOP-1162 *SparkGraphComputer and GiraphGraphComputer Persistence* ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Most of the `MapReduce`-based steps in `TraversalVertexProgram` have been removed and replaced using the new `Memory`-reduction model. `MapReduce` jobs always created a persistence footprint (e.g. in HDFS). Memory data was never persisted to HDFS. As such, there will not be data on the disk that is accessible. For instance, no more `~reducing`, `~traversers`, and specially named side-effects such as those from `groupCount('m')`. The data is still accessible via `ComputerResult.memory()`, its just simply does not have a corresponding on-disk representation. ``` *CHANGELOG* ``` * Added `MemoryComputeKey` for specification of `Memory` keys in `VertexProgram`. (*breaking*) * Added `VertexComputeKey` for specification of vertex compute properties in `VertexProgram`. (*breaking*) * Added `and`, `or`, and `addAll` to `Operator`. * `Memory` API changed to support setting and adding values for reduction. (*breaking*) * `Memory` keys can be marked as broadcast and only those values are sent to workers on each iterator. * `Memory` keys can be marked transient and thus deleted at the end of the OLAP job. * Vertex compute keys can be marked transient and thus deleted at the end of the OLAP job. * `VertexProgram` API changed to support `MemoryComputeKey` and `VertexComputeKey`. (*breaking*) * `TraversalVertexProgram` able to execute OLAP and OLTP traversal sections dynamically within the same job. * Removed `FinalGet` interface as all post processing of reductions should be handled by the reducing step explicitly. (*breaking*) * Greatly simplified all `ReducingBarrierStep` implementations as they no longer require `MapReduce` in OLAP. * Nearly all steps in OLAP that used `MapReduce` now use `Memory` to do their reductions which expands the list of legal traversals. * `GroupStep` simplified with `GroupHelper.GroupMap` no longer being needed. Related to the removal of `FinalGet`. * OLAP side-effects that are no longer generated by `MapReduce` are simply stored in `ComputerResult.Memory` w/ no disk persistence needed. (*breaking*) ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1166 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-tinkerpop/pull/243.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #243 ---- commit b50a43ce781572a1610fa3e31b5132205796af67 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-19T22:48:50Z Migrated over to the proposed Memory model of using registered BinaryOperator reducers. It was really easy to change so thats good. All test cases pass for TinkerGraphComputer, one fails in SparkGraphComputer, and I have some NullPointer serialiation issue with GiraphGraphComputer that I will fix later. commit 0ae584cae51ab15eef7de776cf8c049b64ace852 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-19T23:45:18Z couldn't help myself. once I walked away from the computer I realized how to make GiraphMemory work. Got it working and test cases passing. Only one test fails in SparkGraphComputer. I will handle THAT next week. commit cf79c2255028d3a2a7dabb4198030b50bfb65417 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-20T02:35:52Z minor tweak. commit 7bf8213a5e27026f9a378a0eb166f3a67038321f Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-20T17:03:25Z fixed the SparkMemory issue from last night. Wow. I was really expecting this ticket to take me all of next week. Knocked it out before the week even began. Bow to me all you peons. commit 706759c1dbf9df6bf9210913001658ca9f9ff513 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-22T16:50:06Z added GraphComputerTest.shouldSupportTransientKeys(). Ensured that only Memory.set() is allowed in setup()/terminate() and Memory.add() in execute(). Fixed up SparkGraphComputer, TinkerGraphComputer, and GiraphSparkComputer to respect the new MemoryComputeKey semantics and transient key semantics. commit b03c1adc466fa48fffcfeb7e17b71a243f66b76f Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-22T17:38:50Z EDGE_COUNT and VOTE_STRENGTH in PageRankVertexProgram and PeerPressureVertexProgram are not transient and the respective property keys are private static. Extended GraphComputerTest.shouldSupportTransientKeys() with a MapReduce job that ensure that the transient vertex properties are not accessible during MapReduce. commit 65bbdaa336b4034421f7e2599bb3f2c307aa4773 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-22T21:41:02Z Have all the ReducingBarrierSteps no longer implement MapReduce and using MemoryComputeKeys for their reduction. GroupStep, FoldStep, and GroupStepV3d0 are not complete yet. Having the darndest time with GroupStep -- once I get it, then the others will follow from it. Pushing to save work. commit 3389122591a788ceb566f9bdf5a1ba6b72789faa Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-22T22:15:21Z Got GroupStep working --- buts its a hack unfortunately. Pushing to save work. commit 9f8252b816432249a0667e654b682293b77ec3c1 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-23T00:21:48Z Fixed a bug in FoldStep, got GroupStep working perfectly (both OLTP and OLAP -- but there is still one awakward hack). Need to spend some more cycles on GroupStep and then once I get that clean, clean, clean, I will map that pattern over to GroupStepV3d0 and that will be that. commit 90debb4fed3bffa5833543566620bcf0a30ae7f4 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-23T01:15:09Z some more work on GroupStep ---- converging on final solution. commit 9416e9498a363e8ca1ef69df6cc0d046762e43e3 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-23T17:25:36Z New Reducing-based Memory-model implemented. A few kooky things emerged because of this and will discuss in UPGRADE docs. However, all-in-all this is a much nicer model which will lead to significant perofmrance improvements (still need to benchmark and test). I don't think we will ever deprecate the MapReduce infrastructure. The Memory model is not as flexible and efficient (for certain jobs) as MapReduce. commit d77bf58c8e60f52f180af21ef18091c3cb143c3b Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-23T20:40:04Z Solved the GroupStep-hack. The whole notion of a FinalGet is bad. Every ReducingBarrier step now implements generateFinalReduction() with the default being the identity function. For GroupStep, MeanStep, and GroupStepV3d0, they do the final respective reduction. Also, GroupStep is a bit more organized -- only one BinaryOperator needed. commit 14d0d48435640572523d942d56b9a7a6dd128969 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-23T23:21:02Z The next big thing in the new MemoryComputeKey model -- broadcasting. It is possible to state that a MemoryKey will NOT be read by workers and thus, no need to send the data to the workers on each iteration. Added GraphComputerTest.shouldSupportBroadcastKeys(). SparkGraphComputer supports this natively, TinkerGraphComputer simply hides the data when trying to be accessed by workers, and GiraphGraphComputer (like TinkerGraphComputer) but I will be able to at least clear the data immediately once its sent (future work). Cleaned up GroupStep a bit. Have a consistent naming convention for workers vs. master --- inExecute. All XXXMemory implementations use it so its easier to see how they all relate to each other. Added a GraphComputerTest to make sure exceptions are correct around get(), add(), set() for various situations -- found a couple of inconsistencies that are now fixed up. commit 90e862d76ee23d8e03f1b954be4276cd09f707ed Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-23T23:21:35Z Merge branch 'master' into TINKERPOP-1166 commit e6adbecfc401c4f41705216fecfac74ff04b5c8e Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-24T00:37:23Z Needed a no-args constructor for FoldBiOperator in GiraphGraphComputer. commit aa79390f7afd78091755784753ed67a311f1d7da Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-24T20:52:29Z Okay. So MapReduce in TraversalVertexProgram is going away fast. GroupSideEffectStep and GroupStep are now one step -- GroupStep. Likewise for GroupCountStep. groupCount() is simply groupCount().cap(). This makes the code alot simpler and easier to optimize everything in one spot. With this direction, TraversalVertexProgram will be able to do OLAP ... then in terminate(), do reductions. If those reductions yield traversers, termiate() == false, and we distributed messages again. Thus, we will have OLTP->OLAP->OLTP->OLAP all possible within in a single TraveraslVertexProgram. The idea is that the master worker (termiante()) will do OLTP processing until it needs to go back to OLAP (if necessary). There is still lots more work to do. This push is to save work. commit 980525cd2d7ee656858f68703635972c740486fa Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-24T22:13:09Z StoreStep no longer uses MapReduce. All that is left is AggregateStep and ProfileStep. commit 2f3b2ac971ade400b4f2bfb794b451a612ed0f69 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-25T15:29:23Z out of my 24 hole. @dkuppitz -- I was wrong, you can't have XXXStep and XXXSideEffectStep as one in the same. I learned exactly what I learned about a year ago by trying. However, note that all GroupCountXXXStep and GroupXXXStep are no longer MapReduce-based, but GraphComputer.Memory-based and because of their alignment they share lots of the same code. All that is left is TreeStep, ProfileStep, and AggregateStep to convert to the GraphComputer.Memory model. commit aae8910388873f7a749b3df330fca05ce7b5d2eb Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-25T15:53:09Z Got rid of the FinalGet concept. This won't work moving forward. Each step is responsible for its final reduction via GraphComputing.generateFinalReduction(). TreeStep and TreeSideEffectStep now both use the GraphComputer.Memory model. All that is left is AggregateStep and ProfileStep. commit d1812f806cd360ce5716a620413a28938ef80ab6 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-25T17:40:40Z RangeGlobalStep is now computing in TraversalVertexProgram. Will slowly dissect away the end-steps of TraverserMapReduce such that we have a pure OLTP/OLAP model within TraversalVertexProgram. commit e6d9d6d2581dd3ef4c32d336df0eb7a8989bbaba Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-25T17:53:18Z TailGlobalStep is no longer an OLAP end step. CollectingBarriers are up next -- Aggregate, Order, Dedup. commit c4b9e741421d2af895bdf3c37166f952209d72d4 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-25T18:37:44Z All the ReducingBarrierSteps inherent their MemoryComputeKey from abstract ReducingBarrierStep. There key is now step.getId(), not longer ReducingBarrierStep.REDUCING as you can now have multiple reducers in the same OLAP job. Updated ComputerResultStep accordingly. Next up will be such that ComputerResultStep does not do ANY introspection into TraversalVertexProgramStep -- it simply pulls ~traversers from GraphComputer.Memory. And thats all there is to it. commit 6950b198aff0d1fb4842d972bc145a6eeae539b2 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-25T21:46:23Z Gremlin OLAP can now do all the same traversals as Gremlin OLTP. In fact, a Gremlin OLAP job is a undulation between distributed traversers and localized traversers. Its really neat -- it like 'breathes'. Fan out, reduce, fan out, reduce.... I still have to convert over OrderGlobalStep so its not done done, but yea. Saving work. commit 32dd068291e45179648340745c882df459a8115a Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-26T00:04:55Z Okay. Here is the mother load. OrderGlobalStep, DedupGlobalStep, etc. can now exist anywhere in an OLAP traversal. There are some loose ends that still need to be cleanup (as well as some major code reorg and compression), but this is the stuff. This has been a long time coming. This new GraphComputer.Memory model is sooooo much more efficient and gives us much more expressivity. Stoked. commit 066cc218bb90a27f1ee016b4db3016a630f00f63 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-26T01:17:16Z limit(1) is now compiled into the TraversalVertexProgramStep job. Forgot to update this test. commit 42b7254505563eaa1925639446cbb55ad708c7b7 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-26T15:45:14Z I totally forgot about @dkuppitz Operator.java work. I gutted lots of rewritten operators and now just uses Operator. This makes less classes to register with GryoMapper -- phew. It was getting insane. Also, in the future, if we want to make more optimal implementations, we can just add stuff like Operator.sumLong() and it doesn't effect serialization because Operator is an enum. commit efc1739f183ff8f30be54b53c393e3ba8d03ced9 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-26T16:33:33Z tada -- fixed my groupCount()...groupCount()... bug. Added a crazy asynchronous traversal to groupCount() that demonstrates that the timing is correct for OLAP to OLTP to OLAP conversion. I went through all the test case that have Ignore over a test and remove lots of them. Some didnt even make sense why we had them Ignored. There are only a few Ignore(COMPUTER) and they make sense... (not related to this PR, but in general issues with serialization or whatnot). This new OLAP work is sooooo slammin. commit 958791792b3e48b7c983fb5a54ae21671e1154f6 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-26T17:11:38Z renamed Operator.add() to Operator.addAll() as its more clear especially since Operator.sum() exists. ---- > Add Memory.reduce() as option to Memory implementations. > -------------------------------------------------------- > > Key: TINKERPOP-1166 > URL: https://issues.apache.org/jira/browse/TINKERPOP-1166 > Project: TinkerPop > Issue Type: Improvement > Components: hadoop, process, tinkergraph > Affects Versions: 3.1.2-incubating > Reporter: Marko A. Rodriguez > Labels: breaking > > Currently {{Memory}} supports {{incr}}, {{and}}, {{or}}, ... These are great > and what people will typically use. However, we should also provide the > generalization which is simply {{Memory.reduce}}. In this situation, > {{incr}}, {{or}}, {{and}}, etc. are just specifications of {{Memory.reduce}}. > How would it work? > When memory is initialized in a {{VertexProgram}}, it would be like this: > {code} > memory.set("myReduction", new MyReducingFunction(0)) > {code} > Then {{ReducingFunction}} would look like this: > {code} > public class ReducingFunction implements UnaryOperator<A> { > public A getInitialValue(); > public A apply(A first, A second); > } > {code} > Easy peasy. Note that both Spark and Giraph support such types of > function-based reduction in their respective "memory engines." > TinkerGraphComputer will, of course, be easy to add this functionality too. > Why do this? For two reasons: > 1. We get extra flexibility in {{Memory}}. > 2. https://issues.apache.org/jira/browse/TINKERPOP-1164 -- This message was sent by Atlassian JIRA (v6.3.4#6332)