[ https://issues.apache.org/jira/browse/TINKERPOP-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145454#comment-15145454 ]
ASF GitHub Bot commented on TINKERPOP-1140: ------------------------------------------- GitHub user okram opened a pull request: https://github.com/apache/incubator-tinkerpop/pull/227 TINKERPOP-1140: TraversalVertexProgramStep in support of OLAP/OLTP conversions. https://issues.apache.org/jira/browse/TINKERPOP-1140 `Traversals` can now be composed of multiple OLAP jobs. This ticket in particular introduced `TraversalVertexProgramStep` which is responsible for executing a OLAP job and then sending its results to the next OLAP step (implementing a `VertexComputing`) or, it just drains the results serially to the next OLTP step. To demonstrate this working, I added `PageRankVertexProgramStep` (experimental for now -- a future commit will make all of this more explicit and include `PeerPressureVertexProgramStep`. This ticket is particular to `TraversalVertexProgramStep`). Here is the glory. ``` gremlin> g = TinkerFactory.createModern().traversal().withComputer() ==>graphtraversalsource[tinkergraph[vertices:6 edges:6], tinkergraphcomputer] gremlin> g.V().pageRank().order().by(PageRankVertexProgram.PAGE_RANK).valueMap() ==>[gremlin.pageRankVertexProgram.pageRank:[0.15000000000000002], name:[peter], gremlin.pageRankVertexProgram.edgeCount:[1.0], age:[35]] ==>[gremlin.pageRankVertexProgram.pageRank:[0.15000000000000002], name:[marko], gremlin.pageRankVertexProgram.edgeCount:[3.0], age:[29]] ==>[gremlin.pageRankVertexProgram.pageRank:[0.19250000000000003], name:[josh], gremlin.pageRankVertexProgram.edgeCount:[2.0], age:[32]] ==>[gremlin.pageRankVertexProgram.pageRank:[0.19250000000000003], name:[vadas], gremlin.pageRankVertexProgram.edgeCount:[0.0], age:[27]] ==>[gremlin.pageRankVertexProgram.pageRank:[0.23181250000000003], name:[ripple], gremlin.pageRankVertexProgram.edgeCount:[0.0], lang:[java]] ==>[gremlin.pageRankVertexProgram.pageRank:[0.4018125], name:[lop], gremlin.pageRankVertexProgram.edgeCount:[0.0], lang:[java]] gremlin> g.V().pageRank().order().by(PageRankVertexProgram.PAGE_RANK).valueMap().explain() ==>Traversal Explanation ======================================================================================================================================================================================================================================== Original Traversal [GraphStep([],vertex), PageRankVertexProgramStep([VertexStep(OUT,edge)]), OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))]), PropertyMapStep(value)] ConnectiveStrategy [D] [GraphStep([],vertex), PageRankVertexProgramStep([VertexStep(OUT,edge)]), OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))]), PropertyMapStep(value)] VertexProgramStrategy [D] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] IdentityRemovalStrategy [O] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] FilterRankingStrategy [O] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] IncidentToAdjacentStrategy [O] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] AdjacentToIncidentStrategy [O] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] MatchPredicateStrategy [O] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] RangeByIsCountStrategy [O] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] TinkerGraphStepStrategy [P] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] ProfileStrategy [F] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] StandardVerificationStrategy [V] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] ComputerVerificationStrategy [V] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] Final Traversal [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] gremlin> gremlin> g.V().pageRank().order().by(PageRankVertexProgram.PAGE_RANK).valueMap().toString() ==>[GraphStep([],vertex), PageRankVertexProgramStep([VertexStep(OUT,edge)]), OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))]), PropertyMapStep(value)] gremlin> g.V().pageRank().order().by(PageRankVertexProgram.PAGE_RANK).valueMap().iterate().toString() ==>[PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] ``` CHANGELOG: ``` * Fixed a bug in both `SparkGraphComputer` and `GiraphGraphComputer` regarding source data access in job chains. * Significantly extended job chaining test coverage for `GraphComputer` providers. * Added `TraversalHelper.onGraphComputer(traversal)`. * `MapReduce.map()` no longer has a default implementation. This method must be implemented. * `TraversalVertexProgram` can work without a `GraphStep` start. * Fixed a severe bug in `TraverserMapReduce.combine()`. * Added `PageRankVertexProgramStep` and `GraphTraversal.pageRank()`. * Simplified the comparator model in `OrderGlobalStep` and `OrderLocalStep`. ``` The only note for the upgrade docs will be that `MapReduce.map()` no longer have a default implementation. However, if you didn't implement it, your `MapReduce` job would instantly fail so I don't think anyone has code like that. Oh, and `TraversalHelper.onGraphComputer()` will replace the "stub method" I had in the previous PR with `TraversalStrategies.onGraphComputer()`. Thus, minor nothings to upgrade docs. *** NOTE: I will add `pageRank()` to the reference doc in a future 3.2.0 commit when I do another ticket that is more specific to `PageRankVertexProgramStep` and `PeerPressureVertexProgram`. VOTE +1. `mvn clean install` and integration tests -- (though Giraph is currently still running -- but I have isolated and verified the two tests I added to the `ProcessComputerSuite`). You can merge this pull request into a Git repository by running: $ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1140 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-tinkerpop/pull/227.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #227 ---- commit 64cd1b18df1170e96d61c0e68cc6495d32c7fa3e Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-09T22:29:31Z TraversalVertexProgramStep created. It works. Sorta --- need to get ComputerVerificationStrategy more aware. 95 percent of tests pass -- the ones that fail have to do with deep nested traverasls and ComputerVerificationStrategy not catching them. Anywho, once this is solid, the door is open for PageRankVertexProgramStep. Going to get crazy here soon. Also, ComputerResultStep now simply traversifies ComputerResults. Really elegant. commit f2dbb0f2094bb7044e3c1f8317fc2153c9010b2f Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-10T18:11:35Z Have TraversalVertexProgramStep working fully. I had to Ignore tests in ExplainTest and ProfileTest as OLTP and OLAP now no longer compile to the same representation. This is both good and bad. Its good because that means that a Traversal is not globally OLAP or globally OLTP. It can have OLAP and OLTP sections. Its bad in that features like Profile and Explain need to be smart about mixed-engine traversals. I am still far from finished ... just want to push to save work. commit e5c3d784e2db06febd4da1d079712336818fd0ca Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-10T18:59:04Z fixed a bug around Serialization of GraphComputer. Integration tests now pass. Cool. commit 6cbd8a5da0b8dad4f675c586a809e661a902076b Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-10T19:36:20Z Merge branch 'TINKERPOP-971' into TINKERPOP-1140 commit 080863a5837d94417b9f9fc17d0476defab56e63 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-10T22:23:02Z okay. TraversalVertexProgramStep is just like any other step. Nothing 'special'. Its a TraversalParent with one global child -- the traversal to be executed on the GraphComputer. I got rid of TraversalSideEffects.onGraphComputer() as again, a traversal can now have OLTP and OLAP components. In its place, we have TraversalHelper.onGraphComputer(traversal) which will tell you if a parent of the traversal is a TraversalVertexProgramStep. this model is really really nice as it all falls within the natural construts of our Traversal API -- no new interfaces needed, no weird two sets of strategies, no weird two sets of sideEffects... Its really clean. commit 3afc1894a4805e82ad076535697add3baac91c6c Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-11T01:41:07Z Stubbed PageRankVertexProgramStep. Cleaned up and optimized a few tid bits here and there. Its going to get nasty tomorrow. commit a4cf1514d95dfd873675f08e6fd5d08aea810992 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-11T19:39:30Z Okay. Here is is. g.V().pageRank().order.by(pageRank).name.limit(10). We know support chained OLAP jobs within a single Traversal. This is an amazing piece of code. There is still lots more to do, but the basic framework is working and is sound. Note that we were able to loosen up lots of ComputerVerificaitonStrategy restrictions. Moreover, over the next couple of pushes, we will loosen up even more. The OrderStep comparator stuff was a rats nest that I have since cleaned up ... sometimes it wants a traverser, sometimes an object...random. I can't believe we haven't seen more bugs cause of what we had. commit e2e38e691ee20ba697c2e737e678376a10f87da2 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-11T19:46:28Z Merge branch 'master' into TINKERPOP-1140 commit 17bfdb82ce89e24295940ad1926ae95e17eea821 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-11T21:32:30Z Was really dumb about SparkMessenger and GiraphMessenger. No need to insert a start step. Weird. ComputerVerificationStrategy can now recognize when edge properties are being manipulated in MapReduce. Added TraveralHelper.getLastElementClass(traversal). Neato. Fixed up FileSystem access in SparkGraphComputer. I think we have a bug in 3.1.2 that makes chaining files bad. Weird we don't have a test for that. However, most people don't chain vertex programs so I suspect no one will notice. 3.2.0 have it down pat as you need solid chaining the new multi-OLAP compiler to work. commit 66d86cdfbb00bf1eb74f067815181f90b40d6feb Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-11T21:53:00Z there is a bug in GiraphGraphComputer when chaining jobs together. damn. I have things a bit cleaner. committing what I have for now. commit cd1a37946848d28ad33d464fedac1655cadd5320 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-12T18:57:05Z Found a couple of bugs around job chaining in all GraphComputer implementations. Wrote a solid GraphComputerTest to ensure job chaining behaves as expected. Still a few more things to handle -- the Traversal.pageRank() tests in Giraph don't pass and I'm not sure why. commit 0a2243037792854c6ef9f834f41785e7a728ee7e Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-12T21:02:46Z Found a bug in TraverserMapReduce where the combiner should not just call reduce(). Only Giraph was able to expose it because it actually calls combine(). The test dataset is so small that Spark doesn't even kick it off. TinkerGraph, because its in memory, doesn't use combine. Wow. That was a two day bug. However, I have written so many more test cases. Things are really pretty. And that ends my week with g.V.pageRank().order().by().limit(10) working. Phew. commit 491bef66112d75d42cde87ae000e63587e9b3bfe Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-12T22:15:18Z added another PageRankTest test case and made MapReduce.map() a non-default implementation so users know they have to implement that method. commit d627f9c06cd24b9c12d0938e3ebff20a39ae33b9 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-12T22:31:12Z minor nothing. ---- > TraversalVertexProgramStep in support of OLAP/OLTP conversions. > --------------------------------------------------------------- > > Key: TINKERPOP-1140 > URL: https://issues.apache.org/jira/browse/TINKERPOP-1140 > Project: TinkerPop > Issue Type: Improvement > Components: process > Affects Versions: 3.1.0-incubating, 3.1.1-incubating, 3.1.2-incubating > Reporter: Marko A. Rodriguez > Fix For: 3.2.0-incubating > > > This was moved from TINKERPOP-971. TINKERPOP-971 was about TraversalSource > fluency. This is a feature we can now do because of that work. Isolated this > feature in a new ticket. > -------------- > Check this idea out: > {code} > g = graph.traversal().withComputer(graph -> > graph.compute(SparkGraphComputer.class).workers(10)); > g.V().out().values('name') > {code} > This would compile to: > {code} > [TraversalVertexProgramStep([GraphStep,VerticesStep,PropertiesStep]),ComputerResultStep] > {code} > where, > {code} > TraversalVertexProgramStep<Graph,ComputerResult> > ComputerResultStep<ComputerResult,E> > {code} > Next, check this: > {code} > g = graph.traversal().withComputer(graph -> > graph.compute(SparkGraphComputer.class).workers(10)); > g.V().hasLabel('person').pageRank(out('knows')).order().by('page.rank',decr) > {code} > This will compile to: > {code} > [TraversalVertexProgramStep([GraphStep,HasStep]),PageRankVertexProgramStep([VerticesStep]),TraversalVertexProgramStep([OrderStep]),ComputerResultStep] > {code} > How does this work?! > * TraversalVertexProgramStep will pass a {{ComputerResult}} to > PageRankVertexProgram. > ** The ComputerResult.graph() will have HALTED_TRAVERSERS properties on all > the person vertices (as that is what was computed by the vertex program). > * PageRankVertexProgram will then pass a {{ComputerResult}} to the next > TraversalVertexProgram. > ** That ComputerResult.graph() will have HALTED_TRAVERSER on all the person > vertices and PAGE_RANK on all the vertices. > * TraversalVertexProgram will then execute OrderStep which sorts the > person-vertices with HALTED_TRAVERSERS on them by their page.rank properties > computed previously. > * ComputerResultStep will then take get the ComputerResult.memory.reducing > and iterate it out. > {{ComputerResultStep}} (like now) is smart about either pulling from the > graph or from the sideEffect memory. What makes things different from now is > that {{TraversalVertexProgramStep}} will come into existence and pull out the > respective logic from {{ComputerResultStep}}. In this way, It will be > possible to chain {{{XXXVertexProgramStep}}-steps and thus, have a traversal > that is multiple OLAP jobs in sequence and thus, this ticket will fall > naturally from this https://issues.apache.org/jira/browse/TINKERPOP-570. The > whole trick to all of this is that (as we currently do) we save the state of > the computation in the graph (and memory) and thus, feeding program into the > next just takes over the computation. PageRankVertexProgram doesn't use > HALTED_TRAVERSERS in its computation, so it just ignores it. However, > TraversalVertexProgram can later access PAGE_RANK and thus, use the results > of the previous OLAP computation. > Finally, you want "lambda vertex programs?" > {code} > g = graph.traversal().withComputer(graph -> > graph.compute(SparkGraphComputer.class).workers(10)); > myProgram = MyVertexProgram.build().create(); > g.V().program(myProgram).values('myProgramCounters') > {code} > which compiles to: > {code} > [TraversalVertexProgramStep([GraphStep]),LambdaVertexProgramStep(MyVertexProgram),TraversalVertexProgramStep([PropertiesStep]),ComputerResultStep] > {code} > Thus, just like {{filter()}},{{flatMap()}}, etc.... -- This message was sent by Atlassian JIRA (v6.3.4#6332)