GitHub user okram opened a pull request: https://github.com/apache/incubator-tinkerpop/pull/227
TINKERPOP-1140: TraversalVertexProgramStep in support of OLAP/OLTP conversions. https://issues.apache.org/jira/browse/TINKERPOP-1140 `Traversals` can now be composed of multiple OLAP jobs. This ticket in particular introduced `TraversalVertexProgramStep` which is responsible for executing a OLAP job and then sending its results to the next OLAP step (implementing a `VertexComputing`) or, it just drains the results serially to the next OLTP step. To demonstrate this working, I added `PageRankVertexProgramStep` (experimental for now -- a future commit will make all of this more explicit and include `PeerPressureVertexProgramStep`. This ticket is particular to `TraversalVertexProgramStep`). Here is the glory. ``` gremlin> g = TinkerFactory.createModern().traversal().withComputer() ==>graphtraversalsource[tinkergraph[vertices:6 edges:6], tinkergraphcomputer] gremlin> g.V().pageRank().order().by(PageRankVertexProgram.PAGE_RANK).valueMap() ==>[gremlin.pageRankVertexProgram.pageRank:[0.15000000000000002], name:[peter], gremlin.pageRankVertexProgram.edgeCount:[1.0], age:[35]] ==>[gremlin.pageRankVertexProgram.pageRank:[0.15000000000000002], name:[marko], gremlin.pageRankVertexProgram.edgeCount:[3.0], age:[29]] ==>[gremlin.pageRankVertexProgram.pageRank:[0.19250000000000003], name:[josh], gremlin.pageRankVertexProgram.edgeCount:[2.0], age:[32]] ==>[gremlin.pageRankVertexProgram.pageRank:[0.19250000000000003], name:[vadas], gremlin.pageRankVertexProgram.edgeCount:[0.0], age:[27]] ==>[gremlin.pageRankVertexProgram.pageRank:[0.23181250000000003], name:[ripple], gremlin.pageRankVertexProgram.edgeCount:[0.0], lang:[java]] ==>[gremlin.pageRankVertexProgram.pageRank:[0.4018125], name:[lop], gremlin.pageRankVertexProgram.edgeCount:[0.0], lang:[java]] gremlin> g.V().pageRank().order().by(PageRankVertexProgram.PAGE_RANK).valueMap().explain() ==>Traversal Explanation ======================================================================================================================================================================================================================================== Original Traversal [GraphStep([],vertex), PageRankVertexProgramStep([VertexStep(OUT,edge)]), OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))]), PropertyMapStep(value)] ConnectiveStrategy [D] [GraphStep([],vertex), PageRankVertexProgramStep([VertexStep(OUT,edge)]), OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))]), PropertyMapStep(value)] VertexProgramStrategy [D] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] IdentityRemovalStrategy [O] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] FilterRankingStrategy [O] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] IncidentToAdjacentStrategy [O] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] AdjacentToIncidentStrategy [O] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] MatchPredicateStrategy [O] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] RangeByIsCountStrategy [O] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] TinkerGraphStepStrategy [P] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] ProfileStrategy [F] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] StandardVerificationStrategy [V] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] ComputerVerificationStrategy [V] [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] Final Traversal [PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] gremlin> gremlin> g.V().pageRank().order().by(PageRankVertexProgram.PAGE_RANK).valueMap().toString() ==>[GraphStep([],vertex), PageRankVertexProgramStep([VertexStep(OUT,edge)]), OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))]), PropertyMapStep(value)] gremlin> g.V().pageRank().order().by(PageRankVertexProgram.PAGE_RANK).valueMap().iterate().toString() ==>[PageRankVertexProgramStep([VertexStep(OUT,edge)]), TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]), ComputerResultStep, PropertyMapStep(value)] ``` CHANGELOG: ``` * Fixed a bug in both `SparkGraphComputer` and `GiraphGraphComputer` regarding source data access in job chains. * Significantly extended job chaining test coverage for `GraphComputer` providers. * Added `TraversalHelper.onGraphComputer(traversal)`. * `MapReduce.map()` no longer has a default implementation. This method must be implemented. * `TraversalVertexProgram` can work without a `GraphStep` start. * Fixed a severe bug in `TraverserMapReduce.combine()`. * Added `PageRankVertexProgramStep` and `GraphTraversal.pageRank()`. * Simplified the comparator model in `OrderGlobalStep` and `OrderLocalStep`. ``` The only note for the upgrade docs will be that `MapReduce.map()` no longer have a default implementation. However, if you didn't implement it, your `MapReduce` job would instantly fail so I don't think anyone has code like that. Oh, and `TraversalHelper.onGraphComputer()` will replace the "stub method" I had in the previous PR with `TraversalStrategies.onGraphComputer()`. Thus, minor nothings to upgrade docs. *** NOTE: I will add `pageRank()` to the reference doc in a future 3.2.0 commit when I do another ticket that is more specific to `PageRankVertexProgramStep` and `PeerPressureVertexProgram`. VOTE +1. `mvn clean install` and integration tests -- (though Giraph is currently still running -- but I have isolated and verified the two tests I added to the `ProcessComputerSuite`). You can merge this pull request into a Git repository by running: $ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1140 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-tinkerpop/pull/227.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #227 ---- commit 64cd1b18df1170e96d61c0e68cc6495d32c7fa3e Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-09T22:29:31Z TraversalVertexProgramStep created. It works. Sorta --- need to get ComputerVerificationStrategy more aware. 95 percent of tests pass -- the ones that fail have to do with deep nested traverasls and ComputerVerificationStrategy not catching them. Anywho, once this is solid, the door is open for PageRankVertexProgramStep. Going to get crazy here soon. Also, ComputerResultStep now simply traversifies ComputerResults. Really elegant. commit f2dbb0f2094bb7044e3c1f8317fc2153c9010b2f Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-10T18:11:35Z Have TraversalVertexProgramStep working fully. I had to Ignore tests in ExplainTest and ProfileTest as OLTP and OLAP now no longer compile to the same representation. This is both good and bad. Its good because that means that a Traversal is not globally OLAP or globally OLTP. It can have OLAP and OLTP sections. Its bad in that features like Profile and Explain need to be smart about mixed-engine traversals. I am still far from finished ... just want to push to save work. commit e5c3d784e2db06febd4da1d079712336818fd0ca Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-10T18:59:04Z fixed a bug around Serialization of GraphComputer. Integration tests now pass. Cool. commit 6cbd8a5da0b8dad4f675c586a809e661a902076b Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-10T19:36:20Z Merge branch 'TINKERPOP-971' into TINKERPOP-1140 commit 080863a5837d94417b9f9fc17d0476defab56e63 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-10T22:23:02Z okay. TraversalVertexProgramStep is just like any other step. Nothing 'special'. Its a TraversalParent with one global child -- the traversal to be executed on the GraphComputer. I got rid of TraversalSideEffects.onGraphComputer() as again, a traversal can now have OLTP and OLAP components. In its place, we have TraversalHelper.onGraphComputer(traversal) which will tell you if a parent of the traversal is a TraversalVertexProgramStep. this model is really really nice as it all falls within the natural construts of our Traversal API -- no new interfaces needed, no weird two sets of strategies, no weird two sets of sideEffects... Its really clean. commit 3afc1894a4805e82ad076535697add3baac91c6c Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-11T01:41:07Z Stubbed PageRankVertexProgramStep. Cleaned up and optimized a few tid bits here and there. Its going to get nasty tomorrow. commit a4cf1514d95dfd873675f08e6fd5d08aea810992 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-11T19:39:30Z Okay. Here is is. g.V().pageRank().order.by(pageRank).name.limit(10). We know support chained OLAP jobs within a single Traversal. This is an amazing piece of code. There is still lots more to do, but the basic framework is working and is sound. Note that we were able to loosen up lots of ComputerVerificaitonStrategy restrictions. Moreover, over the next couple of pushes, we will loosen up even more. The OrderStep comparator stuff was a rats nest that I have since cleaned up ... sometimes it wants a traverser, sometimes an object...random. I can't believe we haven't seen more bugs cause of what we had. commit e2e38e691ee20ba697c2e737e678376a10f87da2 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-11T19:46:28Z Merge branch 'master' into TINKERPOP-1140 commit 17bfdb82ce89e24295940ad1926ae95e17eea821 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-11T21:32:30Z Was really dumb about SparkMessenger and GiraphMessenger. No need to insert a start step. Weird. ComputerVerificationStrategy can now recognize when edge properties are being manipulated in MapReduce. Added TraveralHelper.getLastElementClass(traversal). Neato. Fixed up FileSystem access in SparkGraphComputer. I think we have a bug in 3.1.2 that makes chaining files bad. Weird we don't have a test for that. However, most people don't chain vertex programs so I suspect no one will notice. 3.2.0 have it down pat as you need solid chaining the new multi-OLAP compiler to work. commit 66d86cdfbb00bf1eb74f067815181f90b40d6feb Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-11T21:53:00Z there is a bug in GiraphGraphComputer when chaining jobs together. damn. I have things a bit cleaner. committing what I have for now. commit cd1a37946848d28ad33d464fedac1655cadd5320 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-12T18:57:05Z Found a couple of bugs around job chaining in all GraphComputer implementations. Wrote a solid GraphComputerTest to ensure job chaining behaves as expected. Still a few more things to handle -- the Traversal.pageRank() tests in Giraph don't pass and I'm not sure why. commit 0a2243037792854c6ef9f834f41785e7a728ee7e Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-12T21:02:46Z Found a bug in TraverserMapReduce where the combiner should not just call reduce(). Only Giraph was able to expose it because it actually calls combine(). The test dataset is so small that Spark doesn't even kick it off. TinkerGraph, because its in memory, doesn't use combine. Wow. That was a two day bug. However, I have written so many more test cases. Things are really pretty. And that ends my week with g.V.pageRank().order().by().limit(10) working. Phew. commit 491bef66112d75d42cde87ae000e63587e9b3bfe Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-12T22:15:18Z added another PageRankTest test case and made MapReduce.map() a non-default implementation so users know they have to implement that method. commit d627f9c06cd24b9c12d0938e3ebff20a39ae33b9 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-02-12T22:31:12Z minor nothing. ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---