Dar. I knew there was a #3. You can never list things with bold itemizations 
and NOT have a #3.

3. Traversers as sinks to subsequent steps.

This falls from (2) and can be summed up as such:

g.V().has('name','UCSC').in('attended').
  pageRank().by(bothE('worksWith')).times(3).
  order().by('pageRank').limit(10).values('name')

Question: What are the vertices being ordered and limited whose names are 
returned? Algorithmically, before executing PageRankVertexProgramStep, there 
will be traversers at the UCSC attendees (HALTED_TRAVERSERS). Then PageRank 
executes. After that, all vertices "worksWith"-edges connected to the UCSC 
attendee sources will have PageRank values.  What is ultimately emitted -- we 
have two choices:

        1. The top 10 PageRank'd UCSC attendees within the "worksWith" subgraph?
                - revive the HALTED_TRAVERSERS from the previous 
TraversalVertexProgramStep execution.
        2. The top 10 PageRank'd vertices within the "worksWith" subgraph?
                - put traversers only at the vertices with energy from the 
PageRankVertexProgramStep.

I sort of like the #1 because you can also make #2 work by doing:

g.V().has('name','UCSC').in('attended').
  pageRank().by(bothE('worksWith')).times(3).
  V().order().by('pageRank').limit(10).values('name')   

Notice that for all the UCSC attendees leaving PageRankVertexProgramStep, we 
simply jump back to the global set of vertices and order everyone by their 
pageRank.

In essence, are the traversers that enter a XXXVertexProgramStep the traversers 
that exit a XXXVertexProgramStep. For TraversalVertexProgramStep its like that. 
For PageRankVertexProgram step, it can make sense and actually yield 
interesting ramifications such as:

g.V().has('name','UCSC').in('attended').
  pageRank().by(bothE('worksWith')).times(3).
  V().has('title','Ph.D.').order().by('pageRank').limit(10).values('name')      

I think a decision that is consistent is important so that people always know 
the traversers going in are X and the result is always like Y. If we get too 
crazy with each XXXVertexProgramStep doing different things with chained OLAP 
steps, it might lead to language confusions.

Decisions, decisions, decisions….

Marko.

http://markorodriguez.com

On Feb 13, 2016, at 9:25 AM, Marko Rodriguez <okramma...@gmail.com> wrote:

> Hi,
> 
> TinkerPop 3.2.0 boasts the ability to have multiple OLAP/OLTP jobs within a 
> single Traversal instance. The following tickets have been closed and merged 
> to master/ (TinkerPop 3.2.0-SNAPSHOT).
> 
>       https://issues.apache.org/jira/browse/TINKERPOP-1140
>       https://issues.apache.org/jira/browse/TINKERPOP-971
>       https://issues.apache.org/jira/browse/TINKERPOP-962
> 
> Instead of going through the gnarly details, I will explain with a simple 
> example:
> 
> gremlin> g = TinkerFactory.createModern().traversal().withComputer()
> ==>graphtraversalsource[tinkergraph[vertices:6 edges:6], tinkergraphcomputer]
> gremlin> g.V().pageRank().order().by(PAGE_RANK).valueMap()
> ==>[gremlin.pageRankVertexProgram.pageRank:[0.15000000000000002], 
> name:[peter], age:[35]]
> ==>[gremlin.pageRankVertexProgram.pageRank:[0.15000000000000002], 
> name:[marko], age:[29]]
> ==>[gremlin.pageRankVertexProgram.pageRank:[0.19250000000000003], 
> name:[josh], age:[32]]
> ==>[gremlin.pageRankVertexProgram.pageRank:[0.19250000000000003], 
> name:[vadas], age:[27]]
> ==>[gremlin.pageRankVertexProgram.pageRank:[0.23181250000000003], 
> name:[ripple], lang:[java]]
> ==>[gremlin.pageRankVertexProgram.pageRank:[0.4018125], name:[lop], 
> lang:[java]]
> gremlin>
> gremlin> 
> g.V().pageRank().order().by(PageRankVertexProgram.PAGE_RANK).valueMap().toString()
> ==>[GraphStep([],vertex), PageRankVertexProgramStep([VertexStep(OUT,edge)]), 
> OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))]), 
> PropertyMapStep(value)]
> gremlin>
> gremlin> 
> g.V().pageRank().order().by(PageRankVertexProgram.PAGE_RANK).valueMap().iterate().toString()
> ==>[PageRankVertexProgramStep([VertexStep(OUT,edge)]), 
> TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]),
>  ComputerResultStep, PropertyMapStep(value)]
> 
> As you can see from the compilation we have 2 OLAP jobs and one OLTP job in a 
> single Traversal! TraversalVertexProgramStep was always what an OLAP Gremlin 
> traversal was, but now its a step in and of itself just like any other step 
> in on the Gremlin machine.
> 
> OLAP [PageRankVertexProgramStep([VertexStep(OUT,edge)]), 
> OLAP  
> TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]),
>  
> OLTP  ComputerResultStep, PropertyMapStep(value)]
> 
> There is still more work to be done/tweaked in the area for 3.2.0. This is 
> what this email [DISCUSS] is for. There are some decisions we can make and I 
> would like people's thoughts on the matter before we make them:
> 
> ------------------------------------------------------------------------
> 
> 1. Parameterizing OLAP steps.
> 
> What about the following parameterization below:
> 
> g.V().pageRank(0.85).times(20).by(outE('knows')).by('page.rank')
> 
> This says I want a PageRankVertexProgram executed with an alpha parameter at 
> 0.85, to iterate 20 times, using "knows" edges for the energy diffusion, and 
> the property to save the result to on the vertex being "page.rank." As you 
> may know "times" and "by" are step-modulators. However, when by(string) is 
> used, it currently compiles to by(values(string).limit(1)). If we go down 
> this road of adding more VertexPrograms, more by-modulations, I think we need 
> to make a new interface called ByModulating that allows the step to decide 
> (not the Traversal) what by(traversal), by(string), by(number), by(function), 
> etc. mean to it. Likewise, we would need TimesModulating where RepeatStep and 
> PageRankVertexProgramStep will do two different things with that information. 
> Thoughts?
> 
> 2. Traversers as source points to OLAP steps.
> 
> Imagine the following traversal:
> 
> g.V().has('name','UCSC').in('attended').pageRank().by(bothE('worksWith'))
> 
> What this means to me is the initial pageRank energy will start at all people 
> who attended the University of California at Santa Cruz and then diffuse by 
> worksWith-edges. You may say -- "but Marko, regardless of the initial 
> distribution, PageRank always converges to a stable state distribution." To 
> which I say, "now include times(3)"-- that is, only iterate the energy 
> 3-steps. Now you have biased-PageRank also (kinda sorta) like 
> PageRank-priors. This is great for recommendation engines as you aren't 
> identifying a global energy distribution, but a local one (rooted at the 
> energy source). Do people like the idea of PageRank being biased by the 
> initial traverser set? Moreover, imagine someone attended UCSC twice (lets 
> say for this example), double the energy for them?! There are other OLAP 
> algorithms that can leverage this -- think PeerPressureVertexProgram (more 
> VOTE_WEIGHT by the sources).
> 
> What are people's thoughts on the matter and what ideas do you have to make 
> Traversal-OLAP all the better in TinkerPop 3.2.0.
> 
> Thanks everyone,
> Marko.
> 
> http://markorodriguez.com
> 

Reply via email to