[DISCUSS] How do you want your OLAP?

Marko Rodriguez Sat, 13 Feb 2016 08:26:13 -0800

Hi,

TinkerPop 3.2.0 boasts the ability to have multiple OLAP/OLTP jobs within a 
single Traversal instance. The following tickets have been closed and merged to 
master/ (TinkerPop 3.2.0-SNAPSHOT).


        https://issues.apache.org/jira/browse/TINKERPOP-1140
        https://issues.apache.org/jira/browse/TINKERPOP-971
        https://issues.apache.org/jira/browse/TINKERPOP-962

Instead of going through the gnarly details, I will explain with a simple 
example:

gremlin> g = TinkerFactory.createModern().traversal().withComputer()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], tinkergraphcomputer]
gremlin> g.V().pageRank().order().by(PAGE_RANK).valueMap()
==>[gremlin.pageRankVertexProgram.pageRank:[0.15000000000000002], name:[peter], 
age:[35]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.15000000000000002], name:[marko], 
age:[29]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.19250000000000003], name:[josh], 
age:[32]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.19250000000000003], name:[vadas], 
age:[27]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.23181250000000003], 
name:[ripple], lang:[java]]
==>[gremlin.pageRankVertexProgram.pageRank:[0.4018125], name:[lop], lang:[java]]
gremlin>
gremlin> 
g.V().pageRank().order().by(PageRankVertexProgram.PAGE_RANK).valueMap().toString()
==>[GraphStep([],vertex), PageRankVertexProgramStep([VertexStep(OUT,edge)]), 
OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))]), 
PropertyMapStep(value)]
gremlin>
gremlin> 
g.V().pageRank().order().by(PageRankVertexProgram.PAGE_RANK).valueMap().iterate().toString()
==>[PageRankVertexProgramStep([VertexStep(OUT,edge)]), 
TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]),
 ComputerResultStep, PropertyMapStep(value)]

As you can see from the compilation we have 2 OLAP jobs and one OLTP job in a 
single Traversal! TraversalVertexProgramStep was always what an OLAP Gremlin 
traversal was, but now its a step in and of itself just like any other step in 
on the Gremlin machine.

OLAP [PageRankVertexProgramStep([VertexStep(OUT,edge)]), 
OLAP  
TraversalVertexProgramStep([OrderGlobalStep([incr(value(gremlin.pageRankVertexProgram.pageRank))])]),
 
OLTP  ComputerResultStep, PropertyMapStep(value)]

There is still more work to be done/tweaked in the area for 3.2.0. This is what 
this email [DISCUSS] is for. There are some decisions we can make and I would 
like people's thoughts on the matter before we make them:

------------------------------------------------------------------------

1. Parameterizing OLAP steps.

What about the following parameterization below:

g.V().pageRank(0.85).times(20).by(outE('knows')).by('page.rank')

This says I want a PageRankVertexProgram executed with an alpha parameter at 
0.85, to iterate 20 times, using "knows" edges for the energy diffusion, and 
the property to save the result to on the vertex being "page.rank." As you may 
know "times" and "by" are step-modulators. However, when by(string) is used, it 
currently compiles to by(values(string).limit(1)). If we go down this road of 
adding more VertexPrograms, more by-modulations, I think we need to make a new 
interface called ByModulating that allows the step to decide (not the 
Traversal) what by(traversal), by(string), by(number), by(function), etc. mean 
to it. Likewise, we would need TimesModulating where RepeatStep and 
PageRankVertexProgramStep will do two different things with that information. 
Thoughts?

2. Traversers as source points to OLAP steps.

Imagine the following traversal:

g.V().has('name','UCSC').in('attended').pageRank().by(bothE('worksWith'))

What this means to me is the initial pageRank energy will start at all people 
who attended the University of California at Santa Cruz and then diffuse by 
worksWith-edges. You may say -- "but Marko, regardless of the initial 
distribution, PageRank always converges to a stable state distribution." To 
which I say, "now include times(3)"-- that is, only iterate the energy 3-steps. 
Now you have biased-PageRank also (kinda sorta) like PageRank-priors. This is 
great for recommendation engines as you aren't identifying a global energy 
distribution, but a local one (rooted at the energy source). Do people like the 
idea of PageRank being biased by the initial traverser set? Moreover, imagine 
someone attended UCSC twice (lets say for this example), double the energy for 
them?! There are other OLAP algorithms that can leverage this -- think 
PeerPressureVertexProgram (more VOTE_WEIGHT by the sources).

What are people's thoughts on the matter and what ideas do you have to make 
Traversal-OLAP all the better in TinkerPop 3.2.0.

Thanks everyone,
Marko.

http://markorodriguez.com

[DISCUSS] How do you want your OLAP?

Reply via email to