Hello,
I've been working on a sql-to-gremlin compiler and have a question about
the latest 3.1.0-SNAPSHOT updates to sack, specifically the merge bits. As
I was writing a bunch of essentially recursive tree-walking code to
translate a Calcite logical plan into Gremlin, I thought to myself: maybe
this would be cleaner if I loaded the logical plan into a TinkerGraph and
then wrote some Gremlin to traverse the plan and, in turn, produce the
compiled Gremlin (incidentally, this got me thinking about how cool it
would be if I could interact directly with the current object graph without
the intermediate step of loading it into a separate graph implementation,
i.e. a TP3-enabled JVM interface). In addition to readability, this has the
added bonus of being kind of meta: Gremlin begets Gremlin. I started down
this path, and I think it would have been nasty without the 3.1 sack
updates, since there previously wasn't a clean way to define what happens
on a merge. Here's a basic example of what I'm thinking; maybe I'm off
base, though.
Given the following SQL query against the Northwind schema (imagine the
graph version rather than the relational version sitting behind it, e.g.
sql2gremlin):
select * from customer c
inner join country co on c.country_id = co.country_id
where c.name = 'United States'
Calcite produces the following logical plan, which is basically the parsed
query plus the transformations applied by Calcite's optimizer. I don't
have many rules turned on, so you'll see here that the filters aren't
pushed down below the join, but for example purposes this should do it.
EnumerableProject(CUSTOMER_ID=[$0], ORDER_ID=[$1], COUNTRY_ID=[$2], REGION_ID=[$3], NAME=[$4], COUNTRY_ID0=[$5], NAME0=[$6])
  EnumerableFilter(condition=[=(CAST($4):VARCHAR(3) CHARACTER SET "ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary", 'United States')])
    EnumerableFilter(condition=[=($2, $2)])
      EnumerableJoin(condition=[true], joinType=[inner])
        GremlinToEnumerableConverter
          GremlinTableScan(table=[[gremlin, CUSTOMER]])
        GremlinToEnumerableConverter
          GremlinTableScan(table=[[gremlin, COUNTRY]])
That has a lot of cruft not pertinent to the core of my question, but the
salient bit is that the plan is internally represented as an object graph
of "relation nodes". I've loaded these relation nodes (rel nodes) into a
TinkerGraph, and my basic idea is to start down at what is called the
GremlinTableScan (you could think of this as the most basic retrieval of
all vertices filtered by a given label), and then work my way upwards,
using the sack to hold the traversals as I build them. For example, you
start with the equivalent of retrieving all vertices with label 'x'; then
you hit a filter, so you add a 'has'. When a "join" node is hit, the sacks
would be merged by taking the incoming sack traversals and generating a
match statement. An initial cut of this may nest matches within matches,
but I think it could fairly easily be made to produce one uber-match
statement instead. Here is some example code to load a similar test graph
into a TinkerGraph:
graph = TinkerGraph.open()
scan1 = graph.addVertex(label, "relNode", "relType", "scan")
scan2 = graph.addVertex(label, "relNode", "relType", "scan")
filter1 = graph.addVertex(label, "relNode", "relType", "filter")
filter2 = graph.addVertex(label, "relNode", "relType", "filter")
join = graph.addVertex(label, "relNode", "relType", "join")
project = graph.addVertex(label, "relNode", "relType", "project")
// "hasInput" edges point from each rel node down to its input(s)
project.addEdge("hasInput", join)
join.addEdge("hasInput", filter1)
join.addEdge("hasInput", filter2)
filter1.addEdge("hasInput", scan1)
filter2.addEdge("hasInput", scan2)
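For comparison, here's the shape of the recursive walk I'm trying to
replace, sketched in plain Python (the dict-based node layout and the
compile_node name are my own invention for illustration, not Calcite's or
TinkerPop's API): each branch accumulates its relTypes bottom-up, and the
join merges its inputs' fragments into a list of lists.

```python
def compile_node(node):
    """Walk the rel-node tree bottom-up, accumulating relTypes per branch.

    Returns a list of per-branch fragment lists; a join appends itself to
    each incoming branch and merges the branches into one list of lists.
    """
    rel_type = node["relType"]
    inputs = node.get("inputs", [])
    if rel_type == "scan":
        # leaf: a single branch starts here
        return [["scan"]]
    if rel_type == "join":
        # merge point: tag each incoming branch, then combine them
        merged = []
        for child in inputs:
            for branch in compile_node(child):
                merged.append(branch + ["join"])
        return merged
    # unary nodes (filter, project) extend every branch below them
    return [branch + [rel_type] for branch in compile_node(inputs[0])]

# Same test tree as the TinkerGraph example above.
scan1 = {"relType": "scan"}
scan2 = {"relType": "scan"}
filter1 = {"relType": "filter", "inputs": [scan1]}
filter2 = {"relType": "filter", "inputs": [scan2]}
join = {"relType": "join", "inputs": [filter1, filter2]}

print(compile_node(join))  # [['scan', 'filter', 'join'], ['scan', 'filter', 'join']]
```

That final list of lists is exactly the shape I'd like the sack to hold
when the traversal reaches the join.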
My current issue is that I can't figure out how to get the sack merge at
the join to work. I admittedly could also be way off base with this
approach due to some fundamental misunderstanding, but the sack
descriptions regarding the splitting and merging of energies made a lot of
sense to me and seemed applicable to what I'm trying to do.
Here's an example of what I've tried. In this case, I'm simulating
building up the traversals by seeding the sack with an array and then
simply appending the relType (scan, filter, project) as I walk backwards
up from the scans. In this case there aren't any splits, but there should
be one merge at the "join" relNode. I've defined my merge operator to add
the two incoming lists to a single list, thereby producing a list of
lists. My TP3 skills are in their infancy, but here it goes:
gremlin> g.withBulk(false).withSack{[]}{it.clone()}{a, b -> l = []; l << a; l << b; l}.
           V().has('relType', 'scan').
           until(has('type', 'join')).
           repeat(sack{s, v -> s << v.value('relType')}.in('hasInput')).
           emit().barrier().sack()
==>[scan]
==>[[scan, filter], [scan, filter]]
==>[[scan, filter, join], [scan, filter, join]]
==>[scan]
I expected to end up with just a single result of [[scan, filter, join],
[scan, filter, join]], but I think I'm misunderstanding how the individual
traversers work and, in turn, merge. Is there a way I could tweak this
query (or rewrite it outright) so that I end up with a single sack result
at the end containing that list of lists (or, in the real application, the
two separate traversals that I'd then add to a match step)?
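In case it helps, here's how I understand my merge operator to behave,
modeled in plain Python (this models only the lambda's list semantics, not
the traverser machinery, so it's an assumption about where the extra
nesting could come from):

```python
# Plain-Python model of the merge lambda {a, b -> l = []; l << a; l << b; l}:
# wrap the two incoming sacks in a new two-element list.
def merge(a, b):
    return [a, b]

left = ["scan", "filter", "join"]
right = ["scan", "filter", "join"]

# One pairwise merge produces the list of lists I expect at the join...
print(merge(left, right))
# [['scan', 'filter', 'join'], ['scan', 'filter', 'join']]

# ...but because the operator is applied pairwise, merging an
# already-merged sack with another sack nests one level deeper.
print(merge(merge(left, right), ["scan"]))
```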
Thanks,
Ted