Hello, I’m experimenting with moving between X query language and Y bytecode via Universal Bytecode.
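Before diving in, here is a tiny Python sketch (all names hypothetical, not TP4 API) of the general idea: each compilation stage is just a function from one bytecode representation to the next, and a full compiler is the composition of stages.

```python
# Hypothetical sketch: bytecode as a list of (operator, args) instructions;
# a compiler is the left-to-right composition of lowering stages.
from functools import reduce
from typing import Callable, List, Tuple

Instruction = Tuple[str, tuple]   # (operator, arguments)
Bytecode = List[Instruction]
Stage = Callable[[Bytecode], Bytecode]

def compile_pipeline(*stages: Stage) -> Stage:
    """Compose stages: query language -> universal -> data model -> provider."""
    return lambda bc: reduce(lambda acc, stage: stage(acc), stages, bc)

# Toy stages that rename operators to mimic lowering between representations.
def to_universal(bc: Bytecode) -> Bytecode:
    return [('db', ()) if op == 'V' else (op, args) for op, args in bc]

def to_relational(bc: Bytecode) -> Bytecode:
    return [('R', ('people',)) if op == 'db' else (op, args) for op, args in bc]

gremlin = [('V', ()), ('has', ('name', 'marko'))]
compiler = compile_pipeline(to_universal, to_relational)
print(compiler(gremlin))  # [('R', ('people',)), ('has', ('name', 'marko'))]
```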
The general (and very difficult) goal of TP4 is to be able to execute queries (from any known query language) against any database (regardless of underlying data model) using any processing engine.
  - e.g. Gremlin over MySQL (as Relational) using RxJava.
  - e.g. SQL over Cassandra (as WideColumn) using Flink.
  - e.g. SPARQL over MongoDB (as Document) using Akka.
  - e.g. Cypher over Neptune (as Graph) using Pipes.
  - e.g. ...

——————

NOTES:

1. Realize that databases are both processors and structures.
  - MySQL has its own SQL engine.
  - Cassandra has its own CQL engine.
  - MongoDB has its own DocumentQuery engine.
  - ...
2. What can be processed by the database's engine should (typically) be evaluated by the database.
3. What can not be processed by the database's engine should be evaluated by the processor.
4. DATABASE_ENGINE->PROCESSOR->DATABASE_ENGINE->PROCESSOR->etc.
  - Data may move from database to processor, back to database, back to processor, etc. to yield the final query result.

The universal bytecode chunks in the examples to come assume the following n-tuple structure accessible via db():

[0][id:1, label:person, name:marko, outE:*1]
[1][0:*2, 1:*3]
[2][id:7, label:knows, outV:*0, inV:*4]
[3][id:8, label:knows, outV:*0, inV:*5]
[4][id:2, label:person, name:vadas]
[5][id:4, label:person, name:josh]

- All tuples have an id that is outside the id-space of the underlying data model.
- Field values can be pointers to other tuples via the *-prefix notation.

Every compilation goes:

  Query Language -> Universal Bytecode -> Data Model Bytecode -> Provider Bytecode

——————

How do you compile Gremlin to universal bytecode for evaluation over MySQL?

—— GREMLIN QUERY ——

g.V().has('name','marko').out('knows').values('name')

—— UNIVERSAL BYTECODE ——

==> (using tuple pointers)

db().has('label', within('person','project')).
    has('name','marko').
    values('outE').has('label','knows').
    values('inV').values('name')

—— RELATIONAL BYTECODE ——

==>

R('people','projects').has('name','marko').
    join(R('knows')).by('id',eq('outV')).
    join(R('people')).by('inV',eq('id')).
    values('name')

—— JDBC BYTECODE ——

==>

union(
  sql('SELECT p2.name FROM people as p1, knows, people as p2 WHERE p1.name=marko AND p1.id=knows.outV AND knows.inV=p2.id'),
  sql('SELECT p2.name FROM projects as p1, knows, projects as p2 WHERE p1.name=marko AND p1.id=knows.outV AND knows.inV=p2.id'))

The assumed SQL tables are:

CREATE TABLE people (
  id int,
  label string, // person
  name string,
  PRIMARY KEY id
);

CREATE TABLE knows (
  id int,
  label string, // knows
  name string,
  outV int,
  inV int,
  PRIMARY KEY id,
  FOREIGN KEY (outV) REFERENCES people(id),
  FOREIGN KEY (inV) REFERENCES people(id)
);

There need to be two mapping specifications (Graph->Universal & Universal->Relational):
  - V() -> the vertex tables are people+projects.
  - label() -> the plural of the label is the table name.
  - outE -> person.outE.knows is resolved via knows.outV (a foreign key to the person table by id).
  - inV -> knows.values('inV') is resolved to a person (a foreign key to the person table by id).

Note that we are assuming that a property graph is encoded in MySQL, as we have outE, inV, etc. column names. If we want to interpret arbitrary relational data as a graph, then it is important to denote which tables are "join tables" and what the column names are for joining. What about when a "join table" references more than 2 other rows (i.e. an n-ary relation, a hypergraph)? Is there a general solution for viewing any relational schema in terms of a binary property graph?

——————

How do you compile SQL to universal bytecode for evaluation over Cassandra?

—— SQL QUERY ——

SELECT p2.name FROM people as p1, knows, people as p2
WHERE p1.name=marko AND p1.id=knows.outV AND knows.inV=p2.id

—— UNIVERSAL BYTECODE ——

==> (using tuple pointers)

db().has('label','person').has('name','marko').
    values('outE').has('label','knows').
    values('inV').values('name')

—— WIDE-COLUMN BYTECODE ——

==>

R('people').has('name','marko').
    values('outE').has('label','knows').values('inV').as('$inV')
R('people').has('id',eq(path('$inV'))).values('name')

—— CASSANDRA BYTECODE ——

==>

cql('SELECT outE:knows FROM people WHERE name=marko').
  cql('SELECT name FROM people WHERE id=$inV').by('inV')

There needs to be a mapping specification from SQL->Universal:
  - knows.outV is a foreign key to a person row.
  - person.outE.knows is referenced by knows.outV.

The people table is defined as below, where each edge is a column:

CREATE TABLE people (
  id int,
  name string,
  age int,
  outE list<map<string,object>>,
  PRIMARY KEY (id,name)
);

The last bytecode is chained CQL, where each result from the first CQL instruction is fed into the second CQL instruction one at a time (i.e. cql() is a flatmap). While CQL does not support joins, Flink (the registered processor) does, and thus joins could be done by Flink. This email does not discuss processor strategies.

——————

How do you compile SPARQL to universal bytecode for evaluation over MongoDB?

—— SPARQL QUERY ——

SELECT ?name WHERE {
  ?x ex:label person .
  ?x ex:name marko .
  ?x ex:outE ?y .
  ?y ex:label knows .
  ?y ex:inV ?z .
  ?z ex:name ?name
}

—— UNIVERSAL BYTECODE ——

==> (using tuple pointers)

db().has('label','person').has('name','marko').
    values('outE').has('label','knows').
    values('inV').values('name')

—— DOCUMENT BYTECODE ——

==>

D().has('label','person').has('name','marko').
    values('outE').has('label','knows').values('inV').as('a')
D().has('_id',eq(path('a'))).values('name')

—— MONGODB BYTECODE ——

==>

mquery('{ label:"person", name:"marko", outE:{ label:"knows" } }, { "outE.inV":1 }').
  values('outE.inV').
mquery('{ _id:$a }, { name:1 }').
  values('name')

The MongoDB documents have an outE field, which is a list of edges that are themselves objects with label and inV fields.

——————

Just spittin' ideas,
Marko.
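P.S. A minimal Python sketch (hypothetical names, not TP4 code) of the n-tuple structure from the top of this email, with *-pointers dereferenced during traversal. It evaluates the universal bytecode chain from the Gremlin example and yields the names of the people marko knows.

```python
# The n-tuple structure from above: tuples keyed by field name; a value
# like '*4' is a pointer that dereferences to tuple 4. A tuple whose keys
# are all integers (tuple 1) is a list-tuple and gets flattened.
TUPLES = {
    0: {'id': 1, 'label': 'person', 'name': 'marko', 'outE': '*1'},
    1: {0: '*2', 1: '*3'},
    2: {'id': 7, 'label': 'knows', 'outV': '*0', 'inV': '*4'},
    3: {'id': 8, 'label': 'knows', 'outV': '*0', 'inV': '*5'},
    4: {'id': 2, 'label': 'person', 'name': 'vadas'},
    5: {'id': 4, 'label': 'person', 'name': 'josh'},
}

def deref(value):
    """Follow a *-prefixed pointer to its tuple; pass other values through."""
    if isinstance(value, str) and value.startswith('*'):
        return TUPLES[int(value[1:])]
    return value

class Traversal:
    def __init__(self, objs):
        self.objs = list(objs)

    def has(self, key, value):
        """Keep only tuples whose field `key` equals `value`."""
        self.objs = [o for o in self.objs
                     if isinstance(o, dict) and o.get(key) == value]
        return self

    def values(self, key):
        """Project field `key`, dereferencing pointers and flattening list-tuples."""
        out = []
        for o in self.objs:
            if key not in o:
                continue
            v = deref(o[key])
            if isinstance(v, dict) and all(isinstance(k, int) for k in v):
                out.extend(deref(x) for x in v.values())  # list-tuple
            else:
                out.append(v)
        self.objs = out
        return self

def db():
    return Traversal(TUPLES.values())

print(db().has('label', 'person').has('name', 'marko')
          .values('outE').has('label', 'knows')
          .values('inV').values('name').objs)  # ['vadas', 'josh']
```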