Hello, I’m experimenting with moving between X query language and Y bytecode via Universal Bytecode.
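Before diving in, here is a tiny Python sketch (all names hypothetical, not TP4 API) of the general idea: each compilation stage is just a function from one bytecode representation to the next, and a full compiler is the composition of stages.

```python
# Hypothetical sketch: bytecode as a list of (operator, args) instructions;
# a compiler is the left-to-right composition of lowering stages.
from functools import reduce
from typing import Callable, List, Tuple

Instruction = Tuple[str, tuple]   # (operator, arguments)
Bytecode = List[Instruction]
Stage = Callable[[Bytecode], Bytecode]

def compile_pipeline(*stages: Stage) -> Stage:
    """Compose stages: query language -> universal -> data model -> provider."""
    return lambda bc: reduce(lambda acc, stage: stage(acc), stages, bc)

# Toy stages that rename operators to mimic lowering between representations.
def to_universal(bc: Bytecode) -> Bytecode:
    return [('db', ()) if op == 'V' else (op, args) for op, args in bc]

def to_relational(bc: Bytecode) -> Bytecode:
    return [('R', ('people',)) if op == 'db' else (op, args) for op, args in bc]

gremlin = [('V', ()), ('has', ('name', 'marko'))]
compiler = compile_pipeline(to_universal, to_relational)
print(compiler(gremlin))  # [('R', ('people',)), ('has', ('name', 'marko'))]
```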
The general (and very difficult) goal of TP4 is to be able to execute queries (from any known query language) against any database (regardless of underlying data model) using any processing engine.
  - e.g. Gremlin over MySQL (as Relational) using RxJava.
  - e.g. SQL over Cassandra (as WideColumn) using Flink.
  - e.g. SPARQL over MongoDB (as Document) using Akka.
  - e.g. Cypher over Neptune (as Graph) using Pipes.
  - e.g. ...

——————

NOTES:

1. Realize that databases are both processors and structures.
  - MySQL has its own SQL engine.
  - Cassandra has its own CQL engine.
  - MongoDB has its own DocumentQuery engine.
  - ...
2. What can be processed by the database's engine should (typically) be evaluated by the database.
3. What can not be processed by the database's engine should be evaluated by the processor.
4. DATABASE_ENGINE->PROCESSOR->DATABASE_ENGINE->PROCESSOR->etc.
  - Data may move from database to processor, back to database, back to processor, etc. to yield the final query result.

The universal bytecode chunks in the examples to come assume the following n-tuple structure accessible via db():

[0][id:1, label:person, name:marko, outE:*1]
[1][0:*2, 1:*3]
[2][id:7, label:knows, outV:*0, inV:*4]
[3][id:8, label:knows, outV:*0, inV:*5]
[4][id:2, label:person, name:vadas]
[5][id:4, label:person, name:josh]

- All tuples have an id that is outside the id-space of the underlying data model.
- Field values can be pointers to other tuples via the *-prefix notation.

Every compilation goes:

  Query Language -> Universal Bytecode -> Data Model Bytecode -> Provider Bytecode

——————

How do you compile Gremlin to universal bytecode for evaluation over MySQL?

—— GREMLIN QUERY ——

g.V().has('name','marko').out('knows').values('name')

—— UNIVERSAL BYTECODE ——

==> (using tuple pointers)

db().has('label', within('person','project')).
    has('name','marko').
    values('outE').has('label','knows').
    values('inV').values('name')

—— RELATIONAL BYTECODE ——

==>

R('people','projects').has('name','marko').
    join(R('knows')).by('id',eq('outV')).
    join(R('people')).by('inV',eq('id')).
    values('name')

—— JDBC BYTECODE ——

==>

union(
  sql('SELECT p2.name FROM people as p1, knows, people as p2 WHERE p1.name=marko AND p1.id=knows.outV AND knows.inV=p2.id'),
  sql('SELECT p2.name FROM projects as p1, knows, projects as p2 WHERE p1.name=marko AND p1.id=knows.outV AND knows.inV=p2.id'))

The assumed SQL tables are:

CREATE TABLE people (
  id int,
  label string, // person
  name string,
  PRIMARY KEY id
);

CREATE TABLE knows (
  id int,
  label string, // knows
  name string,
  outV int,
  inV int,
  PRIMARY KEY id,
  FOREIGN KEY (outV) REFERENCES people(id),
  FOREIGN KEY (inV) REFERENCES people(id)
);

There need to be two mapping specifications (Graph->Universal & Universal->Relational):
  - V() -> the vertex tables are people+projects.
  - label() -> the plural of the label is the table name.
  - outE -> person.outE.knows is resolved via knows.outV (a foreign key to the person table by id).
  - inV -> knows.values('inV') is resolved to a person (a foreign key to the person table by id).

Note that we are assuming that a property graph is encoded in MySQL, as we have outE, inV, etc. column names. If we want to interpret arbitrary relational data as a graph, then it is important to denote which tables are "join tables" and what the column names are for joining. What about when a "join table" references more than 2 other rows (i.e. an n-ary relation, a hypergraph)? Is there a general solution for viewing any relational schema in terms of a binary property graph?

——————

How do you compile SQL to universal bytecode for evaluation over Cassandra?

—— SQL QUERY ——

SELECT p2.name FROM people as p1, knows, people as p2
WHERE p1.name=marko AND p1.id=knows.outV AND knows.inV=p2.id

—— UNIVERSAL BYTECODE ——

==> (using tuple pointers)

db().has('label','person').has('name','marko').
    values('outE').has('label','knows').
    values('inV').values('name')

—— WIDE-COLUMN BYTECODE ——

==>

R('people').has('name','marko').
    values('outE').has('label','knows').values('inV').as('$inV')
R('people').has('id',eq(path('$inV'))).values('name')

—— CASSANDRA BYTECODE ——

==>

cql('SELECT outE:knows FROM people WHERE name=marko').
  cql('SELECT name FROM people WHERE id=$inV').by('inV')

There needs to be a mapping specification from SQL->Universal:
  - knows.outV is a foreign key to a person row.
  - person.outE.knows is referenced by knows.outV.

The people table is defined as below, where each edge is a column:

CREATE TABLE people (
  id int,
  name string,
  age int,
  outE list<map<string,object>>,
  PRIMARY KEY (id,name)
);

The last bytecode is chained CQL, where each result from the first CQL instruction is fed into the second CQL instruction one at a time (i.e. cql() is a flatmap). While CQL does not support joins, Flink (the registered processor) does, and thus joins could be done by Flink. This email does not discuss processor strategies.

——————

How do you compile SPARQL to universal bytecode for evaluation over MongoDB?

—— SPARQL QUERY ——

SELECT ?name WHERE {
  ?x ex:label person .
  ?x ex:name marko .
  ?x ex:outE ?y .
  ?y ex:label knows .
  ?y ex:inV ?z .
  ?z ex:name ?name
}

—— UNIVERSAL BYTECODE ——

==> (using tuple pointers)

db().has('label','person').has('name','marko').
    values('outE').has('label','knows').
    values('inV').values('name')

—— DOCUMENT BYTECODE ——

==>

D().has('label','person').has('name','marko').
    values('outE').has('label','knows').values('inV').as('a')
D().has('_id',eq(path('a'))).values('name')

—— MONGODB BYTECODE ——

==>

mquery('{ label:"person", name:"marko", outE:{ label:"knows" } }, { "outE.inV":1 }').
  values('outE.inV').
mquery('{ _id:$a }, { name:1 }').
  values('name')

The MongoDB documents have an outE field, which is a list of edges that are themselves objects with label and inV fields.

——————

Just spittin' ideas,
Marko.
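P.S. A minimal Python sketch (hypothetical names, not TP4 code) of the n-tuple structure from the top of this email, with *-pointers dereferenced during traversal. It evaluates the universal bytecode chain from the Gremlin example and yields the names of the people marko knows.

```python
# The n-tuple structure from above: tuples keyed by field name; a value
# like '*4' is a pointer that dereferences to tuple 4. A tuple whose keys
# are all integers (tuple 1) is a list-tuple and gets flattened.
TUPLES = {
    0: {'id': 1, 'label': 'person', 'name': 'marko', 'outE': '*1'},
    1: {0: '*2', 1: '*3'},
    2: {'id': 7, 'label': 'knows', 'outV': '*0', 'inV': '*4'},
    3: {'id': 8, 'label': 'knows', 'outV': '*0', 'inV': '*5'},
    4: {'id': 2, 'label': 'person', 'name': 'vadas'},
    5: {'id': 4, 'label': 'person', 'name': 'josh'},
}

def deref(value):
    """Follow a *-prefixed pointer to its tuple; pass other values through."""
    if isinstance(value, str) and value.startswith('*'):
        return TUPLES[int(value[1:])]
    return value

class Traversal:
    def __init__(self, objs):
        self.objs = list(objs)

    def has(self, key, value):
        """Keep only tuples whose field `key` equals `value`."""
        self.objs = [o for o in self.objs
                     if isinstance(o, dict) and o.get(key) == value]
        return self

    def values(self, key):
        """Project field `key`, dereferencing pointers and flattening list-tuples."""
        out = []
        for o in self.objs:
            if key not in o:
                continue
            v = deref(o[key])
            if isinstance(v, dict) and all(isinstance(k, int) for k in v):
                out.extend(deref(x) for x in v.values())  # list-tuple
            else:
                out.append(v)
        self.objs = out
        return self

def db():
    return Traversal(TUPLES.values())

print(db().has('label', 'person').has('name', 'marko')
          .values('outE').has('label', 'knows')
          .values('inV').values('name').objs)  # ['vadas', 'josh']
```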