Hi,
*** I’ve started a Google Doc called “A Multi-Model Data Type Specification.” ***
An abstract data type (ADT) is a data structure plus the operations that manipulate it.
Classic examples include:
1. stacks — arrays with push() and pop() operations.
2. lists — arrays with add(), remove(), get(), etc. operations.
3. graphs — networks with out(), in(), etc. operations.
4. …
Databases can be defined by their ADT. Database ADTs typically involve a data
structure+indices and a set of data manipulation operations.
1. key/value — pairs+key-index with get(), remove(), put(), etc.
operations.
2. relational — relations+indices with select(), project(), join(),
etc. operations.
3. RDF — statements+spog-indices with subject(), predicate(), object(),
match(), etc. operations.
4. graph — vertices+edges+indices with has(), values(), out(), in(),
etc. operations.
5. …
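Each of these can be written down directly as code. As a minimal sketch, here is the key/value ADT from point 1 as a Java class (KVStore is an illustrative name, not part of any TinkerPop API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// A minimal key/value ADT: pairs + a key-index, with get/put/remove operations.
// KVStore is a hypothetical name for illustration only.
final class KVStore<K, V> {
    private final Map<K, V> index = new HashMap<>(); // the "key-index"

    public void put(K key, V value) { index.put(key, value); }
    public Optional<V> get(K key) { return Optional.ofNullable(index.get(key)); }
    public Optional<V> remove(K key) { return Optional.ofNullable(index.remove(key)); }
}
```

The other entries in the list above differ only in the data structure behind the index and the operation set exposed.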
In the spec thus far, I argue that the database industry has become overly
fixated on classifying databases into discrete categories, each with its own
unique terminology (vertices/edges, tables/rows, documents, statements) and
overlapping operations. I believe the primary reason for this is that databases
are monolithic systems composed of a query language, a processing engine, and a
data storage system. When all these pieces are assembled by the database
engineering team, a “data perspective” (ADT) is set in stone.
I believe this has unnecessarily created database technology silos.
———
What we are trying to do at Apache TinkerPop is create a multi-model ADT that
spans the various database categories by using a generic lexicon and set of
operations capable of performing all database operations. People may argue:
“Why not just use the relational ADT and table/row lexicon, as it can
emulate every other known ADT relatively naturally?”
I believe we are basically doing that with n-tuples (i.e. "schemaless rows").
However, what makes our approach unique is that our ADT doesn’t assume that it
will solely be used by a monolithic database system. Instead, our ADT is
designed on the assumption that the storage system, the processing engine, and
the query language are independent components that are ultimately integrated
into a “synthetic database” (a database that is custom assembled to meet the
data modeling and performance requirements of an end user’s applications).
Synthetic databases are possible with our multi-model ADT.
A multi-model ADT-compliant property graph query language assumes a basic
property graph ADT embedding:
{graph}
V()
{vertex}
id()
label()
outE()
inE()
{edge}
id()
label()
outV()
inV()
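For concreteness, this basic embedding could be sketched as plain Java interfaces, plus a tiny in-memory implementation to exercise them (all names here are illustrative, not the actual TP4 API):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of the basic property graph embedding: illustrative names only.
interface BasicElement { Object id(); String label(); }

interface BasicVertex extends BasicElement {
    Iterator<BasicEdge> outE();
    Iterator<BasicEdge> inE();
}

interface BasicEdge extends BasicElement {
    BasicVertex outV();
    BasicVertex inV();
}

// A minimal in-memory implementation, just enough to exercise the contract.
final class SimpleVertex implements BasicVertex {
    final Object id; final String label;
    final List<BasicEdge> out = new ArrayList<>(), in = new ArrayList<>();
    SimpleVertex(Object id, String label) { this.id = id; this.label = label; }
    public Object id() { return id; }
    public String label() { return label; }
    public Iterator<BasicEdge> outE() { return out.iterator(); }
    public Iterator<BasicEdge> inE() { return in.iterator(); }
}

final class SimpleEdge implements BasicEdge {
    final Object id; final String label;
    final SimpleVertex outVertex, inVertex;
    SimpleEdge(Object id, String label, SimpleVertex outVertex, SimpleVertex inVertex) {
        this.id = id; this.label = label;
        this.outVertex = outVertex; this.inVertex = inVertex;
        outVertex.out.add(this); inVertex.in.add(this); // wire up adjacency
    }
    public Object id() { return id; }
    public String label() { return label; }
    public BasicVertex outV() { return outVertex; }
    public BasicVertex inV() { return inVertex; }
}
```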
The query language says that it understands map-tuples ({}) of type graph,
vertex, and edge. Moreover, along with core bytecode (has, values, …), it states
that these tuples can be processed by the provided property graph-specific
instructions. Thus,
g.V(1).out('knows').values('name')
=== Gremlin compiles to Basic Property Graph Bytecode ==>
V().filter(id().is(1)).outE().filter(label().is('knows')).inV().values('name')
Without strategies, the above bytecode would execute as a series of inefficient
linear “scan and filter” operations — the basic functional requirements of a
“property graph.” However, the data storage system says that it has various
indices (and accessing instructions) for these types of tuples.
{graph}
V()
V(object)
{vertex}
id()
label()
outE()
inE()
out(string..)
in(string...)
{edge}
id()
label()
outV()
inV()
Thus, its property graph ADT extends the basic property graph ADT used by the
query language. This enables TP4 strategies to rewrite the submitted bytecode
to use the data storage system’s supported instructions (i.e. optimizations).
V().filter(id().is(1)).outE().filter(label().is('knows')).inV().values('name')
=== Property Graph Bytecode compiles to Data Storage System Optimized
Bytecode ==>
V(1).out('knows').values('name')
This bytecode is then passed to the processing engine which seamlessly operates
on the data storage system’s tuples as defined by the instructions.
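As a toy illustration of what such a strategy does, here is a sketch that pattern-matches over a string-based instruction list and fuses scan-and-filter pairs into the storage system's indexed instructions. The string encoding is purely for demonstration; real TP4 bytecode is not strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy bytecode rewrite: fuse V() + filter(id().is(x)) into the provider's
// indexed lookup V(x), and outE() + filter(label().is(l)) + inV() into out(l).
final class ToyStrategies {
    private static final Pattern ID_FILTER =
        Pattern.compile("filter\\(id\\(\\)\\.is\\((.+)\\)\\)");
    private static final Pattern LABEL_FILTER =
        Pattern.compile("filter\\(label\\(\\)\\.is\\((.+)\\)\\)");

    static List<String> apply(List<String> bytecode) {
        List<String> rewritten = new ArrayList<>();
        for (int i = 0; i < bytecode.size(); i++) {
            String insn = bytecode.get(i);
            String next = i + 1 < bytecode.size() ? bytecode.get(i + 1) : "";
            Matcher m;
            if (insn.equals("V()") && (m = ID_FILTER.matcher(next)).matches()) {
                rewritten.add("V(" + m.group(1) + ")");  // use the id index
                i++;                                      // consume the filter
            } else if (insn.equals("outE()")
                       && (m = LABEL_FILTER.matcher(next)).matches()
                       && i + 2 < bytecode.size()
                       && bytecode.get(i + 2).equals("inV()")) {
                rewritten.add("out(" + m.group(1) + ")"); // label-organized adjacency
                i += 2;                                    // consume filter + inV()
            } else {
                rewritten.add(insn);
            }
        }
        return rewritten;
    }
}
```

A real strategy operates on structured bytecode and consults the provider's published embedding before rewriting, but the shape — match a pattern, substitute a cheaper instruction — is the same.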
Question: What is the out('knows') instruction? Simple: it’s a
FlatMapFunction<Vertex,Edge> that calls the following method on the TP4 Vertex
interface.
Iterator<Edge> Vertex.out(String... labels)
The data storage system says that its vertex tuple objects support the
out(string...) instruction; thus, the data storage system organizes
incident edges by label in its respective substrate (i.e. disk or memory).
Great!
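As a sketch, assuming illustrative Vertex/Edge classes (not the actual TP4 interfaces), the instruction could look like:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative sketch only. The vertex organizes its incident edges by label,
// so out(labels) is an index lookup rather than a scan over all edges.
interface FlatMapFunction<A, B> { Iterator<B> apply(A a); }

final class Vertex {
    final Map<String, List<Edge>> outEdgesByLabel = new HashMap<>();

    Iterator<Edge> out(String... labels) {
        List<Edge> result = new ArrayList<>();
        for (String label : labels)
            result.addAll(outEdgesByLabel.getOrDefault(label, List.of()));
        return result.iterator();
    }
}

final class Edge {
    final String label; final Vertex outVertex, inVertex;
    Edge(String label, Vertex outVertex, Vertex inVertex) {
        this.label = label; this.outVertex = outVertex; this.inVertex = inVertex;
        outVertex.outEdgesByLabel
                 .computeIfAbsent(label, k -> new ArrayList<>()).add(this);
    }
}

// The out('knows') instruction is then just a flat-map over vertices.
final class OutInstruction implements FlatMapFunction<Vertex, Edge> {
    final String[] labels;
    OutInstruction(String... labels) { this.labels = labels; }
    public Iterator<Edge> apply(Vertex v) { return v.out(labels); }
}
```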
IMPORTANT: Notice that our multi-model ADT’s operations are bytecode
instructions. There is no longer a concept of “pointers.” A pointer is simply a
map instruction! This means that our multi-model specification is not just a
data structure specification, but also a bytecode specification. I believe we
will ultimately have one spec driving TP4 VM development!
————————
!@#$@#^$%@&@%^@#$#
Now let’s get crazy………..
@#$%@#$%@#$%@#$%^
Assume the data storage system actually produces a tuple with the following
information.
{ #pg.type:vertex, #rdbms.type:row, #pg.label:person, #id:1, name:marko, age:29 }
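One way to model such a tuple is as a plain map whose ‘#’-prefixed keys carry ADT metadata while the rest is user data; a hypothetical helper can then report which embeddings a tuple participates in:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch: '#'-prefixed keys are ADT metadata, the rest is user data.
// MultiModelTuple is an illustrative name, not a TP4 class.
final class MultiModelTuple {
    // Returns the ADT types this tuple participates in, e.g. {"vertex", "row"}.
    static Set<String> embeddings(Map<String, Object> tuple) {
        Set<String> types = new HashSet<>();
        for (Map.Entry<String, Object> e : tuple.entrySet())
            if (e.getKey().startsWith("#") && e.getKey().endsWith(".type"))
                types.add(String.valueOf(e.getValue()));
        return types;
    }
}
```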
Ah ha! So this tuple is both a vertex and a row! What does that mean? It means
that we can ask the data storage system how it encodes the RDBMS ADT. Suppose
it tells us:
{database}
table(string)
{table}
database()
rows()
select(string,predicate)
project(string...)
join(table, predicate…)
{row}
asTable()
table()
This specification says that it produces database, table, and row tuples that
support the respective bytecode instructions.
We can now interpret the tuple as a row and operate on it as such.
{row}.asTable().join(database().table('companies')).by('name','employee_name').values('industry')
In essence, if a “multi-model language” existed, the following
‘graph’/‘rdbms’-hybrid query would be possible:
g.V(1).out('knows').
  sql('SELECT industry
       FROM people, companies
       WHERE people.id=$id AND people.name=employee_name').by(id())
What is this Frankenstein data structure? It’s an RDBMS with foreign key
relations that can be traversed! Or it’s a graph database that supports dynamic
edge creation (i.e. vertex joins). Or it’s just the multi-model ADT being
sweeeeeeet.
A few points to realize (*** Kuppitz will like this ***):
1. When we have a {table} tuple, we simply have a proxy/reference to (e.g.) the
respective MySQL table.
2. The processing engine isn’t pulling in the table’s data.
3. Because the data storage system supports join(), we let it do the work
instead of having the processing engine do it.
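Points 1–3 amount to lazy pushdown: join() composes a description of work and hands it to the storage system rather than moving rows through the processing engine. A toy sketch (all names hypothetical):

```java
// Toy sketch of a table *proxy*: operations like join() only compose a
// description of the work; no rows move until results are actually iterated.
final class TableProxy {
    final String description; // e.g. the SQL the storage system would run
    TableProxy(String description) { this.description = description; }

    // join() returns a new proxy; the storage system will do the join itself.
    TableProxy join(TableProxy other, String leftCol, String rightCol) {
        return new TableProxy("(" + description + " JOIN " + other.description
                + " ON " + leftCol + " = " + rightCol + ")");
    }
}
```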
In conclusion.
Just like any other database ADT, our multi-model ADT can be used to define
data structures and operations. Unlike other database ADTs, however, ours can
also simulate them using their native lexicons and operations. Moreover, tuples
in our multi-model ADT can be used across multiple ADT embeddings! Finally, the
query language, processing engine, and data storage system stakeholders can all
be different entities whose technologies interact with each other via
multi-model ADT tuples and bytecode, where the TP4 VM speedily glues it all
together at query time using each component’s published ADT embeddings.
Thoughts?
Marko.
http://rredux.com