Hi,
*** I’ve started a Google Doc called “A Multi-Model Data Type Specification.” ***
An abstract data type (ADT) is a data structure plus the operations that manipulate it.
Classic examples include:
1. stacks — arrays with push() and pop() operations.
2. lists — arrays with add(), remove(), get(), etc. operations.
3. graphs — networks with out(), in(), etc. operations.
4. …
Databases can be defined by their ADT. Database ADTs typically involve a data
structure+indices and a set of data manipulation operations.
1. key/value — pairs+key-index with get(), remove(), put(), etc.
operations.
2. relational — relations+indices with select(), project(), join(),
etc. operations.
3. RDF — statements+spog-indices with subject(), predicate(), object(),
match(), etc. operations.
4. graph — vertices+edges+indices with has(), values(), out(), in(),
etc. operations.
5. …
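Each of these can be written down directly as code. As a minimal sketch, here is the key/value ADT from point 1 as a Java class (KVStore is an illustrative name, not part of any TinkerPop API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// A minimal key/value ADT: pairs + a key-index, with get/put/remove operations.
// KVStore is a hypothetical name for illustration only.
final class KVStore<K, V> {
    private final Map<K, V> index = new HashMap<>(); // the "key-index"

    public void put(K key, V value) { index.put(key, value); }
    public Optional<V> get(K key) { return Optional.ofNullable(index.get(key)); }
    public Optional<V> remove(K key) { return Optional.ofNullable(index.remove(key)); }
}
```

The other entries in the list above differ only in the data structure behind the index and the operation set exposed.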
In the spec thus far, I argue that the database industry has become overly
fixated on classifying databases into discrete categories, each with its own
unique terminology (vertices/edges, tables/rows, documents, statements) and
overlapping operations. I believe the primary reason for this is that databases
are monolithic systems composed of a query language, a processing engine, and a
data storage system. When all these pieces are assembled by the database
engineering team, a “data perspective” (ADT) is set in stone.
I believe this has unnecessarily created database technology silos.
———
What we are trying to do at Apache TinkerPop is create a multi-model ADT that
spans the various database categories by using a generic lexicon and set of
operations capable of performing all database operations. People may argue:
“Why not just use the relational ADT and table/row lexicon, as it can
emulate every other known ADT relatively naturally?”
I believe we are basically doing that with n-tuples (i.e. "schemaless rows").
However, what makes our approach unique is that our ADT doesn’t assume that it
will solely be used by a monolithic database system. Instead, our ADT is
designed on the assumption that the storage system, the processing engine, and
the query language are independent components that are ultimately integrated
into a “synthetic database” (a database that is custom assembled to meet the
data modeling and performance requirements of an end user’s applications).
Synthetic databases are possible with our multi-model ADT.
A multi-model ADT-compliant property graph query language assumes a basic
property graph ADT embedding:
{graph}
V()
{vertex}
id()
label()
outE()
inE()
{edge}
id()
label()
outV()
inV()
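For concreteness, this basic embedding could be sketched as plain Java interfaces, plus a tiny in-memory implementation to exercise them (all names here are illustrative, not the actual TP4 API):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of the basic property graph embedding: illustrative names only.
interface BasicElement { Object id(); String label(); }

interface BasicVertex extends BasicElement {
    Iterator<BasicEdge> outE();
    Iterator<BasicEdge> inE();
}

interface BasicEdge extends BasicElement {
    BasicVertex outV();
    BasicVertex inV();
}

// A minimal in-memory implementation, just enough to exercise the contract.
final class SimpleVertex implements BasicVertex {
    final Object id; final String label;
    final List<BasicEdge> out = new ArrayList<>(), in = new ArrayList<>();
    SimpleVertex(Object id, String label) { this.id = id; this.label = label; }
    public Object id() { return id; }
    public String label() { return label; }
    public Iterator<BasicEdge> outE() { return out.iterator(); }
    public Iterator<BasicEdge> inE() { return in.iterator(); }
}

final class SimpleEdge implements BasicEdge {
    final Object id; final String label;
    final SimpleVertex outVertex, inVertex;
    SimpleEdge(Object id, String label, SimpleVertex outVertex, SimpleVertex inVertex) {
        this.id = id; this.label = label;
        this.outVertex = outVertex; this.inVertex = inVertex;
        outVertex.out.add(this); inVertex.in.add(this); // wire up adjacency
    }
    public Object id() { return id; }
    public String label() { return label; }
    public BasicVertex outV() { return outVertex; }
    public BasicVertex inV() { return inVertex; }
}
```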
The query language says that it understands map-tuples ({}) of type graph,
vertex, and edge. Moreover, along with core bytecode (has, values, …), it states
that these tuples can be processed by the provided property graph-specific
instructions. Thus,
g.V(1).out('knows').values('name')
=== Gremlin compiles to Basic Property Graph Bytecode ==>
V().filter(id().is(1)).outE().filter(label().is('knows')).inV().values('name')
Without strategies, the above bytecode would execute as a series of inefficient
linear “scan and filter” operations — the basic functional requirements of a
“property graph.” However, the data storage system says that it has various
indices (and accessing instructions) for these types of tuples.
{graph}
V()
V(object)
{vertex}
id()
label()
outE()
inE()
out(string..)
in(string...)
{edge}
id()
label()
outV()
inV()
Thus, its property graph ADT extends the basic property graph ADT used by the
query language. This enables TP4 strategies to rewrite the submitted bytecode
to use the data storage system’s supported instructions (i.e. optimizations).
V().filter(id().is(1)).outE().filter(label().is('knows')).inV().values('name')
=== Property Graph Bytecode compiles to Data Storage System Optimized
Bytecode ==>
V(1).out('knows').values('name')
This bytecode is then passed to the processing engine which seamlessly operates
on the data storage system’s tuples as defined by the instructions.
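As a toy illustration of what such a strategy does, here is a sketch that pattern-matches over a string-based instruction list and fuses scan-and-filter pairs into the storage system's indexed instructions. The string encoding is purely for demonstration; real TP4 bytecode is not strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy bytecode rewrite: fuse V() + filter(id().is(x)) into the provider's
// indexed lookup V(x), and outE() + filter(label().is(l)) + inV() into out(l).
final class ToyStrategies {
    private static final Pattern ID_FILTER =
        Pattern.compile("filter\\(id\\(\\)\\.is\\((.+)\\)\\)");
    private static final Pattern LABEL_FILTER =
        Pattern.compile("filter\\(label\\(\\)\\.is\\((.+)\\)\\)");

    static List<String> apply(List<String> bytecode) {
        List<String> rewritten = new ArrayList<>();
        for (int i = 0; i < bytecode.size(); i++) {
            String insn = bytecode.get(i);
            String next = i + 1 < bytecode.size() ? bytecode.get(i + 1) : "";
            Matcher m;
            if (insn.equals("V()") && (m = ID_FILTER.matcher(next)).matches()) {
                rewritten.add("V(" + m.group(1) + ")");  // use the id index
                i++;                                      // consume the filter
            } else if (insn.equals("outE()")
                       && (m = LABEL_FILTER.matcher(next)).matches()
                       && i + 2 < bytecode.size()
                       && bytecode.get(i + 2).equals("inV()")) {
                rewritten.add("out(" + m.group(1) + ")"); // label-organized adjacency
                i += 2;                                    // consume filter + inV()
            } else {
                rewritten.add(insn);
            }
        }
        return rewritten;
    }
}
```

A real strategy operates on structured bytecode and consults the provider's published embedding before rewriting, but the shape — match a pattern, substitute a cheaper instruction — is the same.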
Question: What is the out('knows') instruction? Simple: it’s a
FlatMapFunction<Vertex,Edge> that calls the following method on the TP4 Vertex
interface.
Iterator<Edge> Vertex.out(String... labels)
The data storage system says that its vertex tuple objects support the
out(string...) instruction; thus, the data storage system organizes
incident edges by label in its respective substrate (i.e. disk or memory).
Great!
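As a sketch, assuming illustrative Vertex/Edge classes (not the actual TP4 interfaces), the instruction could look like:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative sketch only. The vertex organizes its incident edges by label,
// so out(labels) is an index lookup rather than a scan over all edges.
interface FlatMapFunction<A, B> { Iterator<B> apply(A a); }

final class Vertex {
    final Map<String, List<Edge>> outEdgesByLabel = new HashMap<>();

    Iterator<Edge> out(String... labels) {
        List<Edge> result = new ArrayList<>();
        for (String label : labels)
            result.addAll(outEdgesByLabel.getOrDefault(label, List.of()));
        return result.iterator();
    }
}

final class Edge {
    final String label; final Vertex outVertex, inVertex;
    Edge(String label, Vertex outVertex, Vertex inVertex) {
        this.label = label; this.outVertex = outVertex; this.inVertex = inVertex;
        outVertex.outEdgesByLabel
                 .computeIfAbsent(label, k -> new ArrayList<>()).add(this);
    }
}

// The out('knows') instruction is then just a flat-map over vertices.
final class OutInstruction implements FlatMapFunction<Vertex, Edge> {
    final String[] labels;
    OutInstruction(String... labels) { this.labels = labels; }
    public Iterator<Edge> apply(Vertex v) { return v.out(labels); }
}
```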
IMPORTANT: Notice that our multi-model ADT’s operations are bytecode
instructions. There is no longer a concept of “pointers.” A pointer is simply a
map instruction! This means that our multi-model specification is not just a
data structure specification, but also a bytecode specification. I believe we
will ultimately have one spec driving TP4 VM development!
————————
!@#$@#^$%@&@%^@#$#
Now let’s get crazy………..
@#$%@#$%@#$%@#$%^
Assume the data storage system actually produces a tuple with the following
information.
{ #pg.type:vertex, #rdbms.type:row, #pg.label:person, #id:1, name:marko, age:29 }
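One way to model such a tuple is as a plain map whose ‘#’-prefixed keys carry ADT metadata while the rest is user data; a hypothetical helper can then report which embeddings a tuple participates in:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch: '#'-prefixed keys are ADT metadata, the rest is user data.
// MultiModelTuple is an illustrative name, not a TP4 class.
final class MultiModelTuple {
    // Returns the ADT types this tuple participates in, e.g. {"vertex", "row"}.
    static Set<String> embeddings(Map<String, Object> tuple) {
        Set<String> types = new HashSet<>();
        for (Map.Entry<String, Object> e : tuple.entrySet())
            if (e.getKey().startsWith("#") && e.getKey().endsWith(".type"))
                types.add(String.valueOf(e.getValue()));
        return types;
    }
}
```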
Ah ha! So this tuple is both a vertex and a row! What does that mean? It means
that we can ask the data storage system how it encodes the RDBMS ADT. Suppose
it tells us:
{database}
table(string)
{table}
database()
rows()
select(string,predicate)
project(string...)
join(table, predicate…)
{row}
asTable()
table()
This specification says that it produces database, table, and row tuples that
support the respective bytecode instructions.
We can now interpret the tuple as a row and operate on it as such.
{row}.asTable().join(database().table('companies')).by('name','employee_name').values('industry')
In essence, if a “multi-model language” existed, the following
‘graph’/‘rdbms’-hybrid query would be possible:
g.V(1).out('knows').
  sql('SELECT industry
       FROM people, companies
       WHERE people.id=$id AND people.name=employee_name').by(id())
What is this Frankenstein data structure? It’s an RDBMS with foreign key
relations that can be traversed! Or it’s a graph database that supports dynamic
edge creation (i.e. vertex joins). Or it’s just the multi-model ADT being
sweeeeeeet.
A few points to realize (*** Kuppitz will like this ***):
1. When we have a {table} tuple, we simply have a proxy/reference to (e.g.) the
respective MySQL table.
2. The processing engine isn’t pulling in the table’s data.
3. Because the data storage system supports join(), we let it do the work
instead of having the processing engine do it.
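Points 1–3 amount to lazy pushdown: join() composes a description of work and hands it to the storage system rather than moving rows through the processing engine. A toy sketch (all names hypothetical):

```java
// Toy sketch of a table *proxy*: operations like join() only compose a
// description of the work; no rows move until results are actually iterated.
final class TableProxy {
    final String description; // e.g. the SQL the storage system would run
    TableProxy(String description) { this.description = description; }

    // join() returns a new proxy; the storage system will do the join itself.
    TableProxy join(TableProxy other, String leftCol, String rightCol) {
        return new TableProxy("(" + description + " JOIN " + other.description
                + " ON " + leftCol + " = " + rightCol + ")");
    }
}
```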
In conclusion.
Just like any other database ADT, our multi-model ADT can be used to define
data structures and operations. Unlike other database ADTs, however, ours can
also simulate them using their native lexicons and operations. Moreover, tuples
in our multi-model ADT can be used across multiple ADT embeddings! Finally, the
query language, processing engine, and data storage system stakeholders can all
be different entities whose technologies interact with each other via
multi-model ADT tuples and bytecode, where the TP4 VM speedily glues it all
together at query time using each component’s published ADT embeddings.
Thoughts?
Marko.
http://rredux.com