Hello,
This email proposes a TP4 bytecode specification that is agnostic to the
underlying data structure and thus, is both:
1. Turing Complete: the instruction set has process-oriented
instructions capable of expressing any algorithm (universal processing).
2. Pointer-Based: the instruction set has data-oriented instructions
for moving through referential links in memory (universal structuring).
Turing Completeness has already been demonstrated for TinkerPop using the
following sub-instruction set.
union(), repeat(), choose(), etc. // i.e. the standard program flow
instructions
We will focus on the universal structuring aspect of this proposed bytecode
spec. This work is founded on Josh Shinavier’s Category Theoretic approach to
data structures. My contribution has been to reformulate his ideas according to
the idioms and requirements of TP4 and then deduce a set of TP4-specific
implementation details.
TP4 REQUIREMENTS:
1. The TP4 VM should be able to process any data structure (not just
property graphs).
2. The TP4 VM should respect the lexicon of the data structure (not
just embed the data structure into a property graph).
3. The TP4 VM should allow query languages to naturally process their
respective data structures (standards compliant language compilation).
Here is a set of axioms defining the structures and processes of a universal
data structure.
THE UNIVERSAL STRUCTURE:
1. There are 2 data read instructions — select() and project().
2. There are 2 data write instructions — insert() and delete().
3. There are 3 sorts of data — tuples, primitives, and sequences.
- Tuples can be thought of as “key/value maps.”
- Primitives are doubles, floats, integers, booleans, Strings,
etc.
- Sequences are contiguous streams of tuples and/or primitives.
4 Tuple data is accessed via keys.
- A key is a primitive used for referencing a value in the
tuple. (not just String keys)
- A tuple can not have duplicate keys.
- Tuple values can be tuples, primitives, or sequences.
Popular data structures can be defined as specializations of this universal
structure. In other words, the data structures used by relational databases,
graphdbs, triplestores, document databases, column stores, key/value stores,
etc. all demand a particular set of constraints on the aforementioned axioms.
////////////////////////////////////////////////////////////
/// A Schema-Oriented Multi-Relational Structure (RDBMS) ///
////////////////////////////////////////////////////////////
RDBMS CONSTRAINTS ON THE UNIVERSAL STRUCTURE:
1. There are an arbitrary number of global tuple sequences (tables)
2. All tuple keys are Strings. (column names)
3. All tuple values are primitives. (row values)
4. All tuples in the same sequence have the same keys. (tables have
predefined columns)
5. All tuples in the same sequence have the same primitive value type
for the same key. (tables have predefined row value types)
Assume the following tables in a relational database.
vertices
id label name age
1 person marko 29
2 person josh 35
edges
id outV label inV
0 1 knows 2
An SQL query is presented and then the respective TP4 bytecode is provided
(using fluent notation vs. [op,arg*]*).
// SELECT * FROM vertices WHERE id=1
select(‘vertices’).has(‘id’,1)
=> v[1]
// SELECT name FROM vertices WHERE id=1
select(‘vertices’).has(‘id’,1).project('name’)
=> "marko"
// SELECT * FROM edges WHERE outV=1
select('edges’).has('outV’,1)
=> e[0][v[1]-knows->v[2]]
// SELECT * FROM edges WHERE outV=(SELECT 'id' FROM vertices WHERE name=‘marko')
select('edges’).has(‘outV’,within(select(‘vertices’).has(‘name’,’marko’).project(‘id’)))
=> e[0][v[1]-knows->v[2]]
// SELECT vertices.* FROM edges,vertices WHERE outV=1 AND id=inV
select(‘vertices’).has(‘id’,1).select('edges').by('outV',eq('id')).select('vertices').by('id',eq('inV'))
=> v[2]
VARIATIONS:
1. Relational databases that support JSON-blobs values can have nested
“JSON-constrained” tuple values.
2. Relational databases the materialize views create new tuple
sequences.
//////////////////////////////////////////////////////////////////
/// A Schema-Less, Recursive, Bi-Relational Structure (GRAPHDB) //
//////////////////////////////////////////////////////////////////
GRAPHDB CONSTRAINTS ON THE UNIVERSAL STRUCTURE:
1. There are two sorts of tuples: vertex and edge tuples.
2. All tuples have an id key and a label key.
- id keys reference primitive values.
- label keys reference String values.
3. All vertices are in a tuple sequence denoted “V”.
4. Vertex tuples have an inE and outE key each containing edge tuples.
5. Edge tuples have an outV and inV key each containing a single vertex
tuple.
6. An arbitrary number of additional String-based tuple keys exist for
both vertices and edges.
Assume the “classic TinkerPop graph” where marko knows josh and they have names
and ages, etc.
http://tinkerpop.apache.org/docs/current/images/tinkerpop-modern.png
<http://tinkerpop.apache.org/docs/current/images/tinkerpop-modern.png>
A Gremlin query is presented and then the respective TP4 bytecode is provided
(using fluent notation vs [op,arg*]*).
// g.V(1)
select(‘V’).has(‘id’,1)
=> v[1]
// g.V(1).label()
select(‘V’).has(‘id’,1).project('label’)
=> “person"
// g.V(1).values(‘name’)
select(‘V’).has(‘id’,1).project(‘name’)
=> “marko”
// g.V(1).outE('knows').inV().values('name')
select(‘V’).has(‘id’,1).project('outE').has('label','knows').project('inV').project('name’)
=> “josh”, “vadas”, “peter”
// g.V(1).out('knows').values('name')
select(‘V’).has(‘id’,1).project('outE').has('label','knows').project('inV').project('name’)
=> “josh”, “vadas”, “peter"
// g.V().has(‘age’,gt(25)).values(‘age’).min()
select(‘V’).has(‘age’,gt(25)).project(‘age’).min()
=> 29
// g.V(1).as(‘a’).out(‘knows’).addE(‘likes’).to(‘a’)
select(‘V’).has(‘id’,1).as(‘a’).
project(‘outE’).has(‘label’,’knows’).
project(‘inV’).
project(‘inE’).insert(identity(),’likes’,path(‘a’))
=> e[7][v[2]-likes->v[1]], ...
VARIATIONS:
1. Schema-based property graphs have the same tuple keys for all vertex
(and edge) tuples of the same label.
2. Multi-labeled property graphs have String sequences for vertex
labels.
3. Multi-property property graphs have primitive sequences for property
keys.
4. Meta-property property graphs have an alternating sequence
primitives (values) and tuples (properties).
5. Multi/meta-property property graphs support repeated values in the
alternating sequences of the meta-property property graphs.
//////////////////////////////////////////////////////////////////////////
/// A Web-Schema, 3-Tuple Single Relational Structure (RDF TRIPLESTORE) //
//////////////////////////////////////////////////////////////////////////
RDF TRIPLESTORE CONSTRAINTS ON THE UNIVERSAL STRUCTURE:
1. There is only one type of tuple called a triple and it is denoted T.
2. All tuples have a s, p, and o key. (subject, predicate, object)
- s keys reference String values representing URIs or blank
nodes.
- p keys reference String values representing URIs.
- o keys reference String values representing URIs, blank
nodes, or literals.
A SPARQL query is presented and then the respective TP4 bytecode is provided
(using fluent notation vs [op,arg*]*).
- assume ex: is the namespace prefix for http://example.com#
<http://example.com/#>
// SELECT ?x WHERE { ex:marko ex:name ?x }
select(’T').by(’s’,is(‘ex:marko')).has(‘p’,’ex:name').project(‘o') ->
"marko"^^xsd:string
// SELECT ?y WHERE { ex:marko ex:knows ?x. ?x ex:name ?y }
select(’T').by(’s’, is(‘ex:marko')).has(‘p’,’ex:knows’).
select(’T’).by(’s’,eq(‘o’)).has(‘p’,’ex:name’).
project(‘o’) -> “josh"^^xsd:string
VARIATIONS:
1. Quadstores generalize RDF with a 4-tuple containing a ‘g’ key
(called the "named graph").
2. RDF* generalizes RDF by extending T triples with an arbitrary number
of String keys (thus, (spo+x*)-tuples)).
///////////////////////////////////////////////////////////////////////
/// A Schema-Less, Type-Nested, 3-Relational Structure (DOCUMENTDB) //
///////////////////////////////////////////////////////////////////////
DOCUMENTDB CONSTRAINTS ON THE UNIVERSAL STRUCTURE:
1. There are sorts of tuples: documents, lists, and maps.
2. Document tuples are in a sequence denoted D.
3. All document tuples have an _id key.
4. All tuple keys are string keys.
5. All tuple values are either primitives, lists, or maps. (JSON-style)
- primitives are either long, double, String, or boolean.
///////////////////////////////////////////////////////////
/// A Schema-Less Multi-Relational Structure (BIGTABLE) ///
///////////////////////////////////////////////////////////
BIGTABLE CONSTRAINTS ON THE UNIVERSAL STRUCTURE: (CASSANDRA)
1. There are an arbitrary number of global tuple sequences. (column
families)
2. All tuple keys are Strings. (column key)
3. All tuple values are primitives. (row values)
/////////////////////////////////////////////////////////////////////////
/// A Schema-Less 1-Tuple Single-Relational Structure (KEYVALUESTORE) ///
/////////////////////////////////////////////////////////////////////////
KEYVALUESTORE CONSTRAINTS ON THE UNIVERSAL STRUCTURE:
1. There is one global tuple sequence.
2. Each tuple has a single key/value.
3. All tuple keys must be unique.
————————————————————————————————————————————————————————
————————————————————————————————————————————————————————
————————————————————————————————————————————————————————
WHY SEQUENCES?
In property graphs, V is a tuple sequence.
// g.V()
select(‘V’)
=> v[1], v[2], v[3],...
// g.V(1)
select(‘V’).has(‘id’,1)
=> v[1]
// g.V().has(‘name’,’marko’)
select(‘V’).has(‘name’,’marko’)
=> v[1]
If V were a tuple of vertex tuples, then V would be indexed by keys. This
situation incorporates superfluous “indexing” which contains information
auxiliary to the data structure.
// g.V()
select(‘V’).values()
=> v[1], v[2], v[3], ...
// g.V(1)
select(‘V’).project(‘1’)
=> v[1]
// g.V().has(‘name’,’marko’)
select(‘V’).values().has(‘name’,’marko’)
=> v[1]
In property graphs, outE is a tuple sequence unique to each vertex.
// g.V(1).outE()
select(‘V’).has(‘id’,1).project(‘outE’)
=> e[0][v[1]-knows->v[2]]...
In all structures, if x is a key and the value is a tuple, then x references a
tuple sequence with a single tuple. In particular, for binary property graphs,
project(‘inV’) is a single tuple sequence.
// g.V(1).outE(’knows').inV()
select(‘V’).has(‘id’,1).project(‘outE’).has(‘label’,’knows’).project(‘inV’)
=> e[0][v[1]-knows->v[2]]
————————————————————————————————————————————————————————
JOINS VS. TRAVERSALS
In relational databases, a join is the random-access movement from one tuple to
another tuple given some predicate.
// g.V().outE().label()
// SELECT edges.label FROM vertices, edges WHERE vertices.id = edges.outV
select(‘vertices’).select(‘edges’).by(id,eq(‘outV’)).project(‘label’)
=> knows, knows, created, ...
In a graph database, the structure is assumed to be "pre-joined" and thus, a
selection criteria is not needed.
// g.V().outE().label()
select(‘vertices’).project(‘outE’).project(‘label’)
=> knows, knows, created, …
A latter section will discuss modeling one data structure within another data
structure within the universal data structure. In such situations, joins may
become traversals given Tuple data caching and function-based tuple values.
————————————————————————————————————————————————————————
THE STRUCTURE PROVIDER OBJECTS OF A TP4 VIRTUAL MACHINE
Data structure providers (i.e. database systems) interact with the TP4 VM via
objects that implement:
1. Tuple
- TupleSequence select(String key, Predicate join)
- V project(K key)
2. TupleSequence
- TupleSequence implements Iterator<Tuple>
3. and primitives such as String, long, int, etc.
These interfaces may maintain complex behaviors that take unique advantage of
the underlying data system (e.g. index use, function-oriented tuple values,
data caching, etc.).
This means that the serializers to/from TP4 need only handle the following
object types:
- Bytecode
- Traverser
- Primitives (String, long, int, etc.)
- TupleProxy
- TupleSequenceProxy
This significantly simplifies the binary serialization format relative to TP3.
Tuple’s are never serialized/deserialized as they may be complex,
provider-specific objects with runtime functional behavior that interact with
the underlying data system over sockets/etc. Thus, every Tuple implementation
must have a corresponding TupleProxy which provides an appropriate “toString()”
of the object and enough static data to reconnect the TupleProxy to the Tuple
within the VM. This latter requirement is necessary for distributed processing
infrastructures that migrate traversers between machines.
- TupleProxy
- Tuple attach(Structure provider)
- String toString()
- Tuple
- TupleProxy detach()
Similar methods exist for TupleSequence.
————————————————————————————————————————————————————————
MODELING ONE DATA STRUCTURE WITHIN ANOTHER DATA STRUCTURE WITHIN THE UNIVERSAL
DATA STRUCTURE
The “shape of the data" determines the nature of the data structure. The
question that arises is: What happens when a property graph is encoded in a
relational database and property graph-oriented bytecode is used? This will
yield exceptions as relational databases can not nest tables. To solve this in
practice, it is possible to:
1. Rewrite the property graph bytecode to respective “property graph
encoded” relational bytecode using a compiler strategy. (via a decoration
strategy)
2. Develop Tuple objects that maintain functions for values that map
the data accordingly.
With respects to 2.
VertexRow implements Tuple
v = new VertexRow(1)
v.project(‘outE’) // this method (for outE) computes the edge tuple sequence
via SELECT * FROM edges WHERE outV=1
=> a bunch of EdgeRowTuples with similar function-based
projection and selection operations.
It is possible to model any data structure within any database system,
regardless of the native data structure supported by the database. In TP4,
databases are understood to be memory systems (like RAM) optimized for
particular tuple-structures (e.g. sequential access vs. random access,
index-based vs. nested based, cache support, etc.).
————————————————————————————————————————————————————————
CUSTOM DATA STRUCTURE INSTRUCTIONS
In order to make provider bytecode compilation strategies easy to write,
DecorationStrategies can exist that translate “universal data structure”
read/write instructions to “specific data structure” read/write instructions.
For example, a property graph specific read/write instruction set is namespaced
as “pg” and the following bytecode
// g.V(1).out(‘knows’).values(‘name’)
[[select,V] [has,id,eq(1)] [project,outE] [has,label,eq(knows)] [select,inV]
[project,name]]
===strategizes to (via PropertyGraphDecorationStrategy)===>
[[pg:V,1] [pg:out,knows] [pg:values, name]]
Note that only read/write instructions need to be considered. All other
instructions such as repeat(), choose(), count(), union(), is(), where(),
match(), etc. are not effected by the underlying data structure.
————————————————————————————————————————————————————————
There is still much more to discuss in terms of how all this relates to query
languages and processors. While I haven’t considered all forks in the road, I
have yet to hit any contradictions or inconsistencies when looking a few-steps
ahead of where this email is now.
Thank you for reading. Your thoughts are more than welcome.
Take care,
Marko.
http://rredux.com <http://rredux.com/>