And 40+ years of programming languages research too :-)
> On Apr 25, 2019, at 12:36 PM, Joshua Shinavier <[email protected]> wrote:
>
> +10. Great to see structure and process coming together in this way. I think
> the algebraic and relational idiom will serve TP4 well. ADTs + constraints
> are all we need in order to expose a wide variety of structured data models
> to TinkerPop. Traditional property graphs, hypergraphs, relational DBs, and
> anything expressed in typical data interchange languages like Protobuf,
> Thrift, Avro, etc. Graph I/O and graph stream processing becomes easier,
> because a graph is just a set of tuples. There are fewer barriers to formal
> analysis of schemas, queries, and mappings, and new forms of optimization
> become possible. 40+ years of database research is now more immediately
> applicable, and so is the wide world of category theory. Agreed that there
> are individual details to nail down, but the broad strokes of this are
> looking pretty good.
>
> Josh
>
>
> On Thu, Apr 25, 2019 at 10:46 AM Marko Rodriguez <[email protected]> wrote:
> Hello,
>
> This email proposes a TP4 bytecode specification that is agnostic to the
> underlying data structure and thus, is both:
>
> 1. Turing Complete: the instruction set has process-oriented
> instructions capable of expressing any algorithm (universal processing).
> 2. Pointer-Based: the instruction set has data-oriented instructions
> for moving through referential links in memory (universal structuring).
>
> Turing Completeness has already been demonstrated for TinkerPop using the
> following sub-instruction set.
> union(), repeat(), choose(), etc. // i.e. the standard program flow
> instructions
>
> We will focus on the universal structuring aspect of this proposed bytecode
> spec. This work is founded on Josh Shinavier’s Category Theoretic approach to
> data structures. My contribution has been to reformulate his ideas according
> to the idioms and requirements of TP4 and then deduce a set of TP4-specific
> implementation details.
>
> TP4 REQUIREMENTS:
> 1. The TP4 VM should be able to process any data structure (not just
> property graphs).
> 2. The TP4 VM should respect the lexicon of the data structure (not
> just embed the data structure into a property graph).
> 3. The TP4 VM should allow query languages to naturally process their
> respective data structures (standards compliant language compilation).
>
> Here is a set of axioms defining the structures and processes of a universal
> data structure.
>
> THE UNIVERSAL STRUCTURE:
> 1. There are 2 data read instructions — select() and project().
> 2. There are 2 data write instructions — insert() and delete().
> 3. There are 3 sorts of data — tuples, primitives, and sequences.
> - Tuples can be thought of as “key/value maps.”
> - Primitives are doubles, floats, integers, booleans, Strings,
> etc.
> - Sequences are contiguous streams of tuples and/or primitives.
> 4 Tuple data is accessed via keys.
> - A key is a primitive used for referencing a value in the
> tuple. (not just String keys)
> - A tuple can not have duplicate keys.
> - Tuple values can be tuples, primitives, or sequences.
>
> Popular data structures can be defined as specializations of this universal
> structure. In other words, the data structures used by relational databases,
> graphdbs, triplestores, document databases, column stores, key/value stores,
> etc. all demand a particular set of constraints on the aforementioned axioms.
>
> ////////////////////////////////////////////////////////////
> /// A Schema-Oriented Multi-Relational Structure (RDBMS) ///
> ////////////////////////////////////////////////////////////
>
> RDBMS CONSTRAINTS ON THE UNIVERSAL STRUCTURE:
> 1. There are an arbitrary number of global tuple sequences (tables)
> 2. All tuple keys are Strings. (column names)
> 3. All tuple values are primitives. (row values)
> 4. All tuples in the same sequence have the same keys. (tables have
> predefined columns)
> 5. All tuples in the same sequence have the same primitive value type
> for the same key. (tables have predefined row value types)
>
> Assume the following tables in a relational database.
>
> vertices
> id label name age
> 1 person marko 29
> 2 person josh 35
>
> edges
> id outV label inV
> 0 1 knows 2
>
> An SQL query is presented and then the respective TP4 bytecode is provided
> (using fluent notation vs. [op,arg*]*).
>
> // SELECT * FROM vertices WHERE id=1
> select(‘vertices’).has(‘id’,1)
> => v[1]
> // SELECT name FROM vertices WHERE id=1
> select(‘vertices’).has(‘id’,1).project('name’)
> => "marko"
> // SELECT * FROM edges WHERE outV=1
> select('edges’).has('outV’,1)
> => e[0][v[1]-knows->v[2]]
> // SELECT * FROM edges WHERE outV=(SELECT 'id' FROM vertices WHERE
> name=‘marko')
> select('edges’).has(‘outV’,within(select(‘vertices’).has(‘name’,’marko’).project(‘id’)))
>
> => e[0][v[1]-knows->v[2]]
> // SELECT vertices.* FROM edges,vertices WHERE outV=1 AND id=inV
> select(‘vertices’).has(‘id’,1).select('edges').by('outV',eq('id')).select('vertices').by('id',eq('inV'))
> => v[2]
>
> VARIATIONS:
> 1. Relational databases that support JSON-blobs values can have nested
> “JSON-constrained” tuple values.
> 2. Relational databases the materialize views create new tuple
> sequences.
>
> //////////////////////////////////////////////////////////////////
> /// A Schema-Less, Recursive, Bi-Relational Structure (GRAPHDB) //
> //////////////////////////////////////////////////////////////////
>
> GRAPHDB CONSTRAINTS ON THE UNIVERSAL STRUCTURE:
> 1. There are two sorts of tuples: vertex and edge tuples.
> 2. All tuples have an id key and a label key.
> - id keys reference primitive values.
> - label keys reference String values.
> 3. All vertices are in a tuple sequence denoted “V”.
> 4. Vertex tuples have an inE and outE key each containing edge tuples.
> 5. Edge tuples have an outV and inV key each containing a single vertex
> tuple.
> 6. An arbitrary number of additional String-based tuple keys exist for
> both vertices and edges.
>
> Assume the “classic TinkerPop graph” where marko knows josh and they have
> names and ages, etc.
> http://tinkerpop.apache.org/docs/current/images/tinkerpop-modern.png
>
> A Gremlin query is presented and then the respective TP4 bytecode is provided
> (using fluent notation vs [op,arg*]*).
>
> // g.V(1)
> select(‘V’).has(‘id’,1)
> => v[1]
> // g.V(1).label()
> select(‘V’).has(‘id’,1).project('label’)
> => “person"
> // g.V(1).values(‘name’)
> select(‘V’).has(‘id’,1).project(‘name’)
> => “marko”
> // g.V(1).outE('knows').inV().values('name')
> select(‘V’).has(‘id’,1).project('outE').has('label','knows').project('inV').project('name’)
>
> => “josh”, “vadas”, “peter”
> // g.V(1).out('knows').values('name')
> select(‘V’).has(‘id’,1).project('outE').has('label','knows').project('inV').project('name’)
>
> => “josh”, “vadas”, “peter"
> // g.V().has(‘age’,gt(25)).values(‘age’).min()
> select(‘V’).has(‘age’,gt(25)).project(‘age’).min()
> => 29
> // g.V(1).as(‘a’).out(‘knows’).addE(‘likes’).to(‘a’)
> select(‘V’).has(‘id’,1).as(‘a’).
> project(‘outE’).has(‘label’,’knows’).
> project(‘inV’).
> project(‘inE’).insert(identity(),’likes’,path(‘a’))
> => e[7][v[2]-likes->v[1]], ...
>
> VARIATIONS:
> 1. Schema-based property graphs have the same tuple keys for all vertex
> (and edge) tuples of the same label.
> 2. Multi-labeled property graphs have String sequences for vertex
> labels.
> 3. Multi-property property graphs have primitive sequences for property
> keys.
> 4. Meta-property property graphs have an alternating sequence
> primitives (values) and tuples (properties).
> 5. Multi/meta-property property graphs support repeated values in the
> alternating sequences of the meta-property property graphs.
>
> //////////////////////////////////////////////////////////////////////////
> /// A Web-Schema, 3-Tuple Single Relational Structure (RDF TRIPLESTORE) //
> //////////////////////////////////////////////////////////////////////////
>
> RDF TRIPLESTORE CONSTRAINTS ON THE UNIVERSAL STRUCTURE:
> 1. There is only one type of tuple called a triple and it is denoted T.
> 2. All tuples have a s, p, and o key. (subject, predicate, object)
> - s keys reference String values representing URIs or blank
> nodes.
> - p keys reference String values representing URIs.
> - o keys reference String values representing URIs, blank
> nodes, or literals.
>
> A SPARQL query is presented and then the respective TP4 bytecode is provided
> (using fluent notation vs [op,arg*]*).
> - assume ex: is the namespace prefix for http://example.com#
>
> // SELECT ?x WHERE { ex:marko ex:name ?x }
> select(’T').by(’s’,is(‘ex:marko')).has(‘p’,’ex:name').project(‘o') ->
> "marko"^^xsd:string
> // SELECT ?y WHERE { ex:marko ex:knows ?x. ?x ex:name ?y }
> select(’T').by(’s’, is(‘ex:marko')).has(‘p’,’ex:knows’).
> select(’T’).by(’s’,eq(‘o’)).has(‘p’,’ex:name’).
> project(‘o’) -> “josh"^^xsd:string
>
> VARIATIONS:
> 1. Quadstores generalize RDF with a 4-tuple containing a ‘g’ key
> (called the "named graph").
> 2. RDF* generalizes RDF by extending T triples with an arbitrary number
> of String keys (thus, (spo+x*)-tuples)).
>
> ///////////////////////////////////////////////////////////////////////
> /// A Schema-Less, Type-Nested, 3-Relational Structure (DOCUMENTDB) //
> ///////////////////////////////////////////////////////////////////////
>
> DOCUMENTDB CONSTRAINTS ON THE UNIVERSAL STRUCTURE:
> 1. There are sorts of tuples: documents, lists, and maps.
> 2. Document tuples are in a sequence denoted D.
> 3. All document tuples have an _id key.
> 4. All tuple keys are string keys.
> 5. All tuple values are either primitives, lists, or maps. (JSON-style)
> - primitives are either long, double, String, or boolean.
>
> ///////////////////////////////////////////////////////////
> /// A Schema-Less Multi-Relational Structure (BIGTABLE) ///
> ///////////////////////////////////////////////////////////
>
> BIGTABLE CONSTRAINTS ON THE UNIVERSAL STRUCTURE: (CASSANDRA)
> 1. There are an arbitrary number of global tuple sequences. (column
> families)
> 2. All tuple keys are Strings. (column key)
> 3. All tuple values are primitives. (row values)
>
> /////////////////////////////////////////////////////////////////////////
> /// A Schema-Less 1-Tuple Single-Relational Structure (KEYVALUESTORE) ///
> /////////////////////////////////////////////////////////////////////////
>
> KEYVALUESTORE CONSTRAINTS ON THE UNIVERSAL STRUCTURE:
> 1. There is one global tuple sequence.
> 2. Each tuple has a single key/value.
> 3. All tuple keys must be unique.
>
> ————————————————————————————————————————————————————————
> ————————————————————————————————————————————————————————
> ————————————————————————————————————————————————————————
>
> WHY SEQUENCES?
>
> In property graphs, V is a tuple sequence.
>
> // g.V()
> select(‘V’)
> => v[1], v[2], v[3],...
> // g.V(1)
> select(‘V’).has(‘id’,1)
> => v[1]
> // g.V().has(‘name’,’marko’)
> select(‘V’).has(‘name’,’marko’)
> => v[1]
>
> If V were a tuple of vertex tuples, then V would be indexed by keys. This
> situation incorporates superfluous “indexing” which contains information
> auxiliary to the data structure.
>
> // g.V()
> select(‘V’).values()
> => v[1], v[2], v[3], ...
> // g.V(1)
> select(‘V’).project(‘1’)
> => v[1]
> // g.V().has(‘name’,’marko’)
> select(‘V’).values().has(‘name’,’marko’)
> => v[1]
>
> In property graphs, outE is a tuple sequence unique to each vertex.
>
> // g.V(1).outE()
> select(‘V’).has(‘id’,1).project(‘outE’)
> => e[0][v[1]-knows->v[2]]...
>
> In all structures, if x is a key and the value is a tuple, then x references
> a tuple sequence with a single tuple. In particular, for binary property
> graphs, project(‘inV’) is a single tuple sequence.
>
> // g.V(1).outE(’knows').inV()
> select(‘V’).has(‘id’,1).project(‘outE’).has(‘label’,’knows’).project(‘inV’)
> => e[0][v[1]-knows->v[2]]
>
> ————————————————————————————————————————————————————————
>
> JOINS VS. TRAVERSALS
>
> In relational databases, a join is the random-access movement from one tuple
> to another tuple given some predicate.
>
> // g.V().outE().label()
> // SELECT edges.label FROM vertices, edges WHERE vertices.id = edges.outV
> select(‘vertices’).select(‘edges’).by(id,eq(‘outV’)).project(‘label’)
> => knows, knows, created, ...
>
> In a graph database, the structure is assumed to be "pre-joined" and thus, a
> selection criteria is not needed.
>
> // g.V().outE().label()
> select(‘vertices’).project(‘outE’).project(‘label’)
> => knows, knows, created, …
>
> A latter section will discuss modeling one data structure within another data
> structure within the universal data structure. In such situations, joins may
> become traversals given Tuple data caching and function-based tuple values.
>
> ————————————————————————————————————————————————————————
>
> THE STRUCTURE PROVIDER OBJECTS OF A TP4 VIRTUAL MACHINE
>
> Data structure providers (i.e. database systems) interact with the TP4 VM via
> objects that implement:
>
> 1. Tuple
> - TupleSequence select(String key, Predicate join)
> - V project(K key)
> 2. TupleSequence
> - TupleSequence implements Iterator<Tuple>
> 3. and primitives such as String, long, int, etc.
>
> These interfaces may maintain complex behaviors that take unique advantage of
> the underlying data system (e.g. index use, function-oriented tuple values,
> data caching, etc.).
>
> This means that the serializers to/from TP4 need only handle the following
> object types:
>
> - Bytecode
> - Traverser
> - Primitives (String, long, int, etc.)
> - TupleProxy
> - TupleSequenceProxy
>
> This significantly simplifies the binary serialization format relative to TP3.
>
> Tuple’s are never serialized/deserialized as they may be complex,
> provider-specific objects with runtime functional behavior that interact with
> the underlying data system over sockets/etc. Thus, every Tuple implementation
> must have a corresponding TupleProxy which provides an appropriate
> “toString()” of the object and enough static data to reconnect the TupleProxy
> to the Tuple within the VM. This latter requirement is necessary for
> distributed processing infrastructures that migrate traversers between
> machines.
>
> - TupleProxy
> - Tuple attach(Structure provider)
> - String toString()
> - Tuple
> - TupleProxy detach()
>
> Similar methods exist for TupleSequence.
>
> ————————————————————————————————————————————————————————
>
> MODELING ONE DATA STRUCTURE WITHIN ANOTHER DATA STRUCTURE WITHIN THE
> UNIVERSAL DATA STRUCTURE
>
> The “shape of the data" determines the nature of the data structure. The
> question that arises is: What happens when a property graph is encoded in a
> relational database and property graph-oriented bytecode is used? This will
> yield exceptions as relational databases can not nest tables. To solve this
> in practice, it is possible to:
>
> 1. Rewrite the property graph bytecode to respective “property graph
> encoded” relational bytecode using a compiler strategy. (via a decoration
> strategy)
> 2. Develop Tuple objects that maintain functions for values that map
> the data accordingly.
>
> With respects to 2.
>
> VertexRow implements Tuple
>
> v = new VertexRow(1)
> v.project(‘outE’) // this method (for outE) computes the edge tuple sequence
> via SELECT * FROM edges WHERE outV=1
> => a bunch of EdgeRowTuples with similar function-based
> projection and selection operations.
>
> It is possible to model any data structure within any database system,
> regardless of the native data structure supported by the database. In TP4,
> databases are understood to be memory systems (like RAM) optimized for
> particular tuple-structures (e.g. sequential access vs. random access,
> index-based vs. nested based, cache support, etc.).
>
> ————————————————————————————————————————————————————————
>
> CUSTOM DATA STRUCTURE INSTRUCTIONS
>
> In order to make provider bytecode compilation strategies easy to write,
> DecorationStrategies can exist that translate “universal data structure”
> read/write instructions to “specific data structure” read/write instructions.
> For example, a property graph specific read/write instruction set is
> namespaced as “pg” and the following bytecode
>
> // g.V(1).out(‘knows’).values(‘name’)
> [[select,V] [has,id,eq(1)] [project,outE] [has,label,eq(knows)] [select,inV]
> [project,name]]
> ===strategizes to (via PropertyGraphDecorationStrategy)===>
> [[pg:V,1] [pg:out,knows] [pg:values, name]]
>
> Note that only read/write instructions need to be considered. All other
> instructions such as repeat(), choose(), count(), union(), is(), where(),
> match(), etc. are not effected by the underlying data structure.
>
> ————————————————————————————————————————————————————————
>
> There is still much more to discuss in terms of how all this relates to query
> languages and processors. While I haven’t considered all forks in the road, I
> have yet to hit any contradictions or inconsistencies when looking a
> few-steps ahead of where this email is now.
>
> Thank you for reading. Your thoughts are more than welcome.
>
> Take care,
> Marko.
>
> http://rredux.com
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "Gremlin-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/gremlin-users/6A0FF63D-8591-41AC-B024-2FF5629B4CD5%40gmail.com.
> For more options, visit https://groups.google.com/d/optout.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Gremlin-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/gremlin-users/CAPKNUSv2jPxbuuLBLO7F4JQGsU_x6Bp1HXPA2FVD8dWXOwx8Aw%40mail.gmail.com.
> For more options, visit https://groups.google.com/d/optout.