Hi Marko,

Thanks for the detailed emails. Responses inline.


On Thu, May 2, 2019 at 6:40 AM Marko Rodriguez <okramma...@gmail.com> wrote:

> [...]
> Thus, there exists a data model that can describe these database
> structures in a database agnostic manner.
>         - not in terms of tables, vertices, JSON, column families, …
>

100% with you on this.



> While we call this a “universal model” it is NOT more “general”
> (theoretically powerful) than any other database structure.
>

I agree. We should be trying harder to find equivalences, as opposed to
introducing a "bigger, better, brand-new shiny" data model.



> Reasons for creating a “universal model”.
>
>         1. To have a reduced set of objects for the TP4 VM to consider.
>                 - edges are just vertices with one incoming and outgoing
> “edge.”
>

Kinda. Let's say edges are elements with two fields. Vertices are elements
with no fields.
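To make the field-counting concrete, here is a minimal sketch (class and field names are hypothetical, not TP4 API) of a universal model where vertices and edges differ only in how many structural fields they carry:

```python
# Sketch (hypothetical names): every datum is an "element" -- a labeled
# bundle of structural fields plus non-structural properties.

class Element:
    def __init__(self, label, fields=None, properties=None):
        self.label = label                  # element type, e.g. "person", "knows"
        self.fields = fields or {}          # structural fields, e.g. outV/inV
        self.properties = properties or {}  # property key/value data

# A vertex is an element with no fields; an edge has exactly two.
marko = Element("person", properties={"name": "marko", "age": 29})
vadas = Element("person", properties={"name": "vadas", "age": 27})
knows = Element("knows", fields={"outV": marko, "inV": vadas},
                properties={"weight": 0.5})

assert len(marko.fields) == 0   # vertex: zero structural fields
assert len(knows.fields) == 2   # edge: two structural fields
```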



>                 - a column family is just a “map” of rows which are just
> “maps.”
>

Kinda. Let's say a table / column family is a data type with a number of
fields. Equivalently, it is a relation with a number of columns. You
brought up a good point in your previous email w.r.t. "person" vs.
"people", but that's why mappings are needed. A trivial schema mapping
gives you an element type "person" from a relation/table "people" and vice
versa. The table and the type are equivalent.
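A toy version of such a trivial schema mapping (the table and type names are illustrative, not a fixed vocabulary):

```python
# Sketch: a trivial schema mapping between relation/table names and
# element type names, invertible in both directions.

table_to_type = {"people": "person", "projects": "project"}
type_to_table = {v: k for k, v in table_to_type.items()}

assert table_to_type["people"] == "person"
assert type_to_table["person"] == "people"
```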



>                 - tables are just groupings of schema-equivalent rows.
>

Agreed. The "universal model" just makes an element out of each row.



>         2. To have a limited set of instructions in the TP4 bytecode
> specification.
>                 - outE/inE/outV/inV are just following direct “links”
> between objects.
>

inV and outV, yes, because they are fields of an edge element. outE and inE
are different, because they are not fields of the vertex. However, they are
functions. You can put them in the same namespace as inV and outV if you
want to; just keep in mind that in terms of relational algebra, they are a
fundamentally different operation.
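The distinction can be sketched like this (relation and function names hypothetical): inV/outV read a field off an edge in O(1), while outE is a selection over the edge relation:

```python
# Sketch: inV is a field projection on an edge row; outE is a relational
# selection over the "knows" relation -- a fundamentally different operation.

edges = [  # the "knows" relation as a list of rows
    {"outV": 1, "inV": 2, "weight": 0.5},
    {"outV": 1, "inV": 4, "weight": 1.0},
]

def in_v(edge):
    return edge["inV"]  # field projection: constant-time lookup

def out_e(vertex_id):
    # relational selection: scans (or indexes) the edge relation
    return [e for e in edges if e["outV"] == vertex_id]

assert in_v(edges[0]) == 2
assert [e["inV"] for e in out_e(1)] == [2, 4]
```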



>                 - has(), values(), keys(), valueMap(), etc. need not just
> apply to vertices and edges.
>

Agreed.



>         3. To have a simple serialization format.
>                 - we do not want to ship around
> rows/vertices/edges/documents/columns/etc.
>                 - we want to make it easy for other languages to integrate
> with the TP4 VM.
>                 - we want to make it easy to create TP4 VMs in other
> languages.
>

What is easier than a table? Any finite graph in this model is just a
collection of tables which can be shipped around as CSVs, among other
formats.
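As a sketch of how cheap that shipping is (table and column names hypothetical), a relation round-trips through a plain CSV with nothing database-specific involved:

```python
# Sketch: one relation of the graph serialized to CSV and read back.
import csv
import io

person = [{"id": "1", "name": "marko"}, {"id": "2", "name": "vadas"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(person)

# Round-trip: the table survives the trip through CSV unchanged.
rows = list(csv.DictReader(io.StringIO(buf.getvalue())))
assert rows == person
```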



>         4. To have a theoretical understanding of the relationship between
> the various data structures.
>                 - “this is just a that” is useful to limit the
> complexities of our codebase and explain to the public how different
> databases relate.
>

Yes.



> [...]
> The objects:
>         1. primitives: floats, doubles, Strings, ints, etc.
>

Yes.



>         2. tuples: key’d collections of primitives. (instances)
>         3. relations: groupings of tuples with ?equivalent? schemas.
> (types)
>

These are the same thing. A tuple is a row, is an element. A relation is a
set of elements/tuples/rows of the same type.

One more thing is needed: disjoint unions. I described these in my email on
algebraic property graphs. They are the "plus" operator to complement the
"times" operator in our type algebra. A disjoint union type is just like a
tuple type, but instead of having values for field a AND field b AND field
c, an instance of a union type has a value for field a XOR field b XOR
field c. Let me know if you are not completely sold on union types, and I
will provide additional motivation.
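A sketch of the "times"/"plus" algebra in ordinary typed code (all type names hypothetical): a tuple type holds a value for field a AND field b; a disjoint union holds a value for exactly one alternative, and functions over the union must be well-typed for every variant:

```python
# Sketch: product ("times") types vs. a disjoint union ("plus") type.
from dataclasses import dataclass
from typing import Union

@dataclass
class Person:       # product: name AND age
    name: str
    age: int

@dataclass
class Project:      # product: name AND language
    name: str
    language: str

# Sum: a value of this type is a Person XOR a Project.
NameOwner = Union[Person, Project]

def name_of(x: NameOwner) -> str:
    return x.name   # well-typed for every variant of the union

assert name_of(Person("marko", 29)) == "marko"
assert name_of(Project("lop", "java")) == "lop"
```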



> The instructions:
>         1. relations can be “queried” for matching tuples.
>

Yes.



>         2. tuple values can be projected out to yield primitives.
>

Or other tuples, or tagged values. E.g. any edge projects to two vertices,
which are (trivial) tuples as opposed to primitive values.


> Let’s do a “traversal” from marko to the people he knows.
>
> // g.V().has(‘name’,’marko’).outE(‘knows’).inV().values(‘name’)
>
> db(‘person’).has(‘name’,’marko’).as(‘x’).
> db(‘knows’).has(‘#outV’, path(‘x’).by(‘#id’)).as(‘y’).
> db(‘person’).has(‘#id’, path(‘y’).by(‘#inV’)).
>   values(‘name’)
>

I still don't think we need the "db" step, but I think that syntax works --
you are distinguishing between fields and higher-order things like
properties by using hash characters for the field names.
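Spelled out over plain relations (table and field names follow the example above; the evaluation strategy is a deliberately naive sketch), the pattern match the bytecode describes is just three chained selections:

```python
# Sketch: the db()-style pattern match above, evaluated naively over tables.
person = [{"#id": 1, "name": "marko", "age": 29},
          {"#id": 2, "name": "vadas", "age": 27}]
knows  = [{"#outV": 1, "#inV": 2, "weight": 0.5}]

# g.V().has('name','marko').outE('knows').inV().values('name')
xs = [p for p in person if p["name"] == "marko"]                        # as('x')
ys = [k for k in knows for x in xs if k["#outV"] == x["#id"]]           # as('y')
names = [p["name"] for p in person for y in ys if p["#id"] == y["#inV"]]

assert names == ["vadas"]
```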



> While the above is a single stream of processing, I will state what each
> line above has at that point in the stream.
>         - [#label:person,name:marko,age:29]
>

Keeping in mind that "name" and "age" are property keys as opposed to
fields, yes.



>         - [#label:knows,#outV:1,#inV:2,weight:0.5], ...
>         - [#label:person,name:vadas,age:27], ...
>         - vadas, ...
>

OK.



> Database strategies can be smart enough to realize that only the #id or #inV or
> #outV of the previous object is required and thus, limit what is actually
> accessed and flow’d through the processing engine.
>         - [#id:1]
>         - [#id:0,#inV:2] …
>         - [#id:2,name:vadas] …
>         - vadas, ...
>

OK.



> *** More on such compiler optimizations (called strategies) later ***
>
> POSITIVE NOTES:
>
>         1. All relations are ‘siblings’ accessed via db().
>

Grumble... db() is just an alias for select()... grumble...



>                 - There is no concept of nesting data. A very flat
> structure.
>

Agreed.



>         2. All subsequent has()/where()/is()/etc.-filter steps after db()
> define the pattern match query.
>                 - It is completely up to the database to determine how to
> retrieve matching tuples.
>                 - For example: using indices, pointer chasing, linear
> scans w/ filter, etc.
>

And yet I think we should make certain indices explicit, as motivated in an
earlier email. That lets us do a certain amount of query optimization at
the TP level, as opposed to leaving all optimizations to the underlying
database.



>         3. All subsequent map()/flatmap()/etc. steps are projections of
> data in the tuple.
>                 - The database returns key’d tuples composed of primitives.
>                 - Primitive data can be accessed and further processed.
> (projections)
>

Cool.



>         4. The bytecode describes a computation that is irrespective of
> the underlying database’s encoding of that structure.
>                 - Amazon Neptune, MySQL, Cassandra, Spark, Hadoop, Ignite,
> etc. can be fed the same bytecode and will yield the same result.
>                 - In other words, given the example above, all databases
> can now process property graph traversals.
>

+1



> NEGATIVE NOTES:
>
>         1. Every database has to have a concept of grouping similar tuples.
>

Sure. That is the same as saying that every database needs to respect a
type system.



>         2. It implies an a priori definition of types (at least their
> existence and how to map data to them).
>

It does.



>         3. It implies a particular type of data model even though it’s
> represented using the “universal model."
>                 - the example above is a “property graph query” because of
> #outV, #inV, etc. schema’d keys.
>                 - the above example is a “vertex/edge-labeled property
> graph query” because of the ‘person’ and ‘knows’ relations.
>                 - the above example implies that keys are unique to
> relations. (e.g. name=marko — why db(‘person’)?)
>                         - though db().has(‘name’,’marko’) can be used to
> search all relations.
>

Here, we are kind of mixing fields with property keys. Yes,
db().has('name', 'marko') can be used to search for elements of any type...
if that type agrees with the out-type of the "name" relation. In my
TinkerPop Classic example, the out type of "name" is (Person OR Project),
so your query will get you people or projects.



>         4. It requires the use of path()-data.
>                 - though we could come up with an efficient
> traverser.last() which returns the previous object touched.
>                 - However, for multi-db() relation matches, as().path()
> will have to be used.
>                         - This can be optimized out by property graph
> databases as they support pointer chasing. (** more on this later **)
>

I don't see how it requires path() data, but adhering to a type system will
mean that we have well-typed paths.


> We can relax ‘a priori’ typing to enable ’name’=‘marko’ to be in any
> relation group, not just people relations. Also, let’s use the concept of
> last() from (4).
>

Which is to say that we define the out-type of "name" to be the disjoint
union of all element types. The type becomes trivial. However, we can also
be more selective if we want to, restricting "name" only to a small subset
of types.



> [...]
> We can make typing completely dynamic and thus, relation groups don’t
> exist in the “universal model.” Thus, databases don’t have to even have a
> concept of groups of relations. However, databases can have relation groups
> via “indices" on #type, #type+#label, etc.
>

I agree that many applications will want a very relaxed type system. Others
will want to be more restrictive, which makes a lot more static analysis
possible. Union types provide a spectrum between dynamic and static typing.



> [...]
> The above really states that we are dealing with a “vertex/edge-labeled
> property graph”. This is not bad, because we already had the problem of the
> existence of #inV/#outE/etc. so this isn’t any more limiting. Next, TP4
> bytecode is starting to look like SPARQL pattern matching. There are tuples
> and we are matching patterns where data in some tuple equals (or general
> predicate) data in another tuple, etc. The “universal model” is just a
> sequence of key’d tuples with variable keys and lengths! (like an n-tuple
> store).
>

Yes indeed.


> [...]
> All integrating database providers must support the “universal model" db()
> instruction. It’s easy to implement, but inefficient because bytecode
> using that instruction requires a bunch of back-and-forths of data from DB
> to TP4 VM. Thus, TP4 will provide strategies to map db().filter()*-bytecode
> (i.e. universal model instructions) to instructions that respect their
> native structure.
>

If you mean that we will want to push operations down to the DB when
possible (for example, SQL queries for a relational database, SPARQL
queries for an RDF triple store with its own optimizations, etc.) then I
agree.



> Every database provider implements the TP4 interfaces that captures their
> native database encoding.
>         - For example, RDBMS:
> https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdbms
>         - For example, Property Graph:
> https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/graph
>         - For example, RDF:
> https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdf
>         - For example, Wide-Column…
>         - For example, Document…
>         - For example, HyperGraph…
>         - etc.
>

Good idea. TP4 can provide several "flavors" of interfaces, each of which
is idiomatic for each major class of database provider. Meeting the
providers halfway will make integration that much easier.



> TP4 will have lots of these interface packages (which will also include
> compiler strategies and instructions).
>

+1



> The db()-filter()* “universal model” bytecode is submitted to the TP4 VM.
> The TP4 VM looks at the integrated databases’ native structure (according
> to the interfaces it implements) and rewrites all db().filter()*-aspects of
> the submitted bytecode to a database-type specific instruction set that:
>         1. respects the semantics of the underlying database encoding.
>         2. respects the semantics of TP4’s stream processing (i.e.
> linear/nested functions).
> For example, the previous “universal model" bytecode is rewritten for each
> database type as:
>
> Property graphs:
>         pg:V().has(‘name’,’marko’).pg:outE(‘knows’).pg:inV().values(‘name’)
>

I think we will see steps like V() and R() in Gremlin, but do not need them
in bytecode. Again, db() is just select(), V() is just select(), etc. The
model-specific interfaces adapt V() to select() etc.



> RDBMS:
>   rdbms:R(‘person’).has(‘name’,’marko’).
>     join(rdbms:R(‘knows’)).by(’#id’,eq(‘#outV’)).
>     join(rdbms:R(‘person’)).by(‘#inV’,eq(‘#id’)).values(‘name’)
>
> RDF:
>   rdf:T().has(’p’,’rdf:type’).has(‘o’,’foaf:Person’).as(‘a’).
>
> rdf:T().has(’s’,path(‘a’).by(’s’)).has(‘p’,’foaf:name’).has(‘o’,’marko^^xsd:string’).
>   rdf:T().has(’s’,path(‘a').by(’s’)).has(‘p’,’#outE’).as(‘b’).
>
> rdf:T().has(’s’,path(‘b').by(’o’)).has(‘p’,’rdf:type’).has(‘o’,’foaf:knows’).as(‘c’).
>   rdf:T().has(’s’,path(‘c’).by(‘o’)).has(‘p’,’#inV’).as(‘d’).
>   rdf:T().has(’s’,path(‘d’).by(‘o’)).has(’p’,’rdf:name’).values(‘o’)
>


Same comments. In the case of RDF, we may even want a deeper integration of
RDF types with APG / the universal model. E.g. the first expression just
becomes:

    select(foafPerson)

The second expression becomes:

    value("marko").select(foafName, "in").project("out")

...which you can rewrite with has(); I just think the above is clear w.r.t.
low-level operations. The value() is just providing a start of "marko",
which is a string value. No need for xsd:string if we have a deep mapping
between RDF and APG.
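One possible reading of that select/project pair in plain code (predicate names and the "in"/"out" orientation are assumptions, following the APG-style convention that a property edge's "out" side is the element and its "in" side is the value):

```python
# Sketch: start from the value "marko", select foaf:name triples whose
# "in" side (object) matches it, and project the "out" side (subject).
triples = [("v1", "foaf:name", "marko"),
           ("v2", "foaf:name", "vadas")]

def select_in(predicate, value):
    # triples of the given predicate whose object equals the start value
    return [t for t in triples if t[1] == predicate and t[2] == value]

def project_out(matches):
    return [t[0] for t in matches]  # the subject of each matching triple

assert project_out(select_in("foaf:name", "marko")) == ["v1"]
```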



> Next, TP4 will have strategies that can be generally applied to each
> database-type to further optimize the bytecode.
>
> Property graphs:
>         pg:V().has(‘name’,’marko’).pg:out(‘knows’).values(‘name’)
>
> RDBMS:
>         rdbms:sql(“SELECT name FROM person,knows,person WHERE p1.id=knows.inV
> …”)
>
> RDF:
>         rdf:sparql(“SELECT ?e WHERE { ?x rdf:type foaf:Person. ?x
> foaf:name marko^^xsd …”)
>

Yes, nice. We can even take things a step further and decouple the query
language from the database. Have a property graph database, but want to
evaluate SPARQL? No problem. Have a relational database but want to run
Gremlin traversals? No worries.



> Finally, vendors can then apply their custom strategies. For instance, for
> JanusGraph:
>
>
> jg:v-index(’name’,’marko’,grab(‘out-edges')).jg:out(‘knows’,grab(‘in-vertex’,’name-property').values(‘name’)
>

Sure.



> * The “universal model” instruction set must be supported by every
> database type. [all databases]
>

+1



> * The database-type specific instructions (e.g. V(), sparql(), sql(),
> out(), etc.) are only required to be understood by databases that implement
> that type interface. [database class]
>

+0.5. I think the model-specific interfaces are a great idea, but that
doesn't mean we can't invoke sparql() on a non-RDF database, can't use V()
for Gremlin-style traversals over non-PG databases, etc. +1 to
vendor-specific strategies. Not sure about vendor-specific instructions; a
lot can be done in the mapping of relations to instructions, which lives
entirely within the black box of the vendor code.



> * All vendor-specific instructions (e.g. jg:v-index()) are only required
> to be understood by that particular database. [database instance]
>

Back to indexes. IMO there should be a vendor-neutral API. Even extremely
vendor-specific indexes like geotemporal indexes could be exposed through a
common API, e.g.

    select("Dropoffs", {lat:37.7740, lon:122.4149, time:1556899302149})

which resolves to a vendor-specific index.
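A sketch of what such a vendor-neutral surface could look like (registry, index name, and parameters are all hypothetical): the caller names an index and supplies constraints; how those resolve to a native index is the vendor's business.

```python
# Sketch: a vendor-neutral index lookup dispatching to vendor code.
indexes = {}  # registry: index name -> vendor-supplied lookup function

def register_index(name, fn):
    indexes[name] = fn

def select(name, params):
    return indexes[name](params)  # resolves to the vendor-specific index

# A toy "Dropoffs" geotemporal index, backed here by a linear scan.
dropoffs = [{"lat": 37.7740, "lon": 122.4149,
             "time": 1556899302149, "id": "a"}]
register_index("Dropoffs",
               lambda p: [d for d in dropoffs
                          if all(d[k] == v for k, v in p.items())])

hits = select("Dropoffs", {"lat": 37.7740, "lon": 122.4149})
assert [h["id"] for h in hits] == ["a"]
```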



> [...]
> The million dollar question:
>
>         "Why would you want to encode an X data structure into a database
> that natively supports a Y data structure?”
> [...]
>

I agree with your answers. Not much to add.



> And there you have it — I believe Apache TinkerPop is on the verge of
> offering a powerful new data(base) theory and technology.
>
>         The Database Virtual Machine
>

+1

I actually like your term "GMachine", and I don't think it's a bad idea to
keep "graph" front and center. Yes, TP4 shall have the flexibility to
interoperate with a variety of non-graph databases, but what it adds is a
unifying graph abstraction.


Josh
