Hey Josh (others),

I was thinking about our recent divergence in thought. I thought it would be 
smart for me to summarize where we are, to do my best to describe your model so 
as to better understand your perspective, and to help you better understand how 
your model will ultimately execute on the TP4 VM.

##########################
# WHY A UNIVERSAL MODEL? #
##########################

Every database data model can be losslessly embedded in every other database 
data model.
        - e.g. you can embed a property graph structure in a relational 
structure.
        - e.g. you can embed a document structure in a property graph structure.
        - e.g. you can embed a wide-column structure in a document structure.
        - …
        - e.g. you can embed a property graph structure in a Hadoop sequence 
file or Spark RDD.
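As a concrete (purely illustrative) sketch in Python — not TP4 code — the first 
embedding might look like this, with nothing in the graph lost in the 
relational encoding:

```python
# A tiny property graph: marko -knows-> vadas.
graph = {
    "vertices": [
        {"id": 1, "label": "person", "name": "marko", "age": 29},
        {"id": 2, "label": "person", "name": "vadas", "age": 27},
    ],
    "edges": [
        {"id": 7, "label": "knows", "outV": 1, "inV": 2, "weight": 0.5},
    ],
}

# The same graph embedded in a relational structure: one table per label,
# with edge endpoints becoming foreign keys. No information is lost.
tables = {
    "person": [{"id": v["id"], "name": v["name"], "age": v["age"]}
               for v in graph["vertices"] if v["label"] == "person"],
    "knows": [{"id": e["id"], "outV": e["outV"], "inV": e["inV"],
               "weight": e["weight"]}
              for e in graph["edges"] if e["label"] == "knows"],
}
assert len(tables["person"]) == 2 and tables["knows"][0]["outV"] == 1
```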

Thus, there exists a data model that can describe these database structures in 
a database agnostic manner.
        - not in terms of tables, vertices, JSON, column families, …

While we call this a “universal model” it is NOT more “general” (theoretically 
powerful) than any other database structure.

Reasons for creating a “universal model”:

        1. To have a reduced set of objects for the TP4 VM to consider.
                - edges are just vertices with one incoming and outgoing “edge.”
                - a column family is just a “map” of rows which are just “maps.”
                - tables are just groupings of schema-equivalent rows.
                - …
        2. To have a limited set of instructions in the TP4 bytecode 
specification.
                - outE/inE/outV/inV are just following direct “links” between 
objects.
                - has(), values(), keys(), valueMap(), etc. need not just apply 
to vertices and edges.
                - …
        3. To have a simple serialization format.
                - we do not want to ship around 
rows/vertices/edges/documents/columns/etc.
                - we want to make it easy for other languages to integrate with 
the TP4 VM.
                - we want to make it easy to create TP4 VMs in other languages.
                - ...
        4. To have a theoretical understanding of the relationship between the 
various data structures.
                - “this is just a that” is useful to limit the complexities of 
our codebase and explain to the public how different databases relate.
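A few of these reductions can be sketched in plain Python (illustrative only, 
not TP4 code):

```python
# A column family is just a "map" of rows, which are just "maps."
column_family = {
    "row1": {"name": "marko", "age": 29},
    "row2": {"name": "vadas", "age": 27},
}
assert column_family["row1"]["name"] == "marko"

# An edge is just an object with one outgoing and one incoming "link."
marko = {"#label": "person", "name": "marko"}
vadas = {"#label": "person", "name": "vadas"}
knows = {"#label": "knows", "#outV": marko, "#inV": vadas, "weight": 0.5}
assert knows["#inV"]["name"] == "vadas"

# A table is just a grouping of schema-equivalent rows.
table = [row for row in column_family.values() if set(row) == {"name", "age"}]
assert len(table) == 2
```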

Without further ado...

#######################
# THE UNIVERSAL MODEL #
#######################

*** This is as I understand it. I will let Josh decide whether I captured his 
ideas correctly. ***
*** All subsequent x().y().z() expressions are BYTECODE, not GREMLIN (just 
using an easier syntax than [op,arg*]*). ***

The objects:
        1. primitives: floats, doubles, Strings, ints, etc.
        2. tuples: key’d collections of primitives. (instances)
        3. relations: groupings of tuples with ?equivalent? schemas. (types)

The instructions:
        1. relations can be “queried” for matching tuples.
        2. tuple values can be projected out to yield primitives.

Let’s do a “traversal” from marko to the people he knows.

// g.V().has(‘name’,’marko’).outE(‘knows’).inV().values(‘name’)

db(‘person’).has(‘name’,’marko’).as(‘x’).
db(‘knows’).has(‘#outV’, path(‘x’).by(‘#id’)).as(‘y’).
db(‘person’).has(‘#id’, path(‘y’).by(‘#inV’)).
  values(‘name’)

While the above is a single stream of processing, I will state what each line 
above yields at that point in the stream.
        - [#label:person,name:marko,age:29]
        - [#label:knows,#outV:1,#inV:2,weight:0.5], ...
        - [#label:person,name:vadas,age:27], ...
        - vadas, ...
Database strategies can be smart enough to realize that only the #id or #inV 
or #outV of the previous object is required and thus limit what is actually 
accessed and flowed through the processing engine.
        - [#id:1]
        - [#id:0,#inV:2] …
        - [#id:2,name:vadas] …
        - vadas, ...
*** More on such compiler optimizations (called strategies) later ***
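Here is a minimal Python sketch of how such a db()/has()/values() evaluation 
could work over key’d tuples. The function names mirror the bytecode above, 
but the implementation is a naive list filter, not the actual TP4 VM:

```python
# Relations: groupings of key'd tuples (here, lists of dicts).
relations = {
    "person": [
        {"#id": 1, "#label": "person", "name": "marko", "age": 29},
        {"#id": 2, "#label": "person", "name": "vadas", "age": 27},
    ],
    "knows": [
        {"#id": 7, "#label": "knows", "#outV": 1, "#inV": 2, "weight": 0.5},
    ],
}

def db(name):
    # query a relation for all of its tuples
    return list(relations[name])

def has(tuples, key, value):
    # pattern-match filter: keep tuples whose key equals the value
    return [t for t in tuples if t.get(key) == value]

def values(tuples, key):
    # projection: pull a primitive out of each tuple
    return [t[key] for t in tuples]

x = has(db("person"), "name", "marko")                       # marko's tuple
y = [e for t in x for e in has(db("knows"), "#outV", t["#id"])]  # knows-edges
z = [p for e in y for p in has(db("person"), "#id", e["#inV"])]  # in-vertices
print(values(z, "name"))  # ['vadas']
```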

POSITIVE NOTES:

        1. All relations are ‘siblings’ accessed via db().
                - There is no concept of nesting data. A very flat structure.
        2. All subsequent has()/where()/is()/etc.-filter steps after db() 
define the pattern match query.
                - It is completely up to the database to determine how to 
retrieve matching tuples.
                - For example: using indices, pointer chasing, linear scans w/ 
filter, etc.
        3. All subsequent map()/flatmap()/etc. steps are projections of data in 
the tuple.
                - The database returns key’d tuples composed of primitives.
                - Primitive data can be accessed and further processed. 
(projections)
        4. The bytecode describes a computation that is irrespective of the 
underlying database’s encoding of that structure.
                - Amazon Neptune, MySQL, Cassandra, Spark, Hadoop, Ignite, etc. 
can be fed the same bytecode and will yield the same result.
                - In other words, given the example above, all databases can 
now process property graph traversals.

NEGATIVE NOTES:

        1. Every database has to have a concept of grouping similar tuples.
        2. It implies an a priori definition of types (at least their existence 
and how to map data to them).
        3. It implies a particular type of data model even though it’s 
represented using the “universal model.”
                - the example above is a “property graph query” because of the 
#outV, #inV, etc. schema’d keys.
                - the above example is a “vertex/edge-labeled property graph 
query” because of the ‘person’ and ‘knows’ relations.
                - the above example implies that keys are unique to relations. 
(e.g. name=marko — why db(‘person’)?)
                        - though db().has(‘name’,’marko’) can be used to search 
all relations.
        4. It requires the use of path()-data.
                - though we could come up with an efficient traverser.last() 
which returns the previous object touched.
                - However, for multi-db() relation matches, as().path() will 
have to be used.
                        - This can be optimized out by property graph databases 
as they support pointer chasing. (** more on this later **)

We can relax “a priori” typing to enable ‘name’=‘marko’ to be in any relation 
group, not just person relations. Also, let’s use the concept of last() from 
(4). 

// g.V().has(‘name’,’marko’).outE(‘knows’).inV().values(‘name’)

db(‘vertices’).has(‘name’,’marko’).
db(‘edges’).has(‘#label’,’knows’).has(‘#outV’, last().by(‘#id’)).
db(‘vertices’).has(‘#label’,’person’).has(‘#id’, 
last().by(‘#inV’)).values(‘name’)

We can make typing completely dynamic and thus, relation groups don’t exist in 
the “universal model.” Thus, databases don’t even have to have a concept of 
groups of relations. However, databases can have relation groups via “indices” 
on #type, #type+#label, etc.

// g.V().has(‘name’,’marko’).outE(‘knows’).inV().values(‘name’)

db().has(’#type’,’vertex’).has(‘name’,’marko’).
db().has(‘#type’,’edge’).has(‘#label’,’knows’).has(‘#outV’, last().by(‘#id’)).
db().has(‘#type’,’vertex’).has(‘#label’,’person’).has(‘#id’, 
last().by(‘#inV’)).values(‘name’)

The above really states that we are dealing with a “vertex/edge-labeled 
property graph.” This is not bad, because we already had the problem of the 
existence of #inV/#outE/etc., so this isn’t any more limiting. Next, TP4 
bytecode is starting to look like SPARQL pattern matching. There are tuples and 
we are matching patterns where data in some tuple equals (or general predicate) 
data in another tuple, etc. The “universal model” is just a sequence of key’d 
tuples with variable keys and lengths! (like an n-tuple store).
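That fully dynamic version can be modeled as a single flat sequence of 
variable-length key’d tuples. A toy Python sketch (not TP4 code; the db() 
helper here collapses db().has().has()... into one pattern match):

```python
# One flat n-tuple store; "relation groups" exist only as #type/#label values.
tuples = [
    {"#type": "vertex", "#label": "person", "#id": 1, "name": "marko", "age": 29},
    {"#type": "vertex", "#label": "person", "#id": 2, "name": "vadas", "age": 27},
    {"#type": "edge", "#label": "knows", "#outV": 1, "#inV": 2, "weight": 0.5},
]

def db(pattern):
    # keep every tuple that matches all key/value pairs of the pattern
    return [t for t in tuples
            if all(t.get(k) == v for k, v in pattern.items())]

marko = db({"#type": "vertex", "name": "marko"})[0]
edge = db({"#type": "edge", "#label": "knows", "#outV": marko["#id"]})[0]
vadas = db({"#type": "vertex", "#id": edge["#inV"]})[0]
print(vadas["name"])  # vadas
```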

###########################################
# TP4 VM EXECUTION OF THE UNIVERSAL MODEL #
###########################################

All integrating database providers must support the “universal model” db() 
instruction. It’s easy to implement, but inefficient, because bytecode using 
that instruction requires a bunch of back-and-forth of data between the DB and 
the TP4 VM. Thus, TP4 will provide strategies to map db().filter()*-bytecode 
(i.e. universal model instructions) to instructions that respect their native 
structure.

Every database provider implements the TP4 interfaces that capture their 
native database encoding.
        - For example, RDBMS: 
https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdbms
        - For example, Property Graph: 
https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/graph
        - For example, RDF: 
https://github.com/apache/tinkerpop/tree/tp4/java/machine/machine-core/src/main/java/org/apache/tinkerpop/machine/structure/rdf
        - For example, Wide-Column…
        - For example, Document…
        - For example, HyperGraph…
        - etc.
TP4 will have lots of these interface packages (which will also include 
compiler strategies and instructions).
        
The db()-filter()* “universal model” bytecode is submitted to the TP4 VM. The 
TP4 VM looks at the integrated databases’ native structure (according to the 
interfaces it implements) and rewrites all db().filter()*-aspects of the 
submitted bytecode to a database-type specific instruction set that:
        1. respects the semantics of the underlying database encoding.
        2. respects the semantics of TP4’s stream processing (i.e. 
linear/nested functions)
For example, the previous “universal model" bytecode is rewritten for each 
database type as:

Property graphs:
        pg:V().has(‘name’,’marko’).pg:outE(‘knows’).pg:inV().values(‘name’)

RDBMS:
  rdbms:R(‘person’).has(‘name’,’marko’).
    join(rdbms:R(‘knows’)).by(’#id’,eq(‘#outV’)).
    join(rdbms:R(‘person’)).by(‘#inV’,eq(‘#id’)).values(‘name’)
        
RDF:
  rdf:T().has(‘p’,’rdf:type’).has(‘o’,’foaf:Person’).as(‘a’).
  rdf:T().has(‘s’,path(‘a’).by(‘s’)).has(‘p’,’foaf:name’).has(‘o’,’marko^^xsd:string’).
  rdf:T().has(‘s’,path(‘a’).by(‘s’)).has(‘p’,’#outE’).as(‘b’).
  rdf:T().has(‘s’,path(‘b’).by(‘o’)).has(‘p’,’rdf:type’).has(‘o’,’foaf:knows’).as(‘c’).
  rdf:T().has(‘s’,path(‘c’).by(‘o’)).has(‘p’,’#inV’).as(‘d’).
  rdf:T().has(‘s’,path(‘d’).by(‘o’)).has(‘p’,’foaf:name’).values(‘o’)

Next, TP4 will have strategies that can be generally applied to each 
database-type to further optimize the bytecode.

Property graphs:
        pg:V().has(‘name’,’marko’).pg:out(‘knows’).values(‘name’)

RDBMS:
        rdbms:sql(“SELECT name FROM person,knows,person WHERE p1.id=knows.inV 
…”)
        
RDF:
        rdf:sparql(“SELECT ?e WHERE { ?x rdf:type foaf:Person. ?x foaf:name 
marko^^xsd …”)
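Mechanically, such a strategy is just a bytecode-to-bytecode rewrite. A toy 
Python sketch follows: the instruction lists use the [op,arg*]* convention 
from above, but the rewrite rule itself (and the generated SQL skeleton) is 
made up purely for illustration:

```python
# Universal-model bytecode as [op, *args] lists, per the [op,arg*]* convention.
bytecode = [
    ["db", "person"], ["has", "name", "marko"],
    ["db", "knows"], ["has", "#outV", "last.#id"],
    ["db", "person"], ["has", "#id", "last.#inV"],
    ["values", "name"],
]

def rdbms_strategy(instructions):
    # Fold a leading run of db()/has() universal-model instructions into one
    # native sql() instruction; leave the remainder for TP4 stream processing.
    head, rest = [], list(instructions)
    while rest and rest[0][0] in ("db", "has"):
        head.append(rest.pop(0))
    if not head:
        return rest
    tables = [args[0] for op, *args in head if op == "db"]
    return [["sql", "SELECT * FROM " + ", ".join(tables) + " WHERE ..."]] + rest

rewritten = rdbms_strategy(bytecode)
print(rewritten)
# [['sql', 'SELECT * FROM person, knows, person WHERE ...'], ['values', 'name']]
```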

Finally, vendors can then apply their custom strategies. For instance, for 
JanusGraph:

jg:v-index(’name’,’marko’,grab(‘out-edges’)).jg:out(‘knows’,grab(‘in-vertex’,’name-property’)).values(‘name’)

* The “universal model” instruction set must be supported by every database 
type. [all databases]
* The database-type specific instructions (e.g. V(), sparql(), sql(), out(), 
etc.) are only required to be understood by databases that implement that type 
interface. [database class]
* All vendor-specific instructions (e.g. jg:v-index()) are only required to be 
understood by that particular database. [database instance]

NOTES:
        1. Optimizations such as sql(), sparql(), etc. are only for bytecode 
fragments that can be universally optimized for that particular class of 
databases.
        2. Results from sql(), sparql(), etc. can be subjected to further TP4 
stream processing via repeat(), union(), choose(), etc. etc.
                - unfortunately my running example wasn’t complex enough to 
capture this. :(
                - the more we can pull out of TP4 bytecode and put into sql(), 
sparql(), etc. the better.
                - however, some query languages don’t have the respective 
expressivity for all types of computations (e.g. looping, branching, etc.).
                        - in such situations, processing moves from DB to TP4 
to DB to TP4 accordingly.
        3. We have an algorithmic way of mapping databases.
                - The RDBMS query shows there is a “property graph” encoded in 
tables.
                - The RDF query shows that there is a “property graph” encoded 
in triples.

In summary:

        1. There is a universal model and a universal instruction set.
        2. Databases integrate with the TP4 VM via “native database 
type”-interfaces.
        3. Submitted universal bytecode is rewritten to a database-type 
specific bytecode that respects the native semantics of that database-type. 
[decoration strategies]
        4. TP4 can further strategize that bytecode to take advantage of 
optimizations that are universal to that database-type. [optimization 
strategies]
        5. The underlying database can further strategize that bytecode to take 
unique advantage of their custom optimization features. [provider strategies]

###############################
# WHY GO TO ALL THIS TROUBLE? #
###############################

The million dollar question:
        
        "Why would you want to encode an X data structure into a database that 
natively supports a Y data structure?”

Answer:
        1. It’s not just about databases, it’s about data formats in general.
                - The "universal model" allows database providers easy access 
to OLAP processors whose native structure differs from their own.
                        E.g. Spark RDDs, Hadoop SequenceFiles, Beam tuples, ...
        2. In some scenarios, a Y-database is better at processing an X-type 
data structure than the currently existing native X-databases.
                - E.g., JanusGraph is a successful graph database product that 
encodes a property graph in a wide-column store.
                        - JanusGraph provides graph sharding, distributed 
read/write from OLAP processing, high-concurrency, fault tolerance, global 
distribution, etc.
        3. Database providers can efficiently support other data structures 
that are simply "constrained versions" of their native structure. 
                - E.g., Amazon Neptune can support RDF even if their native 
structure is Property Graph.
                        - According to the “universal model,” RDF is a 
restriction on property graphs.
                                - RDF is just no properties and URI-based 
identifiers.
        4. “Agnostic” data(bases) such as Redis, Ignite, Spark, etc. can 
easily support common data structures and their respective development 
communities.
                - With TP4, vendors can expand their product offering into 
communities they are only tangentially aware of.
                        - E.g. Redis can immediately “jump into” the RDF space 
without having background knowledge of that space.
                        - E.g. Ignite can immediately “jump into” the property 
graph space...
                        - E.g. Spark can immediately “jump into” the document 
space…
        5. All TP4-enabled processors automatically work over all TP4-enabled 
databases.
                - JanusGraph gets dynamic query routing with Akka.
                - Amazon Neptune gets multi-threaded query execution with 
rxJava.
                - CosmosDB gets cluster-oriented OLAP query execution with Spark.
                - …
        6. Language designers that have compilers to TP4 bytecode can work with 
all supporting TP4 databases/processors.
                - Neo4j no longer has to convince vendors to implement Cypher.
                - Amazon doesn’t have to choose between Gremlin, SPARQL, 
Cypher, etc.
                        - Their customers can use their favorite language.
                                - Obviously, some languages are better at 
expressing certain computations than others (e.g. SQL over graphs is horrible).
                                - Some impedance mismatch issues can arise 
(e.g. RDF requires URIs for ids).
                - A plethora of new languages may emerge as designers don’t 
have to convince vendors to support them.
                        - Language designers only have to develop a compiler to 
TP4 bytecode.
                
And there you have it — I believe Apache TinkerPop is on the verge of offering 
a powerful new data(base) theory and technology.

        The Database Virtual Machine

Thanks for reading,
Marko.

http://rredux.com




> On Apr 30, 2019, at 4:47 PM, Marko Rodriguez <okramma...@gmail.com> wrote:
> 
> Hello,
> 
>> First, the "root". While we do need context for traversals, I don't think
>> there should be a distinct kind of root for each kind of structure. Once
>> again, select(), or operations derived from select() will work just fine.
> 
> So given your example below, “root” would be db in this case. 
> db is the reference to the structure as a whole.
> Within db, substructures exist. 
> Logically, this makes sense.
> For instance, a relational database’s references don’t leak outside the RDBMs 
> into other areas of your computer’s memory.
> And there is always one entry point into every structure — the connection. 
> And what does that connection point to:
>       vertices, keyspaces, databases, document collections, etc. 
> In other words, “roots.” (even the JVM has a “root” — it called the heap).
> 
>> Want the "person" table? db.select("person"). Want a sequence of vertices
>> with the label "person"? db.select("person"). What we are saying in either
>> case is "give me the 'person' relation. Don't project any specific fields;
>> just give me all the data". A relational DB and a property graph DB will
>> have different ways of supplying the relation, but in either case, it can
>> hide behind the same interface (TRelation?).
> 
> In your lexicon, for both RDBMS and graph:
>       db.select(‘person’) is saying, select the people table (which is 
> composed of a sequence of “person" rows)
>       db.select(‘person’) is saying, select the person vertices (which is 
> composed of a sequence of “person" vertices)
> …right off the bat you have the syntax-problem of people vs. person. Tables 
> are typically named the plural of the rows. That
> doesn’t exist in graph databases as there is just one vertex set (i.e. one 
> “table”).
> 
> In my lexicon (TP instructions)
>       db().values(‘people’) is saying, flatten out the person rows of the 
> people table.
>       V().has(label,’person’) is saying, flatten out the vertex objects of 
> the graph’s vertices and filter out non-person vertices.
> 
> Well, that is stupid, why not have the same syntax for both structures?
> Because they are different. There are no “person” relations in the classic 
> property graph (Neo4j 1.0). There are only vertex relations with a 
> label=person entry.
> In a relational database there are “person” relations and these are bundled 
> into disjoint tables (i.e. relation sets — and schema constrained).
> 
> The point I’m making is that instead of trying to fit all these data 
> structures into a strict type system that ultimately looks like
> a bunch of disjoint relational sets, lets mimic the vendor-specified 
> semantics. Lets take these systems at their face value
> and not try and “mathematize” them. If they are inconsistent and ugly, fine. 
> If we map them into another system that is mathematical
> and beautiful, great. However, every data structure, from Neo4j’s 
> representation for OLTP traversals
>  to that “same" data being OLAP processed as Spark RDDs or Hadoop
> SequenceFiles will all have their ‘oh shits’ (impedance mismatches) and that 
> is okay, as this is the reality we are trying to model!
> 
> Graph and RDBMs have two different data models (their unique worldview):
> 
> RDBMS:   Databases->Tables->Rows->Primitives
> GraphDB: Vertices->Edges->Vertices->Edges->Vertices-> ...
> 
> Here is a person->knows->person “traversal” in TP4 bytecode over an RDBMS 
> (#key are ’symbols’ (constants)):
> 
> db().values(“people”).as(“x”).
> db().values(“knows”).as(“y”).
>   where(“x”,eq(“y”)).by(#id).by(#outV).
> db().values(“people”).as(“z”).
>   where(“y”,eq(“z”)).by(#inV).by(#id)
>    
> Pretty freakin’ disgusting, eh? Here is a person->knows->person “traversal” 
> in TP4 bytecode over a property graph:
> 
> V().has(#label,”person”).values(#outE).has(#label,”knows”).values(#inV)
> 
> So we have two completely different bytecode representations for the same 
> computational result. Why?
> Because we have two completely different data models!
> 
>       One is a set of disjoint typed-relations (i.e. RDBMS).
>       One is a set of nested loosely-typed-relations (i.e. property graphs).
> 
> Why not make them the same? Because they are not the same and that is exactly 
> what I believe we should be capturing.
> 
> Just looking at the two computations above you see that a relational database 
> is doing “joins” while a graph database is doing “traversals”.
> We have to use path-data to compute a join. We have to use memory! (and we 
> do). We don’t have to use path-data to compute a traversal.
> We don’t have to use memory! (and we don’t!). That is the fundamental nature 
> of the respective computations that are taking place.
> That is what gives each system their particular style of computing.
> 
> NEXT: There is nothing that says you can’t map between the two. Let’s go 
> property graph to RDBMS.
>       - we could make a person table, a software table, a knows table, a 
> created table.
>               - that only works if the property graph is schema-based.
>       - we could make a single vertex table with another 3 column properties 
> table (vertexId,key,value)
>       - we could…
> Whichever encoding you choose, a different bytecode will be required. 
> Fortunately, the space of (reasonable) possibilities is constrained.
> Thus, instead of saying: 
>       “I want to map from property graph to RDBMS” 
> I say: 
>       “I want to map from a recursive, bi-relational structure to a disjoint 
> multi-relational structure where linkage is based on #id/#outV/#inV 
> equalities.”
> Now you have constrained the space of possible RDBMS encodings! Moreover, we 
> now have an algorithmic solution that not only disconnects “vertices,” 
> but also rewrites the bytecode according to the new logical steps required to 
> execute the computation as we have a new data structure and a new
> way of moving through that data structure. The pointers are completely 
> different! However, as long as the mapping is sound, the rewrite should be 
> algorithmic.
> 
> I’m getting tired. I see your stuff below about indices and I have thoughts 
> on that… but I will address those tomorrow.
> 
> Thanks for reading,
> Marko.
> 
> http://rredux.com
> 
> 
> 
> 
> 
> 
> 
>> 
>> But wait, you say, what if the under the hood, you have a TTable in one
>> case, and TSequence in the other? They are so different! That's why
>> the Dataflow
>> Model
>> <https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43864.pdf>
>> is so great; to an extent, you can think of the two as interchangeable. I
>> think we would get a lot of mileage out of treating them as interchangeable
>> within TP4.
>> 
>> So instead of a data model -specific "root", I argue for a universal root
>> together with a set of relations and what we might call an "indexes". An
>> index is an arrow from a type to a relation which says "give me a
>> column/value pair, and I will give you all matching tuples from this
>> relation". The result is another relation. Where data sources differentiate
>> themselves is by having different relations and indexes.
>> 
>> For example, if the underlying data structure is nothing but a stream of
>> Trip tuples, you will have a single relation "Trip", and no indexes. Sorry;
>> you just have to wait for tuples to go by, and filter on them. So if you
>> say d.select("Trip", "driver") -- where d is a traversal that gets you to a
>> User -- the machine knows that it can't use "driver" to look up a specific
>> set of trips; it has to use a filter over all future "Trip" tuples. If, on
>> the other hand, we have a relational database, we have the option of
>> indexing on "driver". In this case, d.select("Trip", "driver") may take you
>> to a specific table like "Trip_by_driver" which has "driver" as a primary
>> key. The machine recognizes that this index exists, and uses it to answer
>> the query more efficiently. The alternative is to do a full scan over any
>> table which contains the "Trip" relation. Since TinkerPop3, we have been
>> without a vendor-neutral API for indexes, but this is where such an API
>> would really start to shine. Consider Neo4j's single property indexes,
>> JanusGraph's composite indexes, and even RDF triple indices (spo, ops,
>> etc.) as in AllegroGraph in addition to primary keys in relational
>> databases.
>> 
>> TTuple -- cool. +1
>> 
>> "Enums" -- I agree that enums are necessary, but we need even more: tagged
>> unions <https://en.wikipedia.org/wiki/Tagged_union>. They are part of the
>> system of algebraic data types which I described on Friday. An enum is a
>> special case of a tagged union in which there is no value, just a type tag.
>> May I suggest something like TValue, which contains a value (possibly
>> trivial) together with a type tag. This enables ORs and pattern matching.
>> For example, suppose "created" edges are allowed to point to either
>> "Project" or "Document" vertices. The in-type of "created" is
>> union{project:Project, document:Document). Now the in value of a specific
>> edge can be TValue("project", [some project vertex]) or TValue("document",
>> [some document vertex]) and you have the freedom to switch on the type tag
>> if you want to, e.g. the next step in the traversal can give you the "name"
>> of the project or the "title" of the document as appropriate.
>> 
>> Multi-properties -- agreed; has() is good enough.
>> 
>> Meta-properties -- again, this is where I think we should have a
>> lower-level select() operation. Then has() builds on that operation.
>> Whereas select() matches on fields of a relation, has() matches on property
>> values and other higher-order things. If you want properties of properties,
>> don't use has(); use select()/from(). Most of the time, you will just want
>> to use has().
>> 
>> Agreed that every *entity* should have an id(), and also a label() (though
>> it should always be possible to infer label() from the context). I would
>> suggest TEntity (or TElement), which has id(), label(), and value(), where
>> value() provides the raw value (usually a TTuple) of the entity.
>> 
>> Josh
>> 
>> 
>> 
>> On Mon, Apr 29, 2019 at 10:35 AM Marko Rodriguez <okramma...@gmail.com>
>> wrote:
>> 
>>> Hello Josh,
>>> 
>>>> A has("age",29), for example, operates at a different level of
>>> abstraction than a
>>>> has("city","Santa Fe") if "city" is a column in an "addresses" table.
>>> 
>>> So hasXXX() operators work on TTuples. Thus:
>>> 
>>> g.V().hasLabel(‘person’).has(‘age’,29)
>>> g.V().hasLabel(‘address’).has(‘city’,’Santa Fe’)
>>> 
>>> ..both work as a person-vertex and an address-vertex are TTuples. If these
>>> were tables, then:
>>> 
>>> jdbc.db().values(‘people’).has(‘age’,29)
>>> jdbc.db().values(‘addresses’).has(‘city’,’Santa Fe’)
>>> 
>>> …also works as both people and addresses are TTables which extend
>>> TTuple<String,?>.
>>> 
>>> In summary, if it’s a TTuple, then hasXXX() is good to go.
>>> 
>>> ////////// IGNORE UNTIL AFTER READING NEXT SECTION //////////
>>> *** SIDENOTE: A TTable (which is a TSequence) could have Symbol-based
>>> metadata. Thus TTable.value(#label) -> “people.” If so, then
>>> jdbc.db().hasLabel(“people”).has(“age”,29)
>>> 
>>>> At least, they
>>>> are different if the data model allows for multi-properties,
>>>> meta-properties, and hyper-edges. A property is something that can either
>>>> be there, attached to an element, or not be there. There may also be more
>>>> than one such property, and it may have other properties attached to it.
>>> A
>>>> column of a table, on the other hand, is always there (even if its value
>>> is
>>>> allowed to be null), always has a single value, and cannot have further
>>>> properties attached.
>>> 
>>> 1. Multi-properties.
>>> 
>>> Multi-properties work because if name references a TSequence, then it’s
>>> the sequence that you analyze with has(). This is another reason why
>>> TSequence is important. It’s a reference to a “stream” so there isn’t
>>> another layer of tuple-nesting.
>>> 
>>> // assume v[1] has name={marko,mrodriguez,markor}
>>> g.V(1).value(‘name’) => TSequence<String>
>>> g.V(1).values(‘name’) => marko, mrodriguez, markor
>>> g.V(1).has(‘name’,’marko’) => v[1]
>>> 
>>> 2. Meta-properties
>>> 
>>> // assume v[1] has name=[value:marko,creator:josh,timestamp:12303] // i.e.
>>> a tuple value
>>> g.V(1).value(‘name’) => TTuple<?,String> // doh!
>>> g.V(1).value(‘name’).value(‘value’) => marko
>>> g.V(1).value(‘name’).value(‘creator’) => josh
>>> 
>>> So things get screwy. — however, it only gets screwy when you mix your
>>> “metadata” key/values with your “data” key/values. This is why I think
>>> TSymbols are important. Imagine the following meta-property tuple for v[1]:
>>> 
>>> [#value:marko,creator:josh,timestamp:12303]
>>> 
>>> If you do g.V(1).value(‘name’), we could look to the value indexed by the
>>> symbol #value, thus => “marko”.
>>> If you do g.V(1).values(‘name’), you would get back a TSequence with a
>>> single TTuple being the meta property.
>>> If you do g.V(1).values(‘name’).value(), we could get the value indexed by
>>> the symbol #value.
>>> If you do g.V(1).values(‘name’).value(‘creator’), it will return the
>>> primitive string “josh”.
>>> 
>>> I believe that the following symbols should be recommended for use across
>>> all data structures.
>>>        #id, #label, #key, #value
>>> …where id(), label(), key(), value() are tuple.get(Symbol). Other symbols
>>> for use with propertygraph/ include:
>>>        #outE, #inV, #inE, #outV, #bothE, #bothV
>>> 
>>>> In order to simplify user queries, you can let has() and values() do
>>> double
>>>> duty, but I still feel that there are lower-level operations at play, at
>>> a
>>>> logical level even if not at a bytecode level. However, expressing the a
>>>> traversal in terms of its lowest-level relational operations may also be
>>>> useful for query optimization.
>>> 
>>> One thing that I’m doing, that perhaps you haven’t caught onto yet, is
>>> that I’m not modeling everything in terms of “tables.” Each data structure
>>> is trying to stay as pure to its conceptual model as possible. Thus, there
>>> are no “joins” in property graphs as outE() references a TSequence<TEdge>,
>>> where TEdge is an interface that extends TTuple. You can just walk without
>>> doing any type of INNER JOIN. Now, if you model a property graph in a
>>> relational database, you will have to strategize the bytecode accordingly!
>>> Just a heads up in case you haven’t noticed that.
>>> 
>>> Thanks for your input,
>>> Marko.
>>> 
>>> http://rredux.com
>>> 
>>> 
>>> 
>>>> 
>>>> Josh
>>>> 
>>>> 
>>>> 
>>>> On Mon, Apr 29, 2019 at 7:34 AM Marko Rodriguez <okramma...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> *** This email is primarily for Josh (and Kuppitz). However, if others
>>>>> are interested… ***
>>>>> 
>>>>> So I did a lot of thinking this weekend about structure/ and this
>>>>> morning, I prototyped both graph/ and rdbms/.
>>>>> 
>>>>> This is the way I’m currently thinking of things:
>>>>> 
>>>>>       1. There are 4 base types in structure/.
>>>>>               - Primitive: string, long, float, int, … (will constrain
>>>>> these at some point).
>>>>>               - TTuple<K,V>: key/value map.
>>>>>               - TSequence<V>: an iterable of V objects.
>>>>>               - TSymbol: like Ruby, I think we need “enum-like” symbols
>>>>> (e.g., #id, #label).
>>>>> 
>>>>>       2. Every structure has a “root.”
>>>>>               - for graph, it's TGraph implements TSequence<TVertex>
>>>>>               - for rdbms, it's TDatabase implements
>>>>> TTuple<String,TTable>
>>>>> 
>>>>>       3. Roots implement Structure and thus, are what is generated by
>>>>> StructureFactory.mint().
>>>>>               - defined using withStructure().
>>>>>               - For graph, it's accessible via V().
>>>>>               - For rdbms, it's accessible via db().
>>>>> 
>>>>>       4. There is a list of core instructions for dealing with these
>>>>> base objects.
>>>>>               - value(K key): gets the TTuple value for the provided key.
>>>>>               - values(K key): gets an iterator of the values for the
>>>>> provided key.
>>>>>               - entries(): gets an iterator of T2Tuple objects for the
>>>>> incoming TTuple.
>>>>>               - hasXXX(A,B): various has()-based filters for looking
>>>>> into a TTuple and a TSequence.
>>>>>               - db()/V()/etc.: jump to the "root" of the withStructure()
>>>>> structure.
>>>>>               - drop()/add(): behave as one would expect.
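The base types and core instructions listed above can be sketched in plain Java. The interface names come from the email, but the method sets are my guesses at minimal contracts, not the actual TP4 API; symbols (#id, #label) are modeled as interned strings for brevity.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the structure/ base types.
public class StructureSketch {

    // TSequence<V>: an iterable of V objects.
    interface TSequence<V> extends Iterable<V> { }

    // TTuple<K,V>: a key/value map plus the core access instructions.
    interface TTuple<K, V> {
        V value(K key);                          // value(K key)
        Iterator<Map.Entry<K, V>> entries();     // entries()
    }

    // A trivial map-backed tuple, enough to exercise the contract.
    static class MapTuple<K, V> implements TTuple<K, V> {
        private final Map<K, V> data = new LinkedHashMap<>();

        MapTuple<K, V> put(K k, V v) { data.put(k, v); return this; }
        public V value(K key) { return data.get(key); }
        public Iterator<Map.Entry<K, V>> entries() { return data.entrySet().iterator(); }
    }

    public static void main(String[] args) {
        // A vertex is just a tuple whose reserved keys are symbols.
        TTuple<String, Object> vertex = new MapTuple<String, Object>()
                .put("#id", 1).put("#label", "person").put("name", "marko");
        System.out.println(vertex.value("name"));   // marko
        System.out.println(vertex.value("#label")); // person
    }
}
```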
>>>>> 
>>>>> ————
>>>>> 
>>>>> For RDBMS, we have three interfaces in rdbms/.
>>>>> (machine/machine-core/structure/rdbms)
>>>>> 
>>>>>       1. TDatabase implements TTuple<String,TTable> // the root
>>>>> structure that indexes the tables.
>>>>>       2. TTable implements TSequence<TRow<?>> // a table is a sequence
>>>>> of rows
>>>>>       3. TRow<V> implements TTuple<String,V> // a row has string column
>>>>> names
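The three rdbms/ interfaces can be layered on the base types with throwaway in-memory implementations. Everything below beyond the three interface names is illustrative, not TP4 code.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the rdbms/ interfaces over minimal base types.
public class RdbmsSketch {

    interface TTuple<K, V> { V value(K key); }
    interface TSequence<V> extends Iterable<V> { }

    interface TDatabase extends TTuple<String, TTable> { } // root: tables by name
    interface TTable extends TSequence<TRow<Object>> { }   // a sequence of rows
    interface TRow<V> extends TTuple<String, V> { }        // string column names

    static class MemRow implements TRow<Object> {
        final Map<String, Object> cols = new LinkedHashMap<>();
        MemRow col(String name, Object v) { cols.put(name, v); return this; }
        public Object value(String key) { return cols.get(key); }
    }

    static class MemTable implements TTable {
        final List<TRow<Object>> rows;
        MemTable(List<TRow<Object>> rows) { this.rows = rows; }
        public Iterator<TRow<Object>> iterator() { return rows.iterator(); }
    }

    static class MemDatabase implements TDatabase {
        final Map<String, TTable> tables = new LinkedHashMap<>();
        public TTable value(String key) { return tables.get(key); }
    }

    public static void main(String[] args) {
        MemDatabase db = new MemDatabase();
        db.tables.put("people", new MemTable(List.of(
                new MemRow().col("NAME", "marko").col("AGE", 29),
                new MemRow().col("NAME", "josh").col("AGE", 32))));
        // db().value("people") yields the table; iterate rows, project NAME.
        for (TRow<Object> row : db.value("people"))
            System.out.println(row.value("NAME")); // marko, then josh
    }
}
```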
>>>>> 
>>>>> I then created a new project at machine/structure/jdbc. The classes in
>>>>> there implement the above rdbms/ interfaces.
>>>>> 
>>>>> Here is an RDBMS session:
>>>>> 
>>>>> final Machine machine = LocalMachine.open();
>>>>> final TraversalSource jdbc =
>>>>>       Gremlin.traversal(machine).
>>>>>                       withProcessor(PipesProcessor.class).
>>>>>                       withStructure(JDBCStructure.class,
>>>>> Map.of(JDBCStructure.JDBC_CONNECTION, "jdbc:h2:/tmp/test"));
>>>>> 
>>>>> System.out.println(jdbc.db().toList());
>>>>> System.out.println(jdbc.db().entries().toList());
>>>>> System.out.println(jdbc.db().value("people").toList());
>>>>> System.out.println(jdbc.db().values("people").toList());
>>>>> System.out.println(jdbc.db().values("people").value("name").toList());
>>>>> System.out.println(jdbc.db().values("people").entries().toList());
>>>>> 
>>>>> This yields:
>>>>> 
>>>>> [<database#conn1: url=jdbc:h2:/tmp/test user=>]
>>>>> [PEOPLE:<table#PEOPLE>]
>>>>> [<table#people>]
>>>>> [<row#PEOPLE:1>, <row#PEOPLE:2>]
>>>>> [marko, josh]
>>>>> [NAME:marko, AGE:29, NAME:josh, AGE:32]
>>>>> 
>>>>> The bytecode of the last query is:
>>>>> 
>>>>> [db(<database#conn1: url=jdbc:h2:/tmp/test user=>), values(people),
>>>>> entries]
>>>>> 
>>>>> JDBCDatabase implements TDatabase, Structure.
>>>>>       *** JDBCDatabase is the root structure and is referenced by db()
>>>>> *** (CRUCIAL POINT)
>>>>> 
>>>>> Assume another table called ADDRESSES with two columns: name and city.
>>>>> 
>>>>> 
>>>>> 
>>> jdbc.db().values("people").as("x").db().values("addresses").has("name",eq(path("x").by("name"))).value("city")
>>>>> 
>>>>> The above is equivalent to:
>>>>> 
>>>>> SELECT city FROM people,addresses WHERE people.name=addresses.name
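Evaluated depth-first, the traversal above amounts to a nested-loop join: for each "people" row bound to x, scan "addresses" and keep the city where the names match. A plain-Java rendering of those semantics follows; the table contents are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Nested-loop sketch of the join semantics of the traversal above.
public class JoinSketch {

    static List<Object> cities(List<Map<String, Object>> people,
                               List<Map<String, Object>> addresses) {
        List<Object> out = new ArrayList<>();
        for (Map<String, Object> x : people)               // values("people").as("x")
            for (Map<String, Object> a : addresses)        // db().values("addresses")
                if (a.get("name").equals(x.get("name")))   // has("name",eq(path("x").by("name")))
                    out.add(a.get("city"));                // value("city")
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> people = List.of(
                Map.of("name", "marko"), Map.of("name", "josh"));
        List<Map<String, Object>> addresses = List.of(
                Map.of("name", "marko", "city", "santa fe"),
                Map.of("name", "josh", "city", "seattle"));
        System.out.println(cities(people, addresses)); // [santa fe, seattle]
    }
}
```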
>>>>> 
>>>>> If you want to do an inner join (a product), you do this:
>>>>> 
>>>>> 
>>>>> 
>>> jdbc.db().values("people").as("x").db().values("addresses").has("name",eq(path("x").by("name"))).as("y").path("x","y")
>>>>> 
>>>>> The above is equivalent to:
>>>>> 
>>>>> SELECT * FROM addresses INNER JOIN people ON people.name=addresses.name
>>>>> 
>>>>> NOTES:
>>>>>       1. Instead of select(), we simply jump to the root via db() (or
>>>>> V() for graph).
>>>>>       2. Instead of project(), we simply use value() or values().
>>>>>       3. Instead of select() being overloaded with by() join syntax, we
>>>>> use has() and path().
>>>>>               - like TP3, we will be smart about dropping path() data
>>>>> once it's no longer referenced.
>>>>>       4. We can also do LEFT and RIGHT JOINs (haven’t thought through
>>>>> FULL OUTER JOIN yet).
>>>>>               - however, we don’t support 'null' in TP, so I don’t know
>>>>> if we want to support these null-producing joins.
>>>>> 
>>>>> LEFT JOIN:
>>>>>       * If an address doesn’t exist for the person, emit a “null”-filled
>>>>> path.
>>>>> 
>>>>> jdbc.db().values("people").as("x").
>>>>> db().values("addresses").as("y").
>>>>>   choose(has("name",eq(path("x").by("name"))),
>>>>>     identity(),
>>>>>     path("y").by(null).as("y")).
>>>>> path("x","y")
>>>>> 
>>>>> SELECT * FROM addresses LEFT JOIN people ON people.name=addresses.name
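The null-filling choose() pattern above has the usual LEFT JOIN semantics: every "people" row survives, and when no address matches, the address side of the path is filled with null. A plain-Java sketch of those semantics, with made-up data and helper names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Nested-loop sketch of the LEFT JOIN / null-fill semantics.
public class LeftJoinSketch {

    // Returns [personName, cityOrNull] pairs, i.e., the path("x","y") result.
    static List<Object[]> leftJoin(List<Map<String, Object>> people,
                                   List<Map<String, Object>> addresses) {
        List<Object[]> paths = new ArrayList<>();
        for (Map<String, Object> x : people) {
            boolean matched = false;
            for (Map<String, Object> y : addresses)
                if (y.get("name").equals(x.get("name"))) {  // the choose() predicate
                    paths.add(new Object[]{x.get("name"), y.get("city")});
                    matched = true;
                }
            if (!matched)                                   // path("y").by(null)
                paths.add(new Object[]{x.get("name"), null});
        }
        return paths;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> people = List.of(
                Map.of("name", "marko"), Map.of("name", "peter"));
        List<Map<String, Object>> addresses = List.of(
                Map.of("name", "marko", "city", "santa fe"));
        for (Object[] p : leftJoin(people, addresses))
            System.out.println(p[0] + " -> " + p[1]); // marko -> santa fe, peter -> null
    }
}
```

Swapping which side gets null-filled gives the RIGHT JOIN variant shown next.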
>>>>> 
>>>>> RIGHT JOIN:
>>>>> 
>>>>> jdbc.db().values("people").as("x").
>>>>> db().values("addresses").as("y").
>>>>>   choose(has("name",eq(path("x").by("name"))),
>>>>>     identity(),
>>>>>     path("x").by(null).as("x")).
>>>>> path("x","y")
>>>>> 
>>>>> 
>>>>> SUMMARY:
>>>>> 
>>>>> There are no “low level” instructions. Everything is based on the
>>>>> standard instructions that we know and love. Finally, if not apparent,
>>>>> the above
>>>>> bytecode chunks would ultimately get strategized into a single SQL query
>>>>> (breadth-first) instead of one-off queries (depth-first) to improve
>>>>> performance.
>>>>> 
>>>>> Neat?,
>>>>> Marko.
>>>>> 
>>>>> http://rredux.com