Hi,

One of the things I would like to do for TP4 is generalize the data structure 
out of graph. What does that mean?

Originally, I was thinking that we would have different Traversals for property 
graph, RDF, RDBMS, document, etc. and thus, different instructions:

        Property graph: V(), out(), inE(), etc.
        RDF: T(), subject(), predicate(), object(), out(), in(), etc.
        RDBMS: R(), join(), etc.
        Document: D(), list(), etc.

I started down this path and realized that we should categorize all our 
non-primitive objects using the following *Like interfaces:

PairLike: first(), second()
ListLike: get(int), add(int)
MapLike: keys(), values(), get(object)
VertexLike extends MapLike: outEdges(), inEdges()
EdgeLike extends MapLike: inVertex(), outVertex()
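As a rough sketch of what that hierarchy might look like in Java (names and generics are illustrative, not a committed TP4 API), with an RDBMS row thrown in to show that a "row" is nothing more than a MapLike:

```java
import java.util.Map;

// Hypothetical *Like hierarchy -- signatures are illustrative only.
interface PairLike<A, B> {
    A first();
    B second();
}

interface ListLike<T> {
    T get(int index);
    void add(int index, T item);
}

interface MapLike<K, V> {
    Iterable<K> keys();
    Iterable<V> values();
    V get(K key);
}

// Vertices and edges are maps with incidence methods layered on top.
interface VertexLike<K, V> extends MapLike<K, V> {
    Iterable<EdgeLike<K, V>> outEdges();
    Iterable<EdgeLike<K, V>> inEdges();
}

interface EdgeLike<K, V> extends MapLike<K, V> {
    VertexLike<K, V> outVertex();
    VertexLike<K, V> inVertex();
}

// An RDBMS row is then just a MapLike backed by a java.util.Map.
final class Row implements MapLike<String, Object> {
    private final Map<String, Object> data;
    Row(Map<String, Object> data) { this.data = data; }
    public Iterable<String> keys() { return data.keySet(); }
    public Iterable<Object> values() { return data.values(); }
    public Object get(String key) { return data.get(key); }
}
```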

…and then I started thinking: perhaps we just keep the same TP3 Gremlin-steps 
that we are used to, but add 3 more:

R(table)      -> MapLike (“rows”)
D(collection) -> MapLike (“documents”)
join()        -> MapLike

For property graphs, everything is as expected:

g.V() -> VertexLike
g.V().outE() -> EdgeLike
g.V().properties() -> PairLike  
g.V().keys() -> ListLike
g.V().values() -> ListLike

For RDF, literals are vertices and thus:
        
g.V() -> VertexLike
g.V().out('foaf:name').id() -> XSD object (i.e. a primitive object)
g.V().out('foaf:knows').id() -> URI object
g.V().outE() -> EdgeLike
g.V().outE().value('namedGraph') -> URI (the only edge property is the named 
graph of the triple)
        
For RDBMS, rows are just maps (or, thought of another way, vertices without 
edges):

g.R(table) -> MapLike
g.R(table).join(V(table)).by(key) -> MapLike
g.R(table).properties() -> PairLike     
g.R(table).keys() -> ListLike
g.R(table).values() -> ListLike
g.R(table).outE() -> // throws Exception

For document databases, documents are just nested maps:

g.D(collection) -> MapLike
g.D(collection).value(key) -> MapLike | ListLike | Object
g.D(collection).value(listKey).item(int) -> MapLike | ListLike | Object
g.D(collection).properties() -> PairLike
g.D(collection).keys() -> ListLike
g.D(collection).values() -> ListLike
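To make the "nested maps" point concrete, here is an illustrative sketch (document contents and field names are made up) of how value(key) and item(int) would bottom out in a MapLike, a ListLike, or a primitive, using plain java.util maps and lists:

```java
import java.util.List;
import java.util.Map;

// Illustrative only: a "document" modeled as nested java.util maps/lists.
class DocumentSketch {
    static final Map<String, Object> document = Map.of(
        "name", "marko",
        "addresses", List.of(
            Map.of("city", "Santa Fe"),
            Map.of("city", "Los Alamos")));

    // value(key): descend one level -- may yield a map, a list, or a primitive.
    static Object value(Map<String, Object> doc, String key) {
        return doc.get(key);
    }

    // item(int): index into a ListLike value.
    @SuppressWarnings("unchecked")
    static Object item(Object list, int i) {
        return ((List<Object>) list).get(i);
    }
}
```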

And thus, the only real novel addition would be a join()-step which would be 
generally useful outside of RDBMS. However, within RDBMS, it will most likely 
be strategized to an SQL JOIN query.

g.R("people").join(R("addresses")).by("ssn") -> MapLike
        ==strategizes to==>
SELECT * FROM people, addresses 
  WHERE people.ssn = addresses.ssn

g.R("people").join(R("addresses")).by("ssn").select("name","city") -> MapLike
        ==strategizes to==>
SELECT people.name, addresses.city 
  FROM people, addresses WHERE people.ssn = addresses.ssn

A JDBC provider could also supply a provider-specific instruction that replaces 
the g.R().join().etc. pattern outright: [sql,SELECT *...]

Note that most data processors support join:

        Beam: 
https://beam.apache.org/releases/javadoc/2.1.0/org/apache/beam/sdk/extensions/joinlibrary/Join.html
        Spark: 
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins.html

…we would simply need to add in-memory join functionality to Pipes. That would 
be fun to implement and could be very useful for graph users who might want to 
join vertices on properties! (who knows! .. why not?!)
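For what it's worth, the in-memory fallback could be as simple as a hash join over map-shaped rows. A minimal sketch, reusing the hypothetical "people"/"addresses"/"ssn" names from the example above:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal in-memory hash join over map-shaped rows -- the sort of fallback
// Pipes could use when no provider strategy (e.g. SQL JOIN) applies.
class InMemoryJoin {
    // join(left, right).by(key): merge every left/right pair agreeing on key.
    static List<Map<String, Object>> join(List<Map<String, Object>> left,
                                          List<Map<String, Object>> right,
                                          String key) {
        // Build a hash index over the right side, keyed by the join value.
        Map<Object, List<Map<String, Object>>> index = new HashMap<>();
        for (Map<String, Object> row : right)
            index.computeIfAbsent(row.get(key), k -> new ArrayList<>()).add(row);
        // Probe with the left side and merge each matching pair.
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> row : left)
            for (Map<String, Object> match : index.getOrDefault(row.get(key), List.of())) {
                Map<String, Object> merged = new HashMap<>(row);
                merged.putAll(match);
                result.add(merged);
            }
        return result;
    }
}
```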

————

Finally, we can also add some other interesting *Like interfaces:

ReferenceLike: location() // not sure what this returns yet, but some universal 
IP-like address thing

The point here is that providers can ensure that the objects they create in 
TinkerPop can refer back to an object in the provider's data source. This will 
be important with distributed processors like Akka, where you want to make sure 
you always manipulate ReferenceLike objects on the same machine that 
ReferenceLike.location() points to. In other words, query routing.

————

In summary, the idea is that we map all other data structures to *Like instead 
of adding a bunch of new steps and having to figure out how to “import” 
Traversal languages. The simple use of *Like interfaces will enable us to 
encapsulate more data structures and apply Gremlin beyond graph. Now you can 
start to imagine ETL use cases. For instance, RDBMS -> Graph:

g.R("people").select("name","age").
  addV().property("name",select("name")).
         property("age",select("age"))

...what about going Graph -> RDBMS?

g.V().hasLabel("person").select("name","age").
  addR().property("name",select("name")).
         property("age",select("age"))

Thoughts?
Marko.

http://markorodriguez.com
