Hi,
One of the things I would like to do for TP4 is generalize the data structure
out of graph. What does that mean?
Originally, I was thinking that we would have different Traversals for property
graph, RDF, RDBMS, document, etc. and thus, different instructions:
Property graph: V(), out(), inE(), etc.
RDF: T(), subject(), predicate(), object(), out(), in(), etc.
RDBMS: R(), join(), etc.
Document: D(), list(), etc.
I started down this path and realized that we should categorize all our
non-primitive objects using the following *Like interfaces:
PairLike: first(), second()
ListLike: get(int), add(int)
MapLike: keys(), values(), get(object)
VertexLike extends MapLike: outEdges(), inEdges()
EdgeLike extends MapLike: inVertex(), outVertex()
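To make the list above concrete, here is a rough Java sketch of how the *Like interfaces might look. The generics, the return types, and the Row class are my assumptions for illustration, not a settled TP4 API. ListLike's add() is left out because its argument type is not pinned down above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: names follow the list above, signatures are assumptions.
final class LikeSketch {
    interface PairLike<A, B> { A first(); B second(); }

    interface ListLike<T> { T get(int index); }

    interface MapLike<K, V> {
        Iterable<K> keys();
        Iterable<V> values();
        V get(Object key);
    }

    interface VertexLike<K, V> extends MapLike<K, V> {
        Iterable<? extends EdgeLike<K, V>> outEdges();
        Iterable<? extends EdgeLike<K, V>> inEdges();
    }

    interface EdgeLike<K, V> extends MapLike<K, V> {
        VertexLike<K, V> inVertex();
        VertexLike<K, V> outVertex();
    }

    // An RDBMS row (or a flat document) is just a MapLike over a java.util.Map.
    static final class Row implements MapLike<String, Object> {
        private final Map<String, Object> data;

        Row(Map<String, Object> data) {
            this.data = new LinkedHashMap<>(data); // defensive copy
        }

        @Override public Iterable<String> keys() { return data.keySet(); }
        @Override public Iterable<Object> values() { return data.values(); }
        @Override public Object get(Object key) { return data.get(key); }
    }
}
```

A provider would implement only the interfaces its data model supports; a Row has no edges, so it stops at MapLike.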
…and then I started thinking, perhaps we just keep the same TP3 Gremlin-steps
that we are used to, but just add 3 more:
R(table) -> MapLike (“rows”)
D(collection) -> MapLike (“documents”)
join() -> MapLike
For property graphs, everything is as expected:
g.V() -> VertexLike
g.V().outE() -> EdgeLike
g.V().properties() -> PairLike
g.V().keys() -> ListLike
g.V().values() -> ListLike
For RDF, literals are vertices and thus:
g.V() -> VertexLike
g.V().out('foaf:name').id() -> XSD object (i.e. primitive object)
g.V().out('foaf:knows').id() -> URI object
g.V().outE() -> EdgeLike
g.V().outE().value(‘namedGraph’) -> URI (the only edge properties are the named
graph of the triple)
For RDBMS, rows are just maps (or, thought of another way, vertices without
edges):
g.R(table) -> MapLike
g.R(table).join(R(table)).by(key) -> MapLike
g.R(table).properties() -> PairLike
g.R(table).keys() -> ListLike
g.R(table).values() -> ListLike
g.R(table).outE() -> // throws Exception
For document databases, documents are just nested maps:
g.D(collection) -> MapLike
g.D(collection).value(key) -> MapLike | ListLike | Object
g.D(collection).value(listKey).item(int) -> MapLike | ListLike | Object
g.D(collection).properties() -> PairLike
g.D(collection).keys() -> ListLike
g.D(collection).values() -> ListLike
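As a rough illustration of the "documents are just nested maps" point, here is a small Java sketch of value(key) and item(int) over plain java.util collections. DocumentSketch and its two static methods are hypothetical stand-ins for the steps above, not existing TinkerPop code.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a document is a nested Map/List structure.
final class DocumentSketch {
    // value(key): MapLike -> MapLike | ListLike | Object
    static Object value(Object doc, String key) {
        if (!(doc instanceof Map)) {
            throw new IllegalArgumentException("not MapLike: " + doc);
        }
        return ((Map<?, ?>) doc).get(key);
    }

    // item(int): ListLike -> MapLike | ListLike | Object
    static Object item(Object list, int index) {
        if (!(list instanceof List)) {
            throw new IllegalArgumentException("not ListLike: " + list);
        }
        return ((List<?>) list).get(index);
    }

    public static void main(String[] args) {
        Map<String, Object> doc = Map.<String, Object>of(
                "name", "marko",
                "langs", List.of("gremlin", "java"));
        // Mirrors g.D(collection).value(listKey).item(int)
        System.out.println(item(value(doc, "langs"), 1));
    }
}
```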
And thus, the only real novel addition would be a join()-step which would be
generally useful outside of RDBMS. However, within RDBMS, it will most likely
be strategized to an SQL JOIN query.
g.R("people").join(R("addresses")).by("ssn") -> MapLike
==strategizes to==>
SELECT * FROM people, addresses
WHERE people.ssn = addresses.ssn
g.R("people").join(R("addresses")).by("ssn").select("name","city") -> MapLike
==strategizes to==>
SELECT people.name, addresses.city
FROM people, addresses WHERE people.ssn = addresses.ssn
The JDBC provider-specific instruction that replaces the g.R().join()... steps
would then be: [sql, SELECT *...]
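To sketch what such a strategy might do, here is a minimal Java example that turns the join description into the SQL strings shown above. SqlJoinSketch and joinToSql are hypothetical names; a real strategy would work on bytecode instructions, would need the schema to qualify selected columns, and would have to escape identifiers, none of which is attempted here.

```java
import java.util.List;

// Hypothetical sketch of strategizing R(left).join(R(right)).by(key) to SQL.
final class SqlJoinSketch {
    // columns must already be table-qualified (e.g. "people.name");
    // an empty list means SELECT *.
    static String joinToSql(String left, String right, String key, List<String> columns) {
        String projection = columns.isEmpty() ? "*" : String.join(", ", columns);
        return "SELECT " + projection
                + " FROM " + left + ", " + right
                + " WHERE " + left + "." + key + " = " + right + "." + key;
    }

    public static void main(String[] args) {
        System.out.println(joinToSql("people", "addresses", "ssn", List.of()));
        System.out.println(joinToSql("people", "addresses", "ssn",
                List.of("people.name", "addresses.city")));
    }
}
```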
Note that most data processors support join:
Beam:
https://beam.apache.org/releases/javadoc/2.1.0/org/apache/beam/sdk/extensions/joinlibrary/Join.html
Spark:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins.html
…we would simply need to add in-memory join functionality to Pipes. That would
be fun to implement and could be very useful for graph users that might want to
join vertices on properties! (who knows! .. why not?!)
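For a feel of what an in-memory join could look like, here is a minimal hash-join sketch in Java over plain maps. This is only an illustration under my own assumptions (an inner equi-join on one key, rows as java.util.Map, right-side values winning on key collisions); HashJoinSketch is a hypothetical name and not how Pipes would necessarily implement it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an in-memory inner equi-join on a single key.
final class HashJoinSketch {
    static List<Map<String, Object>> join(List<Map<String, Object>> left,
                                          List<Map<String, Object>> right,
                                          String key) {
        // Build phase: index the right side by its key value.
        Map<Object, List<Map<String, Object>>> index = new HashMap<>();
        for (Map<String, Object> r : right) {
            Object k = r.get(key);
            if (k != null) {
                index.computeIfAbsent(k, x -> new ArrayList<>()).add(r);
            }
        }
        // Probe phase: look up each left row and merge the matches.
        List<Map<String, Object>> joined = new ArrayList<>();
        for (Map<String, Object> l : left) {
            Object k = l.get(key);
            if (k == null) continue; // rows without the key never match
            for (Map<String, Object> r : index.getOrDefault(k, Collections.emptyList())) {
                Map<String, Object> merged = new LinkedHashMap<>(l);
                merged.putAll(r); // right-side values win on collisions
                joined.add(merged);
            }
        }
        return joined;
    }
}
```

The same routine would work for vertices joined on a property, since a VertexLike is also a MapLike.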
————
Finally, we can also add some other interesting *Like interfaces:
ReferenceLike: location() // not sure what this returns yet, but some universal
IP-like address thing
The point here is that providers can ensure that the objects they create in
TinkerPop can refer back to an object in the provider’s data source. This will
be important with distributed processors like Akka, where you want to make sure
you always manipulate ReferenceLike objects on the same machine as the
ReferenceLike.location(). In other words, query routing.
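A tiny Java sketch of the routing idea, assuming (purely for illustration) that location() returns a String host identifier. As noted above, the real return type is undecided, and RoutingSketch and groupByLocation are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: batch references by the machine that owns them.
final class RoutingSketch {
    interface ReferenceLike {
        String location(); // assumption: a host identifier; real type is undecided
    }

    // Groups references so each batch can be sent to its own location.
    static <T extends ReferenceLike> Map<String, List<T>> groupByLocation(List<T> refs) {
        Map<String, List<T>> byLocation = new LinkedHashMap<>();
        for (T ref : refs) {
            byLocation.computeIfAbsent(ref.location(), k -> new ArrayList<>()).add(ref);
        }
        return byLocation;
    }
}
```

A distributed processor could then ship each batch to the worker running at that location instead of pulling the data to the traverser.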
————
In summary, the idea is that we map all other data structures to *Like instead
of adding a bunch of new steps and having to figure out how to “import”
Traversal languages. The simple use of *Like interfaces will enable us to
encapsulate more data structures and apply Gremlin beyond graph. Now you can
start to imagine ETL use cases. For instance, RDBMS -> Graph:
g.R("people").select("name","age").
addV().property("name",select("name")).
property("age",select("age"))
...what about going Graph -> RDBMS?
g.V().hasLabel("person").select("name","age").
addR().property("name",select("name")).
property("age",select("age"))
Thoughts?
Marko.
http://markorodriguez.com