As the community murmurs here and there from time to time about the use of
reference elements rather than detached elements, I thought I'd try to
document in one post how we arrived at where we are and perhaps outline the
challenges and issues that in my mind must be addressed to pivot from where
we are now (in the event we wanted to do so). Without addressing all of
these issues holistically, it's really impossible to consider a change.
As a quick summary, there are various situations where TinkerPop requires
that a graph element (Vertex, Edge, VertexProperty, Property) be "detached"
from its owning host. More simply, the element is cloned without reference
to the graph that owned it and it becomes anonymous. We detach for IO, OLAP
and other internal functions, but where it typically affects users and is
central to community discussion is with serialization of results from remote
Gremlin traversals. While I will focus this post on the latter, discussion
around this topic likely requires knowledge of the former as all these
things have a way of tying together. As an additional contextual note, I
will mostly write in terms of "Gremlin Server" in referring to "remote
traversal execution" though I think most readers will realize that we
really refer to any remote execution of Gremlin, to include even those
graph systems that don't specifically utilize Gremlin Server, like say
CosmosDB.
As a history lesson, Gremlin Server arrived with TinkerPop 3.x and was
built as an improved version of Rexster from TinkerPop 2.x. Gremlin
Server's purpose was to host a ScriptEngine which would accept Gremlin
strings and return results in a serialized form (GraphSON or Gryo). It is
here where the notion of detachment largely became a central concept in
TinkerPop. If Gremlin returned a graph element, we would detach it from the
graph on the server and give it to the serializer to return to the client.
On the client side, we would deserialize that element back to a detached
element. Why detached for the client? Because, on the client side there is
no graph to attach it to - its owner is on the server! It must just exist
anonymously as a sort of data object that the user can access independently
from the graph.
When we first implemented detachment it seemed logical that the graph
element should be completely intact. Therefore if the element was an Edge,
it should have its id, label and all of its properties. This approach also
helped in re-using certain code for IO operations which obviously
needed properties present if it was going to serialize an entire Graph
instance. Detachment also proved useful in OLAP for storing traverser
state. Of course, in that situation keeping traverser state of a graph
element filled with properties was massively expensive. As a result,
"reference" detachment was created where we could maintain just enough
information about the element to uniquely identify it and attach it back to
its host. Therefore, with reference detachment we only needed the id and
label of the element (perhaps even only the id for identification and
attachment, but I think we included label as a convenience...don't quite
remember).
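To make the two forms concrete, here is a minimal sketch in Python. It is not TinkerPop code: ToyGraph, DetachedVertex and ReferenceVertex are hypothetical stand-ins I made up for illustration, and the only thing taken from the description above is that full detachment copies id, label and properties while reference detachment keeps just id and label, which is enough to find the element again on its host.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ReferenceVertex:
    """Reference detachment: just enough to identify the element."""
    id: Any
    label: str


@dataclass
class DetachedVertex:
    """Full detachment: id, label and a copy of every property."""
    id: Any
    label: str
    properties: Dict[str, List[Any]] = field(default_factory=dict)


class ToyGraph:
    """Hypothetical stand-in for the owning graph on the server."""

    def __init__(self):
        self._vertices = {}

    def add_vertex(self, vid, label, **props):
        # Store each property as a list so multi-properties fit the same shape.
        self._vertices[vid] = {
            "label": label,
            "properties": {k: [v] for k, v in props.items()},
        }

    def detach(self, vid):
        v = self._vertices[vid]
        # Deep copy so the result no longer shares state with the graph.
        return DetachedVertex(vid, v["label"], copy.deepcopy(v["properties"]))

    def reference(self, vid):
        return ReferenceVertex(vid, self._vertices[vid]["label"])

    def attach(self, element):
        # Attachment only needs the id, whichever form the element is in.
        return self._vertices[element.id]


host = ToyGraph()
host.add_vertex(1, "person", name="marko", age=29)

detached_v = host.detach(1)   # carries name and age with it
ref_v = host.reference(1)     # carries only id and label
```

Either object can be handed to host.attach() to get back to the live data, which is why the reference form was sufficient for traverser state in OLAP.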
Fast-forwarding several years (i.e. long enough to start getting feedback
on how TinkerPop 3.x was doing in the production world), we realized two
things that started to shape our thinking on how detachment should work:
1. Detached (i.e. with properties) was really really bad for fat vertices -
those with a lot of properties.
2. We didn't want users writing Gremlin as Groovy strings anymore and
instead wanted them to be able to write Gremlin in their native programming
language.
Regarding the first item of fat vertices, we found that this form of
detachment was leading to the writing of "bad" Gremlin. Users were writing
the SQL equivalent of SELECT * FROM table and were paying for it in
serialization costs. This issue was greatly compounded by multi-properties
where it's possible to imagine millions of properties on
a single vertex.
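As a rough illustration of that cost, the following Python sketch compares the JSON size of a fat vertex serialized with all of its properties against the same vertex as a bare reference. The dict layout, the "sensor" label and the "reading" multi-property are invented for the example and are not the actual GraphSON or Gryo formats, so the numbers only show the order of magnitude, not real wire sizes.

```python
import json


def fat_vertex(num_values):
    """Build a detached-style dict with one multi-property holding many values."""
    return {
        "id": 1,
        "label": "sensor",
        "properties": {
            "reading": [{"id": i, "value": i * 0.5} for i in range(num_values)]
        },
    }


def as_reference(vertex):
    """Keep only what is needed to identify the element."""
    return {"id": vertex["id"], "label": vertex["label"]}


fat = fat_vertex(10_000)
ref_only = as_reference(fat)

# Payload size in bytes for each form.
fat_bytes = len(json.dumps(fat).encode("utf-8"))
ref_bytes = len(json.dumps(ref_only).encode("utf-8"))

print("detached:", fat_bytes, "bytes")
print("reference:", ref_bytes, "bytes")
```

The reference payload stays the same size no matter how many values the multi-property holds, while the detached payload grows with every value added.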
Regarding the second item of writing Gremlin in languages other than Java,
we are talking about the birthing of the idea of Gremlin Language Variants.
The first of which was Python. The idea behind GLVs was not to re-write the
Gremlin Virtual Machine in every language (though that would be really cool
for so many reasons - perhaps that could still be a goal somehow someday)
but to provide a lightweight library that did some basics:
1. provide a method to spawn traversals and thus to write Gremlin
2. generate Gremlin bytecode from those traversals and submit them to
Gremlin Server
3. process results so as to allow users to work with them in their
native data structures
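Those three basics can be sketched in a few lines of Python. This is not gremlinpython: MiniTraversal and read_vertex are hypothetical names, only a handful of steps are modeled, and the bytecode and vertex shapes are simplified approximations of the GraphSON forms (real GraphSON also wraps typed values such as ids).

```python
import json


class MiniTraversal:
    """Hypothetical traversal that records steps instead of executing them."""

    def __init__(self):
        self.steps = []

    def _add(self, name, *args):
        self.steps.append([name, *args])
        return self  # returning self allows fluent chaining

    # 1. A method to spawn traversals and thus to write Gremlin.
    def V(self, *ids):
        return self._add("V", *ids)

    def has(self, key, value):
        return self._add("has", key, value)

    def out(self, *labels):
        return self._add("out", *labels)

    def values(self, *keys):
        return self._add("values", *keys)

    # 2. Generate bytecode that could be submitted to a server.
    def to_bytecode(self):
        return {"@type": "g:Bytecode", "@value": {"step": self.steps}}


# 3. Process a result into a native data structure.
def read_vertex(raw):
    """Turn a simplified GraphSON-like reference vertex into a plain dict."""
    if raw.get("@type") != "g:Vertex":
        raise ValueError("not a vertex")
    value = raw["@value"]
    return {"id": value["id"], "label": value["label"]}


bytecode = MiniTraversal().V().has("name", "marko").out("knows").values("name").to_bytecode()
print(json.dumps(bytecode))

vertex = read_vertex({"@type": "g:Vertex", "@value": {"id": 1, "label": "person"}})
print(vertex)
```

Note that nothing here evaluates a traversal. The library only describes the work and reads back the answer, which is what keeps it small.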
To stay "lightweight", maintain this narrow focus, and to begin to address
some of the issues we saw with fat vertices we opted to take a page from
OLAP and use references (i.e. no properties for graph elements) for
bytecode based requests. This approach also enabled us to move quickly
because it meant less code to write and maintain (keeping in mind that we
have never really had tons of contributors to GLVs). With less code, we
were able to quickly turn around other GLVs in JavaScript and .NET.
The side-effect of course was that users were forced into writing better
Gremlin. g.V().valueMap() became g.V().valueMap('name','age'). There were
more questions on Stack Overflow and gremlin-users about project() and other
data shaping steps. Users seemed to become more knowledgeable on how to get
their data into the shape they needed it on the server using Gremlin,
rather than pulling back everything to the client and post-processing
results there. While using reference detachment might provide some initial
confusion, it seemed to get users to the production code they would
ultimately write more quickly.
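The difference in what comes back can be mimicked in plain Python. The value_map and project functions below are hypothetical analogues written for this post, not the Gremlin steps themselves, and the property data is made up. They only model the result shapes: valueMap() yields a list of values per key, and a project() with one by() per key yields a single value per key.

```python
def value_map(properties, *keys):
    """Rough analogue of valueMap(): every property, or only the named keys."""
    if not keys:
        return {k: list(v) for k, v in properties.items()}
    return {k: list(properties[k]) for k in keys if k in properties}


def project(properties, *keys):
    """Rough analogue of project(...).by(key): one value per named key."""
    return {k: properties[k][0] for k in keys}


# Hypothetical stored properties for one vertex, each held as a list of values.
marko = {
    "name": ["marko"],
    "age": [29],
    "bio": ["x" * 5000],
    "nickname": ["m", "mark", "okram"],
}

everything = value_map(marko)           # like g.V().valueMap()
shaped = value_map(marko, "name", "age")  # like g.V().valueMap('name','age')
flat = project(marko, "name", "age")      # like project('name','age').by('name').by('age')

print(sorted(everything))
print(shaped)
print(flat)
```

The shaped and flat results carry only what the caller asked for, so the large "bio" value never leaves the server.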
Over time (years) we've become more comfortable with references and have
gone so far as to install ReferenceElementStrategy in the default packaging
of Gremlin Server which essentially makes script based requests and
bytecode based requests both detach to reference for consistency's sake. For
3.5.0 I've had it in my mind that we would consider installing that
strategy by default even for embedded graphs so that they become wholly
consistent with remote graphs (not yet discussed...just a thought. No need
to dig into it on this thread. I'm only trying to demonstrate the
pervasiveness of this idea and how far it has come over time).
So with all that history in mind and with us now up to date with the
current way of things, here are some nice things about reference detachment:
1. Fat vertices are a non-issue.
2. GLVs stay lightweight, and given that we don't have heavy contributor
support there, it makes sense to keep it that way.
3. Users are forced to write "good" Gremlin by (a) specifying the
properties they wish to return and (b) shaping their data on the server.
I think some downsides to reference detachment are:
1. If you are building some kind of visualization tool, analytics system or
OGM then graph elements are your primary currency and not having properties
strongly attached to their elements makes things harder depending on what
you're doing. Of course, your system is exposed to fat vertex problems so
the can has been kicked down the proverbial road.
2. Users get hit with some initial confusion as to why their graph element
doesn't have properties (this ends up being a common question from new users).
3. There is something nice about working with a Vertex or an Edge object in
your native language that gets lost in transformation to Map/List or other
representations. I'm describing more of a feeling here but there is also
something practical - you lose the notion of what the element was!
valueMap() and elementMap(), project(), etc. don't automatically yield you
any information as to whether that data came from a Vertex, Edge or
whatever. In some cases, that might be useful - like item 1 for example.
Anyone thinking about a discussion around using detached elements instead
of references should consider all of the above information, but should also
get familiar with the code itself as any real design talk is going to
impact everything from gremlin-core all the way up through to the GLVs. I
would also offer that folks should think about related issues so that if we
were to make some change we don't code ourselves into a corner with those
things - here are a few examples:
1. We have a definite problem with subgraph() step for GLVs as we have
nothing to deserialize a Graph instance into - there is no TinkerGraph in
Python.
2. tree() is a bit of a second class citizen and I often wonder if
reference elements make tree() even less useful.
3. What is the nature of a detached Path object and does it have the same
problem as the results of tree()?
4. Is the notion of attachment a bit of an illusion? We can only attach in
JVM specific ways related to the Graph API and embedded operations which
leaves GLVs (and even Java remoting) at a disadvantage.
5. GraphBinary has left a placeholder for properties in elements but is
natively built to not serialize properties at all. This approach of course
creates an asymmetry with other serialization formats like GraphSON and
Gryo as it pertains to IO.
Hopefully this thread has demonstrated that this issue is bigger than
it appears. We can't just add some new configuration to Gremlin Server to
easily make this all work. It's simply a more nuanced problem than that.
I'd ask that we restrict this thread to comments that expand on the
history of this issue, other advantages/disadvantages or other areas of the
code base that should be considered around this topic. Please take future
design discussion to a new thread.
Thanks,
Stephen