Hello,

> I think this does satisfy your requirements, though I don't think I
> understand all aspects of the approach, especially the need for
> TinkerPop-specific types *for basic scalar values* like booleans, strings,
> and numbers, since we are committed to the native data types supported by
> the JVM.
TinkerPop4 will have VM implementations on various language platforms. For sure, Apache’s distribution will have a JVM and a .NET implementation. The purpose of TinkerPop-specific types (and not JVM, Mono, Python, etc. types) is that we know it’s the same type across all VMs.

> To my mind, your approach is headed in the direction of a
> TinkerPop-specific notion of a *type*, in general, which captures the
> structure and constraints of a logical data type
> <https://www.slideshare.net/joshsh/a-graph-is-a-graph-is-a-graph-equivalence-transformation-and-composition-of-graph-data-models-129403012/42>,
> and which can be used for query planning and optimization. These include
> both scalar types as well as vertex, edge, and property types, as well as
> more generic constructs such as optionals, lists, records.

Yes, I’d like to be able to use some type of formal data type specification. You have those skills. I don’t. My rudimentary (non-categorical) representation is just “common useful data structures”: map, list, bool, string, etc.

> Can a TList really only contain primitives? A list of vertices or edges
> would definitely be unusual, and typical PG implementations may not choose
> to support them, but a language-agnostic VM possibly should. They would
> nicely capture RDF lists, in which list nodes typically do not have any
> properties (edges) other than rdf:first and rdf:rest.

A TList only supports primitives. However, a TRDFList could be a complex type for dealing with RDF lists and would be contained within the TP4-VM. Adding complex types is okay; it doesn’t break anything. As a related concept, note that TDocument has a TDocumentArray, not a TList. This is because TDocuments can have “lists” that contain primitives, documents, and lists.
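To make the TList / TDocumentArray distinction concrete, here is a minimal Python sketch. The type names come from this thread, but the class layout, the constructor checks, and the choice to treat only bool, int, float, and str as scalars are assumptions made for illustration, not a TP4 API.

```python
# Hypothetical sketch: class shapes and validation rules are assumptions.
SCALARS = (bool, int, float, str)


class TDocument:
    """Stand-in for a complex type: opaque, no accessor methods."""

    def __init__(self, fields):
        self._fields = dict(fields)


class TList:
    """Primitive list: may only hold scalars, and exposes them via get()."""

    def __init__(self, items):
        items = list(items)
        for item in items:
            if not isinstance(item, SCALARS):
                raise TypeError(f"TList cannot hold {type(item).__name__}")
        self._items = items

    def get(self):
        # Return a copy so callers cannot mutate the wrapped list.
        return list(self._items)


class TDocumentArray:
    """Complex list: may hold scalars, documents, and nested arrays.

    It deliberately has no get(); its contents would only be reachable
    through VM instructions, never through a client-side method.
    """

    def __init__(self, items):
        self._items = list(items)


names = TList(["marko", "josh"])
mixed = TDocumentArray(["marko", TDocument({"age": 29}), TDocumentArray([1, 2])])
print(names.get())
```

The point of the split is that a TList can cross the VM-client barrier as plain data, while a TDocumentArray stays inside the VM alongside the other complex types.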
> For hypergraphs, an inV and outV which may produce more than one vertex is
> one way to go, but a labeled hypergraph should really have other projections
> <https://www.slideshare.net/joshsh/a-graph-is-a-graph-is-a-graph-equivalence-transformation-and-composition-of-graph-data-models-129403012/49>
> in addition to inV, outV. That suggests a more generic step than inV or
> outV, which takes as an argument the name of the projection as well as the
> in/out element. E.g. project("in", v1), project("out", v1),
> project("subject", v1).

Hm. Yea, I’m not too strong with hypergraph thinking.

    g.V(1)                              // vertex
    g.V(1).outE('family')               // hyperedges
    g.V(1).outE('family').inV('father') // ? perhaps inV/outV/bothV can take a String… label?

We should talk to the GRAKN.AI guys and see what they think.

    https://grakn.ai/
    https://dev.grakn.ai/docs/general/quickstart

> For undirected graphs, we might as well just allow both in() and out()
> rather than throwing exceptions. You can think of an undirected edge as a
> pair of directed edges.

Okay.

> Agreed that provider-specific structures (types) are OK, and should not be
> discouraged. Not only do different providers have their own data models,
> but specific applications have their own schemas. A structure like a
> metaproperty may be allowed in certain contexts and not others, and the
> same goes for instances of conventional structures like edges of a certain
> label.

Yes. I want to make sure we naturally/natively support property graphs, RDF graphs, hypergraphs, tables, documents, etc. Property graphs (as specified by Neo4j) are not “special” in TP4. Like Gremlin for languages, property graphs sit side-by-side w/ other data structures. If we do this right, we will be heroes!
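The projection idea and the "undirected edge as a pair of directed edges" reading can be sketched in a few lines of Python. Everything here (the HyperEdge class, the project() helper, the role names) is hypothetical and only illustrates the shape of the proposal; it is not an existing Gremlin step.

```python
# Hypothetical sketch: a hyperedge maps role names to lists of vertex ids.
class HyperEdge:
    def __init__(self, label, roles):
        self.label = label
        self.roles = {role: list(vertices) for role, vertices in roles.items()}


def project(role, edge):
    """Generic projection: return the vertices playing `role` in `edge`."""
    return list(edge.roles.get(role, []))


def in_v(edge):
    # inV is just the projection named "in".
    return project("in", edge)


def out_v(edge):
    # outV is just the projection named "out".
    return project("out", edge)


def undirected(a, b, label):
    """Model an undirected edge as a pair of directed edges."""
    return [
        HyperEdge(label, {"out": [a], "in": [b]}),
        HyperEdge(label, {"out": [b], "in": [a]}),
    ]


family = HyperEdge("family", {"father": [2], "mother": [3], "child": [1, 4]})
print(project("father", family))
print(project("child", family))
```

Under this reading, inV('father') in the traversal above would amount to project("father", edge), and in()/out() on an undirected graph both work because each direction has its own edge.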
> For multi-properties, there is a distinction to be made between multiple
> properties with the same key and element, and single collection-valued
> properties. This is something the PG Working Group has been grappling with.
> I think both should be allowed.

Agreed. This all gets back to a way to specify what the data structure is:

    JanusGraph: a single-labeled property graph with multi/meta-properties.
    Neo4j: a multi-labeled property graph with singleton properties (w/ list values supported).
    RDF: an unlabeled 1-property graph (named graph property?) with vertex-based literals.
    … ?

Like Graph.Features in TP3.

> IMO it's OK if URIs, in an RDF context, become Strings in a TP context. You
> can think of URI as a constraint on String, which should be enforced at the
> appropriate time, but does not require a vendor-specific class. Can you
> concatenate two URIs? Sure... just concatenate the Strings, but also be
> aware that the result is not a URI.

Cool.

Thanks for reading and providing good ideas.
Marko.

http://rredux.com

> On Mon, Apr 15, 2019 at 5:06 AM Marko Rodriguez wrote:
>
>> Hello,
>>
>> I have a consolidated approach to handling data structures in TP4. I would
>> appreciate any feedback you may have.
>>
>> 1. Every object processed by TinkerPop has a TinkerPop-specific type.
>>     - TLong, TInteger, TString, TMap, TVertex, TEdge, TPath, TList, …
>>     - BENEFIT #1: A universal type system will protect us from
>>       language platform peculiarities (e.g. Python long vs Java long).
>>     - BENEFIT #2: The serialization format is constrained and
>>       consistent across all language platforms (no more coming across a
>>       MySpecialClass).
>> 2. All primitive T-type data can be directly accessed via get().
>>     - TBoolean.get() -> java.lang.Boolean | System.Boolean | ...
>>     - TLong.get() -> java.lang.Long | System.Int64 | ...
>>     - TString.get() -> java.lang.String | System.String | …
>>     - TList.get() -> java.util.ArrayList | .. // can only contain primitives
>>     - TMap.get() -> java.util.LinkedHashMap | .. // can only contain primitives
>>     - ...
>> 3. All complex T-types have no methods! (except those afforded by Object)
>>     - TVertex: no accessible methods.
>>     - TEdge: no accessible methods.
>>     - TRow: no accessible methods.
>>     - TDocument: no accessible methods.
>>     - TDocumentArray: no accessible methods. // a document list field
>>       that can contain complex objects
>>     - ...
>>
>> REQUIREMENT #1: We need to be able to support multiple graphdbs in the
>> same query.
>>     - e.g., read from JanusGraph and write to Neo4j.
>> REQUIREMENT #2: We need to make sure complex objects cannot be queried
>> client-side for properties/edges/etc. data.
>>     - e.g., vertices are universally assumed to be “detached.”
>> REQUIREMENT #3: We no longer want to maintain a structure test suite.
>> Operational semantics should be verified via Bytecode ->
>> Processor/Structure.
>>     - i.e., the only way to read/write vertices is via Bytecode, as
>>       complex T-types don’t have APIs.
>> REQUIREMENT #4: We should support other database data structures besides
>> graph.
>>     - e.g., reading from MySQL and writing to JanusGraph.
>>
>> ---
>>
>> Assume the following TraversalSource:
>>
>>     g.withStructure(JanusGraphStructure.class, config1).
>>       withStructure(Neo4jStructure.class, config2)
>>
>> Now, assume the following traversal fragment:
>>
>>     outE('knows').has('stars',5).inV()
>>
>> This would initially be written to Bytecode as:
>>
>>     [[outE,knows],[has,stars,5],[inV]]
>>
>> A decoration strategy realizes that there are two structures registered in
>> the Bytecode source instructions and would rewrite the above as:
>>
>>     [choose,[[type,TVertex]],[[outE,knows],[has,stars,5],[inV]]]
>>
>> A JanusGraph strategy would rewrite this as:
>>
>>     [choose,[[type,TVertex]],[[outE,knows],[has,stars,5],[inV]],[[type,JanusVertex]],[[jg:vertexCentric,out,knows,stars,5]]]
>>
>> A Neo4j strategy would rewrite this as:
>>
>>     [choose,[[type,TVertex]],[[outE,knows],[has,stars,5],[inV]],[[type,JanusVertex]],[[jg:vertexCentric,out,knows,stars,5]],[[type,Neo4jVertex]],[[neo:outE,knows],[neo:has,stars,5],[neo:inV]]]
>>
>> A finalization strategy would rewrite this as:
>>
>>     [choose,[[type,JanusVertex]],[[jg:vertexCentric,out,knows,stars,5]],[[type,Neo4jVertex]],[[neo:outE,knows],[neo:has,stars,5],[neo:inV]]]
>>
>> Now, when a TVertex gets to this CFunction, it will check its type. If it's
>> a JanusVertex, it goes down the JanusGraph-specific instruction branch. If
>> the type is Neo4jVertex, it goes down the Neo4j-specific instruction branch.
>>
>> REQUIREMENT #1 SOLVED
>>
>> The last instruction of the root bytecode cannot return a complex object.
>> If it does, an exception is thrown. g.V() is illegal. g.V().id() is legal.
>> Complex objects do not exist outside the TP4-VM. Only primitives can cross
>> the VM-client barrier. If you want vertex property data (e.g.), you have to
>> access it and return it within the traversal, e.g., g.V().valueMap().
>>     BENEFIT #1: Language variant implementations are simple. Just
>>     primitives.
>>     BENEFIT #2: The serialization specification is simple. Just
>>     primitives. (Also, note that Bytecode is just a TList of primitives,
>>     though TBytecode will exist.)
>>     BENEFIT #3: The concept of a “DetachedVertex” is universally
>>     assumed.
>>
>> REQUIREMENT #2 SOLVED
>>
>> It is completely up to the structure provider to use structure-specific
>> instructions for dealing with their particular TVertex. They will have to
>> provide CFunction implementations for out, in, both, has, outE, inE, bothE,
>> drop, property, value, id, label … (seems like a lot, but out/in/both could
>> be one parameterized CFunction).
>>     BENEFIT #1: No more structure/ API and structure/ test suite.
>>     BENEFIT #2: The structure provider has full control of where the
>>     vertex data is stored (cached in memory or fetched from the db or a cut
>>     vertex or …). No assumptions are made by the TP4-VM.
>>     BENEFIT #3: The structure provider can safely assume their
>>     vertices will not be accessed outside the TP4-VM (outside the processor).
>>
>> REQUIREMENT #3 SOLVED
>>
>> We can support TRow for relational databases. A TRow’s data is accessible
>> via the instructions has, hasKey, value, property, id, ... The location of
>> the data in TRow is completely up to the structure provider and its
>> strategy analysis (if only 'name' is accessed, then SELECT 'name' FROM...).
>> We can easily support TDocument for document databases. A TDocument’s data
>> is accessible via the instructions has, hasKey, value, property, id, … A
>> value() could return yet another TDocument (or a TDocumentArray containing
>> TDocuments).
>>
>> Supporting a new complex type is simply a function of asking:
>>
>>     “Does the TP4-VM instruction set have the requisite
>>     instruction-types (semantically) to manipulate this structure?”
>>
>> We are no longer playing the language-specific object API game. We are
>> playing the language-agnostic VM instruction game. The TP4-VM instruction
>> set is the sole determiner of what complex objects can be processed (i.e.
>> what data structures can be processed without impedance mismatch).
>>
>> REQUIREMENT #4 SOLVED
>>
>> ---
>>
>> The TP4-VM (and, in turn, Gremlin) can naturally support:
>>
>> 1. Property graphs: as currently supported in TP3.
>> 2. RDF graphs: id() is a URI | Literal. g.V(1).value('foaf:name')
>>    returns multi/meta-properties *or* g.V(1).out('foaf:name') returns
>>    vertices whose id()s are xsd:string literals.
>> 3. Hypergraphs: inV() can return more than one vertex.
>> 4. Undirected graphs: in() and out() throw exceptions. Only both() works.
>> 5. Meta-properties: value('name') can return a TVertexProperty (a
>>    special complex object that is structure provider specific, and that
>>    is okay!).
>> 6. Multi-properties: value('name') can return a TPropertyArray of
>>    TVertexProperty objects.
>>
>> This means that the same instruction can behave differently for different
>> structures. This is okay as there can be property graph, RDF, hypergraph,
>> etc. test suites.
>>
>> Since complex objects don’t leave the TP4-VM barrier, providers can create
>> any complex objects they want; they just have to have corresponding
>> strategies to create provider-unique bytecode instructions (and thus,
>> CFunctions) for those complex objects.
>>
>> Finally, there are a few problems to work out:
>>     - There is no way to yield a “v[1]” or “e[3][v[1]-knows->v[2]]”
>>       representation. Is that bad? Perhaps not.
>>     - What is the nature of a TPath? It’s complex, but we want to
>>       return it.
>>     - g.V().id() on an RDF graph can return a URI. Is a URI “simple”?
>>       No, the set of simple types should never grow!…. thus, URI => String.
>>       Is that wack?
>>     - Do we add g.R() and g.D() to Gremlin to type-support TRow and
>>       TDocument objects? g.V() would be weird :( … Hmmmm?
>>         - However, there are only so many data structures……. or
>>           are there? TMatrix, TXML, …. whoa.
>>
>> Thanks for reading,
>> Marko.
>>
>> http://rredux.com
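
The strategy chain in the quoted proposal (decorate, provider rewrites, finalize, then dispatch on vertex type) can be traced with plain nested lists. The following Python sketch is only an illustration: the function names, the pattern match in the JanusGraph step, and the "first branch is the generic TVertex branch" convention are assumptions; only the bytecode shapes are taken from the message.

```python
# Hypothetical sketch of the choose-rewriting pipeline; not a TP4 API.
GENERIC = [["type", "TVertex"]]


def decorate(fragment):
    """Wrap the generic fragment in a choose keyed on the generic type."""
    return ["choose", GENERIC, fragment]


def branches(choose):
    """Pair up (predicate, branch) entries following the 'choose' head."""
    rest = choose[1:]
    return list(zip(rest[0::2], rest[1::2]))


def janus_strategy(choose):
    """Append a vertex-centric branch if the fragment is outE/has/inV."""
    generic = choose[2]  # assumed: the generic branch comes first
    if (len(generic) == 3 and generic[0][0] == "outE"
            and generic[1][0] == "has" and generic[2] == ["inV"]):
        label = generic[0][1]
        key, value = generic[1][1], generic[1][2]
        return choose + [[["type", "JanusVertex"]],
                         [["jg:vertexCentric", "out", label, key, value]]]
    return choose


def neo4j_strategy(choose):
    """Append a branch that prefixes each generic instruction with 'neo:'."""
    prefixed = [["neo:" + inst[0]] + inst[1:] for inst in choose[2]]
    return choose + [[["type", "Neo4jVertex"]], prefixed]


def finalize(choose):
    """Drop the generic TVertex branch, keeping provider branches only."""
    result = ["choose"]
    for predicate, branch in branches(choose):
        if predicate != GENERIC:
            result += [predicate, branch]
    return result


def select_branch(choose, vertex_type):
    """Pick the instruction branch whose type predicate matches."""
    for predicate, branch in branches(choose):
        if predicate == [["type", vertex_type]]:
            return branch
    raise LookupError(f"no branch for {vertex_type}")


fragment = [["outE", "knows"], ["has", "stars", 5], ["inV"]]
final = finalize(neo4j_strategy(janus_strategy(decorate(fragment))))
print(select_branch(final, "JanusVertex"))
print(select_branch(final, "Neo4jVertex"))
```

Each step only appends to or filters the list, which is why the provider strategies can be applied in any order before finalization.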
