Hi Marko,

Just a quick reply for now, but this sounds like a good game plan. There is certainly a ton of research on query optimization, adaptive query processing, etc. that could help guide the development, and more opportunities for static analysis (i.e. a type system) will also help.
Btw., my initial implementation of MatchStep had some stats-driven optimization built in. It was a little "off the top of my head", and also not OLAP-friendly, but it would be worthwhile to revisit this in a bigger and more formal way in TP4.

Josh

On Sun, May 12, 2019 at 10:37 AM Marko Rodriguez <[email protected]> wrote:

> Hi,
>
> Thank you for your reply Josh. I think we are ultimately converging on:
>
>     TinkerPop4 as an open source, programmable query planner.
>
> We are going in this direction because of our three requirements:
>     1. Easy for language designers to have their language work with various data processing/storage systems.
>     2. Easy for processing engines to execute language expressions over data.
>     3. Easy for data storage systems to expose their data for computation.
> All the “easy”-talk implies “hard”-work for us. :)
>
> Thus far, with TinkerPop as a graph processing framework, query planning has been relatively straightforward, as we have relied on:
>     (1) users knowing the shape of their data and writing queries that are ‘smart’.
>     (2) compiler strategies rewriting common expressions into cheaper expressions.
>     (3) our match()-step dynamically re-sorting patterns based on runtime cost analyses.
> However, I think moving forward (to capture more complex data scenarios) we will need data storage system providers to expose statistics about their data. What does this entail? I believe it entails allowing data storage systems to expose:
>     (1) data paths (i.e. supported references through the data)
>     (2) data statistics (i.e. the time and space costs associated with particular data paths)
>
> For SQL query planning, it will take a few years for our framework to become top-notch. However, for Gremlin/Cypher, RQL/SPARQL, MongoQuery/XPath, CQL, etc., I believe we can pull it off with the resources we have on our first release, as these “NoSQL”-systems tend to have simple ‘data paths’, with, arguably, graph and RDF being the most difficult to reason about.
>
> ———
>
> What does the TP4 VM need to know, and how will the various system components (language, processor, structure) provide that information?
>
> I believe we have been talking about this the whole time, except now I am introducing costs.
>     * What are the types of tuples and how do they relate?
>     * How much does it cost to move through these tuple-relations?
>
> pg.graph
>     [data access instructions]
>         V()
>         V(id)
>         V(key,value)
>     [data cost instructions]
>         cost(V())
>         cost(V(id))
> pg.graph.vertex
>     [data access instructions]
>         out()
>         out(string)
>     [data cost instructions]
>         cost(out())
>         cost(out(string))
> …
>
> In other words, for every type of tuple, we need to know what instructions it supports and we need to know the time/space costs of said instructions.
>
> The TP4 VM uses the “data cost”-instructions to construct the query plan.
>     cost(out(‘knows’)) = { space:10345, time:O(1) }      // in-memory graphdb
>     cost(out(‘knows’)) = { space:10345, time:O(log(n)) } // RDBMS-encoded graphdb
> TP4 processors use the “data access”-instructions to process the data.
>     out(‘knows’) -> Iterator<Vertex>
>
> What I showed above was PropertyGraphs as of TP2. What about when labels and schemas are involved? This is where Josh’s concept of “typing” comes into play.
>
> pg.graph
>     pg.graph.vertex.person
>         out(‘knows’)
>         out(‘created’)
>         cost(out(‘knows’))
>         cost(out(‘created’))
>     pg.graph.vertex.project
>     pg.graph.edge.knows
>     pg.graph.edge.created
>     ...
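
As a rough illustration of the “data access”/“data cost” split described above, a structure provider might expose something like the following. This is only a sketch under assumed names (DataPathCosts, Cost); it is not an agreed-upon TP4 interface:

    import java.util.Set;

    // Hypothetical sketch only -- not an agreed-upon TP4 interface.
    // A structure provider advertises which "data access" instructions a tuple
    // type supports and the estimated cost of each ("data cost" instructions).
    public interface DataPathCosts {

        // e.g. {"V()", "V(id)", "out(string)"} for pg.graph / pg.graph.vertex
        Set<String> supportedInstructions();

        // e.g. cost("out", "knows") -> space:10345, time:"O(log(n))"
        Cost cost(String instruction, Object... arguments);

        final class Cost {
            public final long space;             // estimated number of tuples touched
            public final String timeComplexity;  // e.g. "O(1)", "O(log(n))"

            public Cost(final long space, final String timeComplexity) {
                this.space = space;
                this.timeComplexity = timeComplexity;
            }
        }
    }

The VM would then compare such cost estimates across candidate access paths when ordering joins or rewriting bytecode, without ever needing provider-specific strategies.
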
>
> With schema, we of course get more refined statistics….
>
> Thus, I think that we are ultimately trying to create a Multi-Model ADT that exposes data paths (the explicit structure in the data, including auxiliary indices) and data costs (the time/space-statistics of such paths). The TP4 VM uses that information to:
>
>     1. Take unoptimized bytecode from the language provider (easy for them, only operational semantic query planning required).
>     2. Convert that bytecode into an optimized bytecode for the data storage system (easy for them, they only need to say what instructions they support and their costs).
>     3. Submit that bytecode to the processor for execution (easy for them, they simply use their query planner to execute the data-storage-optimized data flow).
>
> Thoughts?,
> Marko.
>
> http://rredux.com
>
>
> > On May 11, 2019, at 9:08 AM, Joshua Shinavier <[email protected]> wrote:
> >
> > Oops, looked back at my email and noticed some nonsense. This:
> >
> >     knows_weight_gt_85(v, e) = knows(v, e) ∧ weight(e, w) ∧ w>0.85
> >
> > should be more like this:
> >
> >     knows_weight_gt_85(_, v, e) = knows(e, v, _) ∧ weight(_, e, w) ∧ w>0.85
> >
> > because we need the identity of edges (and by extension, properties). So we treat edge relations like triples, where the first element of the triple is the identity of the edge. The _ symbol represents "don't care" variables.
> >
> > Josh
> >
> >
> > On Sat, May 11, 2019 at 8:01 AM Joshua Shinavier <[email protected]> wrote:
> >
> >> Hi Marko,
> >>
> >> Responses inline.
> >>
> >>
> >> On Tue, May 7, 2019 at 6:26 AM Marko Rodriguez <[email protected]> wrote:
> >>
> >>> Whoa.
> >>>
> >>> Check out this trippy trick.
> >>>
> >>> First, here is how you define a pointer to a map-tuple.
> >>>
> >>>     *{k1?v1, k2?v2, …, kn?vn}
> >>>         * says “this is a pointer to a map” { }
> >>>         ? is some comparator like =, >, <, !=, contains(), etc.
> >>>
> >>
> >> OK.
> >>
> >>
> >>> Assume the vertex map tuple v[1]:
> >>>
> >>>     {#id:1, #label:person, name:marko, age:29}
> >>>
> >>> Now, we can add the following fields:
> >>>
> >>> 1. #outE:*{#outV=*{#id=1}} // references all tuples that have an outV field that is a pointer to the v[1] vertex tuple.
> >>>
> >>
> >> Yes, I agree. You won't be surprised to hear me say that this boils down to good old select() and project(), e.g.
> >>
> >>     g.V(1).select("*", "out")
> >>
> >> Here, I am using "*" as shorthand for "any matching relation", i.e. anything with an "out" (your "outV") field.
> >>
> >>
> >>> 2. #outE.knows:*{#outV=*{#id=1},#label=knows} // references all outgoing knows-edges.
> >>>
> >>
> >>     g.V(1).select("knows", "out")
> >>
> >>
> >>> 3. #outE.knows.weight_gt_85:*{#outV=*{#id=1},#label=knows,weight>0.85} // references all strong outgoing knows-edges
> >>>
> >>
> >>     g.V(1).select("knows", "out").as("e").select("weight", "out").project("in").is(P.gt(0.85)).back("e")
> >>
> >> I like how you are giving a name to a new relation built up of other relations. In relational calculus, this looks something like:
> >>
> >>     knows_weight_gt_85(v, e) = knows(v, e) ∧ weight(e, w) ∧ w>0.85
> >>
> >> And I do think we should make defining new relations as straightforward as possible in TP4. Compositionality is life.
> >>
> >>
> >>> By using different types of pointers, a graph database provider can make explicit their internal structure.
> >>> Assume all three fields above are in the v[1] vertex tuple. This means that:
> >>>
> >>>     1. all of v[1]’s outgoing edges are grouped together. <— linear scan
> >>>
> >>
> >> By convention, a relation could be indexed left to right, so knows_weight_gt_85(v, e) would express exactly that.
> >>
> >>
> >>>     2. all of v[1]’s outgoing knows-edges are grouped together. <— indexed by label
> >>>
> >>
> >> Same.
> >>
> >>
> >>>     3. all of v[1]’s strong outgoing knows-edges are grouped together. <— indexed by label and weight
> >>>
> >>
> >> Yep.
> >>
> >>
> >>> Thus, a graph database provider can describe the way in which it internally organizes adjacent edges — i.e. vertex-centric indices!
> >>
> >> This looks like convergence.
> >>
> >>
> >>> This means then that TP4 can do vertex-centric index optimizations automatically for providers!
> >>>
> >>
> >> Ex-actly.
> >>
> >>
> >>>     1. values(“#outE”).hasLabel(‘knows’).has(‘weight’,gt(0.85))  // grab all edges, then filter on label, then filter on weight.
> >>>     2. values(“#outE.knows”).has(‘weight’,gt(0.85))              // grab all knows-edges, then filter on weight.
> >>>     3. values(“#outE.knows.weight_gt_85”)                        // grab all strong knows-edges.
> >>>
> >>> *** Realize that Gremlin outE() will just compile to bytecode values(“#outE”).
> >>>
> >>> Freakin’ crazy! … Josh was interested in using the n-tuple structure to describe indices. I was against it. I believe I still am. However, this is pretty neat. As Josh was saying though, with a rich enough n-tuple description of the underlying database, there should be no reason for providers to have to write custom strategies and instructions ?!?!?!?!? crazy!?
> >>>
> >>
> >> I think we might not mean the same thing by "indices", so maybe we just don't get hung up on that term, but we are on the same page w.r.t. what you wrote in this email. What's more, these indices... ok, maybe we do need to call them indices... can be relations of more than two variables. See the geospatial index example from my previous email. A vertex-centric index is to an edge what a generic index is to a hyperedge.
> >>
> >> Josh
> >>
> >>
> >>>
> >>> Marko.
> >>>
> >>> http://rredux.com
> >>>
> >>>
> >>>> On May 7, 2019, at 4:44 AM, Marko Rodriguez <[email protected]> wrote:
> >>>>
> >>>> Hey Josh,
> >>>>
> >>>>> I think of your Pointer<T> as a reference to an entity. It does not contain the entity it refers to, but it contains the primary key of that entity.
> >>>>
> >>>> Exactly! I was just thinking that last night. Tuples don’t need a separate ID system. No -- pointers reference the primary key of a tuple! Better yet perhaps, they can reference one-to-many. For instance:
> >>>>
> >>>>     { id:1, label:person, name:marko, age:29, outE:*(outV=id) }
> >>>>
> >>>> Thus, a pointer is defined by a pattern match. Haven’t thought through the consequences, but … :)
> >>>>
> >>>>> Here, I have invented an Entity class to indicate that the pointer resolves to a vertex (an entity without a tuple, or rather with a 0-tuple -- the unit element).
> >>>>
> >>>> Ah — the 0-tuple. Neat thought.
> >>>>
> >>>> I look forward to your slides from the Knowledge Graph Conference. If I wasn’t such a reclusive hermit, I would have loved to have joined you there.
> >>>>
> >>>> Take care,
> >>>> Marko.
> >>>>
> >>>> http://rredux.com
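
One possible (purely hypothetical) reading of the outE:*(outV=id) idea above: a pointer carries a predicate and resolves, one-to-many, to every tuple that matches it. The class and method names below are illustrative only, not anything from the gist:

    import java.util.List;
    import java.util.Map;
    import java.util.function.Predicate;
    import java.util.stream.Collectors;

    // Hypothetical sketch of a "pointer defined by a pattern match".
    public final class PatternPointer {

        private final Predicate<Map<String, Object>> pattern;

        public PatternPointer(final Predicate<Map<String, Object>> pattern) {
            this.pattern = pattern;
        }

        // Resolve the pointer to every tuple matching the pattern (one-to-many).
        // A real provider would answer this from an index rather than a scan.
        public List<Map<String, Object>> resolve(final List<Map<String, Object>> tuples) {
            return tuples.stream().filter(this.pattern).collect(Collectors.toList());
        }
    }

    // Usage: v[1].outE resolves to all tuples whose outV field references id 1.
    //   PatternPointer outE = new PatternPointer(t -> Integer.valueOf(1).equals(t.get("outV")));
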
> >>>>
> >>>>> On Mon, May 6, 2019 at 9:38 PM Marko Rodriguez <[email protected]> wrote:
> >>>>>
> >>>>>> Hey Josh,
> >>>>>>
> >>>>>>> I am feeling the tuples... as long as they can be typed, e.g.
> >>>>>>>
> >>>>>>>     <V> myTuple.get(Integer) -- int-indexed tuples
> >>>>>>>     <V> myTuple.get(String) -- string-indexed tuples
> >>>>>>>
> >>>>>>> In most programming languages, "tuples" are not lists, though they are typed by a list of element types. E.g. in Haskell you might have a tuple with the type
> >>>>>>>
> >>>>>>>     (Double, Double, Bool)
> >>>>>>
> >>>>>> Yes, we have Pair<A,B>, Triple<A,B,C>, Quadruple<A,B,C,D>, etc. However, for a base Tuple<A> of unknown length, the best I can do in Java is <A>. :|
> >>>>>> You can see my stubs in the gist: https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8 (LINES #21-42)
> >>>>>>
> >>>>>>> If this is in line with your proposal, then we agree that tuples should be the atomic unit of data in TP4.
> >>>>>>
> >>>>>> Yep. Vertices, Edges, Rows, Documents, etc. are all just tuples. However, I suspect that we will disagree on some of my tweaks. Thus, I’d really like to get your feedback on:
> >>>>>>
> >>>>>>     1. pointers (tuple entries referencing tuples).
> >>>>>>     2. sequences (multi-value tuple entries).
> >>>>>>     3. # hidden map keys :|
> >>>>>>         - sorta ghetto.
> >>>>>>
> >>>>>> Also, I’m still not happy with db().has().has().as(‘x’).db().where()… it’s an intense syntax and it’s hard to strategize.
> >>>>>>
> >>>>>> I really want to nail down this “universal model” (tuple structure and tuple-oriented instructions), as then I can get back on the codebase and start to flesh this stuff out with confidence.
> >>>>>>
> >>>>>> See ya,
> >>>>>> Marko.
> >>>>>>
> >>>>>> http://rredux.com
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Josh
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, May 6, 2019 at 5:34 PM Marko Rodriguez <[email protected]> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I spent this afternoon playing with n-tuples, pointers, data model interfaces, and bytecode instructions.
> >>>>>>>
> >>>>>>> https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8
> >>>>>>>
> >>>>>>> *** Kuppitz: They are tuples :). A Map<K,V> extends Tuple<Pair<K,V>>. Tada!
> >>>>>>>
> >>>>>>> What I like about this is that it combines the best of both worlds (Josh+Marko).
> >>>>>>>     * just flat tuples of arbitrary length.
> >>>>>>>     * pattern matching for arbitrary joins. (k1=k2 AND k3=k4 …)
> >>>>>>>     * pointer chasing for direct links. (edges, foreign keys, document _id references, URI resolutions, …)
> >>>>>>>     * sequences are a special type of tuple used for multi-valued entries.
> >>>>>>>     * has()/values()/etc. work on all tuple types! (maps, lists, tuples, vertices, edges, rows, statements, documents, etc.)
> >>>>>>>
> >>>>>>> Thoughts?,
> >>>>>>> Marko.
> >>>>>>>
> >>>>>>> http://rredux.com
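
To make the “everything is a tuple” idea above concrete, here is a minimal, purely illustrative sketch of a flat, string-indexed tuple. It is not the actual TP4 code; the real stubs live in the gist linked above:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal sketch of a flat, string-indexed n-tuple (illustrative only).
    public class Tuple {

        private final Map<String, Object> fields = new LinkedHashMap<>();

        public Tuple put(final String key, final Object value) {
            this.fields.put(key, value);
            return this;
        }

        // String-indexed access: the same call works whether the tuple is playing
        // the role of a map, vertex, edge, row, statement, or document.
        @SuppressWarnings("unchecked")
        public <V> V get(final String key) {
            return (V) this.fields.get(key);
        }
    }

    // Usage: a vertex is then just a tuple with some "#"-hidden keys.
    //   Tuple v1 = new Tuple().put("#id", 1).put("#label", "person").put("name", "marko").put("age", 29);
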
