Hi Marko,

Just a quick reply for now, but this sounds like a good game plan. There is certainly a ton of research on query optimization, adaptive query processing, etc. that could help guide the development, and more opportunities for static analysis (i.e. a type system) will also help.
Btw., my initial implementation of MatchStep had some stats-driven optimization built in. It was a little "off the top of my head", and also not OLAP-friendly, but it would be worthwhile to revisit this in a bigger and more formal way in TP4.

Josh

On Sun, May 12, 2019 at 10:37 AM Marko Rodriguez <[email protected]> wrote:

> Hi,
>
> Thank you for your reply Josh. I think we are ultimately converging on:
>
>     TinkerPop4 as an open source, programmable query planner.
>
> We are going in this direction because of our three requirements:
>     1. Easy for language designers to have their language work with various data processing/storage systems.
>     2. Easy for processing engines to execute language expressions over data.
>     3. Easy for data storage systems to expose their data for computation.
> All the “easy”-talk implies “hard”-work for us. :)
>
> Thus far, with TinkerPop as a graph processing framework, query planning has been relatively straightforward, as we have relied on:
>     (1) users knowing the shape of their data and writing queries that are ‘smart’.
>     (2) compiler strategies rewriting common expressions into cheaper expressions.
>     (3) our match()-step dynamically re-sorting patterns based on runtime cost analyses.
> However, I think moving forward (to capture more complex data scenarios) we will need data storage system providers to expose statistics about their data. What does this entail? I believe it entails allowing data storage systems to expose:
>     (1) data paths (i.e. supported references through the data)
>     (2) data statistics (i.e. the time and space costs associated with particular data paths)
>
> For SQL query planning, it will take a few years for our framework to become top-notch. However, for Gremlin/Cypher, RQL/SPARQL, MongoQuery/XPath, CQL, etc., I believe we can pull it off with the resources we have on our first release, as these “NoSQL”-systems tend to have simple ‘data paths’, with, arguably, graph and RDF being the most difficult to reason about.
>
> ———
>
> What does the TP4 VM need to know, and how will the various system components (language, processor, structure) provide that information?
>
> I believe we have been talking about this the whole time, except now I am introducing costs.
>     * What are the types of tuples and how do they relate?
>     * How much does it cost to move through these tuple-relations?
>
> pg.graph
>     [data access instructions]
>         V()
>         V(id)
>         V(key,value)
>     [data cost instructions]
>         cost(V())
>         cost(V(id))
> pg.graph.vertex
>     [data access instructions]
>         out()
>         out(string)
>     [data cost instructions]
>         cost(out())
>         cost(out(string))
> …
>
> In other words, for every type of tuple, we need to know what instructions it supports and we need to know the time/space costs of said instructions.
>
> The TP4 VM uses the “data cost”-instructions to construct the query plan.
>     cost(out(‘knows’)) = { space:10345, time:O(1) }      // in-memory graphdb
>     cost(out(‘knows’)) = { space:10345, time:O(log(n)) } // RDBMS-encoded graphdb
> TP4 processors use the “data access”-instructions to process the data.
>     out(‘knows’) -> Iterator<Vertex>
>
> What I showed above was PropertyGraphs as of TP2. What about when labels and schemas are involved? This is where Josh’s concept of “typing” comes into play.
>
> pg.graph
>     pg.graph.vertex.person
>         out(‘knows’)
>         out(‘created’)
>         cost(out(‘knows’))
>         cost(out(‘created’))
>     pg.graph.vertex.project
>     pg.graph.edge.knows
>     pg.graph.edge.created
>     ...
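
As a rough illustration of the “data access”/“data cost” split described above, a structure provider might expose something like the following. This is only a sketch under assumed names (DataPathCosts, Cost); it is not an agreed-upon TP4 interface:

    import java.util.Set;

    // Hypothetical sketch only -- not an agreed-upon TP4 interface.
    // A structure provider advertises which "data access" instructions a tuple
    // type supports and the estimated cost of each ("data cost" instructions).
    public interface DataPathCosts {

        // e.g. {"V()", "V(id)", "out(string)"} for pg.graph / pg.graph.vertex
        Set<String> supportedInstructions();

        // e.g. cost("out", "knows") -> space:10345, time:"O(log(n))"
        Cost cost(String instruction, Object... arguments);

        final class Cost {
            public final long space;             // estimated number of tuples touched
            public final String timeComplexity;  // e.g. "O(1)", "O(log(n))"

            public Cost(final long space, final String timeComplexity) {
                this.space = space;
                this.timeComplexity = timeComplexity;
            }
        }
    }

The VM would then compare such cost estimates across candidate access paths when ordering joins or rewriting bytecode, without ever needing provider-specific strategies.
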
>
> With schema, we of course get more refined statistics….
>
> Thus, I think that we are ultimately trying to create a Multi-Model ADT that exposes data paths (the explicit structure in the data, including auxiliary indices) and data costs (the time/space-statistics of such paths). The TP4 VM uses that information to:
>
>     1. Take unoptimized bytecode from the language provider (easy for them, only operational semantic query planning required).
>     2. Convert that bytecode into an optimized bytecode for the data storage system (easy for them, they only need to say what instructions they support and their costs).
>     3. Submit that bytecode to the processor for execution (easy for them, they simply use their query planner to execute the data-storage-optimized data flow).
>
> Thoughts?,
> Marko.
>
> http://rredux.com
>
>
> > On May 11, 2019, at 9:08 AM, Joshua Shinavier <[email protected]> wrote:
> >
> > Oops, looked back at my email and noticed some nonsense. This:
> >
> >     knows_weight_gt_85(v, e) = knows(v, e) ∧ weight(e, w) ∧ w>0.85
> >
> > should be more like this:
> >
> >     knows_weight_gt_85(_, v, e) = knows(e, v, _) ∧ weight(_, e, w) ∧ w>0.85
> >
> > because we need the identity of edges (and by extension, properties). So we treat edge relations like triples, where the first element of the triple is the identity of the edge. The _ symbol represents "don't care" variables.
> >
> > Josh
> >
> >
> > On Sat, May 11, 2019 at 8:01 AM Joshua Shinavier <[email protected]> wrote:
> >
> >> Hi Marko,
> >>
> >> Responses inline.
> >>
> >>
> >> On Tue, May 7, 2019 at 6:26 AM Marko Rodriguez <[email protected]> wrote:
> >>
> >>> Whoa.
> >>>
> >>> Check out this trippy trick.
> >>>
> >>> First, here is how you define a pointer to a map-tuple.
> >>>
> >>>     *{k1?v1, k2?v2, …, kn?vn}
> >>>         * says “this is a pointer to a map” { }
> >>>         ? is some comparator like =, >, <, !=, contains(), etc.
> >>>
> >>
> >> OK.
> >>
> >>
> >>> Assume the vertex map tuple v[1]:
> >>>
> >>>     {#id:1, #label:person, name:marko, age:29}
> >>>
> >>> Now, we can add the following fields:
> >>>
> >>> 1. #outE:*{#outV=*{#id=1}} // references all tuples that have an outV field that is a pointer to the v[1] vertex tuple.
> >>>
> >>
> >> Yes, I agree. You won't be surprised to hear me say that this boils down to good old select() and project(), e.g.
> >>
> >>     g.V(1).select("*", "out")
> >>
> >> Here, I am using "*" as shorthand for "any matching relation", i.e. anything with an "out" (your "outV") field.
> >>
> >>
> >>> 2. #outE.knows:*{#outV=*{#id=1},#label=knows} // references all outgoing knows-edges.
> >>>
> >>
> >>     g.V(1).select("knows", "out")
> >>
> >>
> >>> 3. #outE.knows.weight_gt_85:*{#outV=*{#id=1},#label=knows,weight>0.85} // references all strong outgoing knows-edges
> >>>
> >>
> >>     g.V(1).select("knows", "out").as("e").select("weight", "out").project("in").is(P.gt(0.85)).back("e")
> >>
> >> I like how you are giving a name to a new relation built up of other relations. In relational calculus, this looks something like:
> >>
> >>     knows_weight_gt_85(v, e) = knows(v, e) ∧ weight(e, w) ∧ w>0.85
> >>
> >> And I do think we should make defining new relations as straightforward as possible in TP4. Compositionality is life.
> >>
> >>
> >>> By using different types of pointers, a graph database provider can make explicit their internal structure.
> >>> Assume all three fields above are in the v[1] vertex tuple. This means that:
> >>>
> >>>     1. all of v[1]’s outgoing edges are grouped together. <— linear scan
> >>>
> >>
> >> By convention, a relation could be indexed left to right, so knows_weight_gt_85(v, e) would express exactly that.
> >>
> >>
> >>>     2. all of v[1]’s outgoing knows-edges are grouped together. <— indexed by label
> >>>
> >>
> >> Same.
> >>
> >>
> >>>     3. all of v[1]’s strong outgoing knows-edges are grouped together. <— indexed by label and weight
> >>>
> >>
> >> Yep.
> >>
> >>
> >>> Thus, a graph database provider can describe the way in which it internally organizes adjacent edges — i.e. vertex-centric indices!
> >>
> >> This looks like convergence.
> >>
> >>
> >>> This means then that TP4 can do vertex-centric index optimizations automatically for providers!
> >>>
> >>
> >> Ex-actly.
> >>
> >>
> >>>     1. values(“#outE”).hasLabel(‘knows’).has(‘weight’,gt(0.85))  // grab all edges, then filter on label, then filter on weight.
> >>>     2. values(“#outE.knows”).has(‘weight’,gt(0.85))              // grab all knows-edges, then filter on weight.
> >>>     3. values(“#outE.knows.weight_gt_85”)                        // grab all strong knows-edges.
> >>>
> >>> *** Realize that Gremlin outE() will just compile to bytecode values(“#outE”).
> >>>
> >>> Freakin’ crazy! … Josh was interested in using the n-tuple structure to describe indices. I was against it. I believe I still am. However, this is pretty neat. As Josh was saying though, with a rich enough n-tuple description of the underlying database, there should be no reason for providers to have to write custom strategies and instructions ?!?!?!?!? crazy!?
> >>>
> >>
> >> I think we might not mean the same thing by "indices", so maybe we just don't get hung up on that term, but we are on the same page w.r.t. what you wrote in this email. What's more, these indices... ok, maybe we do need to call them indices... can be relations of more than two variables. See the geospatial index example from my previous email. A vertex-centric index is to an edge what a generic index is to a hyperedge.
> >>
> >> Josh
> >>
> >>
> >>>
> >>> Marko.
> >>>
> >>> http://rredux.com
> >>>
> >>>
> >>>> On May 7, 2019, at 4:44 AM, Marko Rodriguez <[email protected]> wrote:
> >>>>
> >>>> Hey Josh,
> >>>>
> >>>>> I think of your Pointer<T> as a reference to an entity. It does not contain the entity it refers to, but it contains the primary key of that entity.
> >>>>
> >>>> Exactly! I was just thinking that last night. Tuples don’t need a separate ID system. No -- pointers reference the primary key of a tuple! Better yet perhaps, they can reference one-to-many. For instance:
> >>>>
> >>>>     { id:1, label:person, name:marko, age:29, outE:*(outV=id) }
> >>>>
> >>>> Thus, a pointer is defined by a pattern match. Haven’t thought through the consequences, but … :)
> >>>>
> >>>>> Here, I have invented an Entity class to indicate that the pointer resolves to a vertex (an entity without a tuple, or rather with a 0-tuple -- the unit element).
> >>>>
> >>>> Ah — the 0-tuple. Neat thought.
> >>>>
> >>>> I look forward to your slides from the Knowledge Graph Conference. If I wasn’t such a reclusive hermit, I would have loved to have joined you there.
> >>>>
> >>>> Take care,
> >>>> Marko.
> >>>>
> >>>> http://rredux.com
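
One possible (purely hypothetical) reading of the outE:*(outV=id) idea above: a pointer carries a predicate and resolves, one-to-many, to every tuple that matches it. The class and method names below are illustrative only, not anything from the gist:

    import java.util.List;
    import java.util.Map;
    import java.util.function.Predicate;
    import java.util.stream.Collectors;

    // Hypothetical sketch of a "pointer defined by a pattern match".
    public final class PatternPointer {

        private final Predicate<Map<String, Object>> pattern;

        public PatternPointer(final Predicate<Map<String, Object>> pattern) {
            this.pattern = pattern;
        }

        // Resolve the pointer to every tuple matching the pattern (one-to-many).
        // A real provider would answer this from an index rather than a scan.
        public List<Map<String, Object>> resolve(final List<Map<String, Object>> tuples) {
            return tuples.stream().filter(this.pattern).collect(Collectors.toList());
        }
    }

    // Usage: v[1].outE resolves to all tuples whose outV field references id 1.
    //   PatternPointer outE = new PatternPointer(t -> Integer.valueOf(1).equals(t.get("outV")));
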
> >>>>
> >>>>> On Mon, May 6, 2019 at 9:38 PM Marko Rodriguez <[email protected]> wrote:
> >>>>>
> >>>>>> Hey Josh,
> >>>>>>
> >>>>>>> I am feeling the tuples... as long as they can be typed, e.g.
> >>>>>>>
> >>>>>>>     <V> myTuple.get(Integer) -- int-indexed tuples
> >>>>>>>     <V> myTuple.get(String) -- string-indexed tuples
> >>>>>>>
> >>>>>>> In most programming languages, "tuples" are not lists, though they are typed by a list of element types. E.g. in Haskell you might have a tuple with the type
> >>>>>>>
> >>>>>>>     (Double, Double, Bool)
> >>>>>>
> >>>>>> Yes, we have Pair<A,B>, Triple<A,B,C>, Quadruple<A,B,C,D>, etc. However, for a base Tuple<A> of unknown length, the best I can do in Java is <A>. :|
> >>>>>> You can see my stubs in the gist: https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8 (LINES #21-42)
> >>>>>>
> >>>>>>> If this is in line with your proposal, then we agree that tuples should be the atomic unit of data in TP4.
> >>>>>>
> >>>>>> Yep. Vertices, Edges, Rows, Documents, etc. are all just tuples. However, I suspect that we will disagree on some of my tweaks. Thus, I’d really like to get your feedback on:
> >>>>>>
> >>>>>>     1. pointers (tuple entries referencing tuples).
> >>>>>>     2. sequences (multi-value tuple entries).
> >>>>>>     3. # hidden map keys :|
> >>>>>>         - sorta ghetto.
> >>>>>>
> >>>>>> Also, I’m still not happy with db().has().has().as(‘x’).db().where()… it’s an intense syntax and it’s hard to strategize.
> >>>>>>
> >>>>>> I really want to nail down this “universal model” (tuple structure and tuple-oriented instructions), as then I can get back on the codebase and start to flesh this stuff out with confidence.
> >>>>>>
> >>>>>> See ya,
> >>>>>> Marko.
> >>>>>>
> >>>>>> http://rredux.com
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Josh
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, May 6, 2019 at 5:34 PM Marko Rodriguez <[email protected]> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I spent this afternoon playing with n-tuples, pointers, data model interfaces, and bytecode instructions.
> >>>>>>>
> >>>>>>> https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8
> >>>>>>>
> >>>>>>> *** Kuppitz: They are tuples :). A Map<K,V> extends Tuple<Pair<K,V>>. Tada!
> >>>>>>>
> >>>>>>> What I like about this is that it combines the best of both worlds (Josh+Marko).
> >>>>>>>     * just flat tuples of arbitrary length.
> >>>>>>>     * pattern matching for arbitrary joins. (k1=k2 AND k3=k4 …)
> >>>>>>>     * pointer chasing for direct links. (edges, foreign keys, document _id references, URI resolutions, …)
> >>>>>>>     * sequences are a special type of tuple used for multi-valued entries.
> >>>>>>>     * has()/values()/etc. work on all tuple types! (maps, lists, tuples, vertices, edges, rows, statements, documents, etc.)
> >>>>>>>
> >>>>>>> Thoughts?,
> >>>>>>> Marko.
> >>>>>>>
> >>>>>>> http://rredux.com
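
To make the “everything is a tuple” idea above concrete, here is a minimal, purely illustrative sketch of a flat, string-indexed tuple. It is not the actual TP4 code; the real stubs live in the gist linked above:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal sketch of a flat, string-indexed n-tuple (illustrative only).
    public class Tuple {

        private final Map<String, Object> fields = new LinkedHashMap<>();

        public Tuple put(final String key, final Object value) {
            this.fields.put(key, value);
            return this;
        }

        // String-indexed access: the same call works whether the tuple is playing
        // the role of a map, vertex, edge, row, statement, or document.
        @SuppressWarnings("unchecked")
        public <V> V get(final String key) {
            return (V) this.fields.get(key);
        }
    }

    // Usage: a vertex is then just a tuple with some "#"-hidden keys.
    //   Tuple v1 = new Tuple().put("#id", 1).put("#label", "person").put("name", "marko").put("age", 29);
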
