Re: Semantic overlap between Match and Where

Marko Rodriguez Wed, 24 Jun 2015 16:58:26 -0700

Hi,

So is g.V. With g.V.has('age',lt(60)) you at least get some filtering before going on to 
x.out('knows').
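
To make the trade-off in this thread concrete, here is a toy sketch in plain Python (not Gremlin, and not how TinkerPop or any vendor actually plans a traversal). It evaluates the pattern "x.age < 60, x knows y, y.name == 'marko'" in three orders: a bare scan, a scan with the age filter applied first, and a start from a name index on y. The data, the NAME_INDEX lookup, and the "work" counter (one unit per vertex read or edge followed) are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical data: id -> (name, age), and (x, y) meaning "x knows y".
PEOPLE = {1: ("marko", 29), 2: ("vadas", 27), 3: ("josh", 32),
          4: ("peter", 35), 5: ("ann", 70), 6: ("bob", 65)}
KNOWS = [(2, 1), (3, 1), (5, 1), (4, 3), (4, 2), (6, 5), (6, 4)]

OUT, IN = defaultdict(list), defaultdict(list)
for x, y in KNOWS:
    OUT[x].append(y)
    IN[y].append(x)

# Assumed vendor-side index on the 'name' property.
NAME_INDEX = defaultdict(list)
for vid, (name, _age) in PEOPLE.items():
    NAME_INDEX[name].append(vid)


def full_scan():
    """Like a bare g.V: read every vertex, follow every knows edge."""
    work, result = 0, set()
    for x in PEOPLE:
        work += 1
        for y in OUT[x]:
            work += 1
            if PEOPLE[x][1] < 60 and PEOPLE[y][0] == "marko":
                result.add(x)
    return result, work


def age_filter_first():
    """Like g.V.has('age',lt(60)): still a scan, but fewer edges followed."""
    work, result = 0, set()
    for x in PEOPLE:
        work += 1
        if PEOPLE[x][1] >= 60:
            continue
        for y in OUT[x]:
            work += 1
            if PEOPLE[y][0] == "marko":
                result.add(x)
    return result, work


def name_index_first():
    """Start from the selective name lookup on y, walk knows backwards."""
    work, result = 0, set()
    for y in NAME_INDEX["marko"]:
        work += 1
        for x in IN[y]:
            work += 1
            if PEOPLE[x][1] < 60:
                result.add(x)
    return result, work


if __name__ == "__main__":
    for plan in (full_scan, age_filter_first, name_index_first):
        result, work = plan()
        print(plan.__name__, sorted(result), work)
```

On this toy data all three orders return the same two vertices, while the work drops from 13 to 10 to 4. Both points made in the thread show up: the pulled-out age filter beats a bare scan, and a selective index lookup on y beats both when one is available.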


?,
Marko.

http://markorodriguez.com

On Jun 24, 2015, at 5:52 PM, Matthias Broecheler <[email protected]> wrote:

> I think you misunderstood my example. Take another look -
> as('x').has('age',lt(60))
> has the same start label as the match label. The name constraint applies to
> "y". This means, only the age constraint would be pulled out. And that
> would be bad because finding all vertices that have an age less than 60 can
> take forever.
> 
> On Wed, Jun 24, 2015 at 4:49 PM Marko Rodriguez <[email protected]>
> wrote:
> 
>> Hello,
>> 
>> Again, it only pulls it out of the match()-step if the has()-part start
>> label is the same as the match() start label. Moreover, if there is no
>> start label to the match()-step, then nothing is pulled out. In your case,
>> the as('y').has('name','marko') can't "go first" as you need to bind "y"
>> first.
>> 
>> Now, if you had as("x").has("name","marko") as well as
>> as("x").has("age",lt(60)), both would be pulled out and thus, available to
>> the vendor for index lookups as they please.
>> 
>> Marko.
>> 
>> http://markorodriguez.com
>> 
>> On Jun 24, 2015, at 5:32 PM, Matthias Broecheler <[email protected]> wrote:
>> 
>>> Consider this example:
>>> 
>>> g.V.match("x", as('x').has('age',lt(60)), as('x').out('knows').as('y'),
>>> as('y').has('name','marko'))
>>> 
>>> In this case the age constraint would be pulled out if I understand
>>> correctly. But this constraint has very poor selectivity in particular
>>> compared to the has('name','marko') constraint on 'y'. So, the better way
>>> to execute this match would be to start by retrieving all markos, finding
>>> the people who know them and then filter those by age.
>>> However, that is not possible if you pull out the age constraint.
>>> 
>>> 
>>> On Wed, Jun 24, 2015 at 4:11 PM Marko Rodriguez <[email protected]>
>>> wrote:
>>> 
>>>> Hi Matthias,
>>>> 
>>>> So the has()-container "pulling" only happens if a startLabel is provided
>>>> (i.e. match("x", as("x").has("name","matthias"))). And in that case, I can't
>>>> imagine it ever not being desired, because if you leave it in MatchStep, then
>>>> you have one more pattern to order, keep runtime statistics on, cycle through
>>>> to determine if a match has occurred, dedup on, and one more pattern
>>>> label to add to each match, etc. By pulling out the has()-container, you
>>>> can reduce the overhead in MatchStep. Finally, while I said it was "for
>>>> vendor indexing," it's really not just about that because if the vendor
>>>> can't use it for indexing, it's still good to have it outside the match()
>>>> for the stated reasons.
>>>> 
>>>> Hope that is clear,
>>>> Marko.
>>>> 
>>>> http://markorodriguez.com
>>>> 
>>>> On Jun 19, 2015, at 12:07 PM, Matthias Broecheler <[email protected]>
>>>> wrote:
>>>> 
>>>>> Hi Marko,
>>>>> 
>>>>> is it possible to disable pulling out the has-containers? For many graphdb
>>>>> vendors it would make sense to leave the has-containers in the match step
>>>>> and then select those has-containers that promise the highest selectivity
>>>>> for index calls based on the index statistics. Since TP3 isn't aware of
>>>>> indexes, it couldn't make such a call.
>>>>> 
>>>>> Thanks,
>>>>> Matthias
>>>>> 
>>>>> On Fri, Jun 19, 2015 at 10:42 AM Marko Rodriguez <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> So, this morning I realized something neat about MatchStep<->WhereStep
>>>>>> interplay.
>>>>>> 
>>>>>> First, MatchWhereStrategy is now called MatchPredicateStrategy as it is
>>>>>> about moving predicates in and out of match().
>>>>>>      - where()s go in.
>>>>>> https://github.com/apache/incubator-tinkerpop/blob/2e3a25c318136b7f6c1aec5fae2c0c1b950fb3f9/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/strategy/optimization/MatchPredicateStrategy.java#L69
>>>>>>      - has() containers go out.
>>>>>> https://github.com/apache/incubator-tinkerpop/blob/2e3a25c318136b7f6c1aec5fae2c0c1b950fb3f9/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/strategy/optimization/MatchPredicateStrategy.java#L80
>>>>>> 
>>>>>> Next, the question about "predicate traversals" in MatchStep is solved by
>>>>>> simply saying:
>>>>>>      "If you want a predicate traversal, use a where()-clause in your
>>>>>> pattern."
>>>>>> 
>>>>>> That's it! Let's look at what I mean by that (Josh and Daniel will
>>>>>> understand the ramifications best).
>>>>>> 
>>>>>> gremlin> g.V().match('a',
>>>>>> __.as('a').out('created').as('b'),
>>>>>> __.as('a').repeat(out()).times(2))
>>>>>> ==>[a:v[1], b:v[3]]
>>>>>> ==>[a:v[1], b:v[3]]
>>>>>> 
>>>>>> The above match() returns duplicates. Why? Because the second pattern
>>>>>> isn't binding, it's just "checking" -- that is, it passes the traverser
>>>>>> through and if that traverser splits, well, there are more traversers
>>>>>> returned. In the original MatchStep, these were called "predicate
>>>>>> traversals" because they did not bind variables (i.e. no as() at the end).
>>>>>> As such, their output didn't matter. However, in the new MatchStep, I can't
>>>>>> do that so easily given the OLAP constraint. However, if you want
>>>>>> "predicate traversal" behavior, use WhereStep!
>>>>>> 
>>>>>> g.V().match('a',
>>>>>> __.as('a').out('created').as('b'),
>>>>>> __.where(__.as('a').repeat(out()).times(2))
>>>>>> )
>>>>>> ==>[a:v[1], b:v[3]]
>>>>>> 
>>>>>> So, if you don't care about the result of a pattern, only if it
>>>>>> "hasNext()" (which is much faster than "iterate()"), then wrap it in a
>>>>>> where() and there you go. Not only is this way more efficient as you are
>>>>>> not generating traversers (i.e. results), you are also not creating
>>>>>> duplicate results (i.e. traversers with similar path histories).
>>>>>> 
>>>>>> Finally, note you can also do this for a nice look and feel:
>>>>>> 
>>>>>> g.V().match('a',
>>>>>> __.as('a').out('created').as('b'),
>>>>>> __.as('a').where(repeat(out()).times(2))
>>>>>> )
>>>>>> ==>[a:v[1], b:v[3]]
>>>>>> 
>>>>>> So what's the catch? Why not just wrap all match patterns without an
>>>>>> end-label step in where()? Two reasons:
>>>>>>      1. Semantics. MatchStep is a set of traversals where the traverser
>>>>>> is pushed into the traversals and when there are no more traversals to
>>>>>> take, it goes to the next step. It's not a filter-step, it's a map-step.
>>>>>>      2. OLAP. WhereStep's internal traversal is a "local child" and
>>>>>> thus, can only compute as far as the local star graph in OLAP. Typically,
>>>>>> any step that needs to know what happened at the end of an internal
>>>>>> traversal (filter or not) has to be locally bound. … this is the
>>>>>> fundamental difference between Gremlin OLAP and Gremlin OLTP.
>>>>>> 
>>>>>> Finally finally….the last big issue I was having was "not()" inside Match.
>>>>>> Again, because MatchStep uses "global children", it can't know what
>>>>>> happened to the traverser once it enters a pattern. And steps like NOT need
>>>>>> to know if the traverser was filtered. Well, not() in where() works great:
>>>>>> 
>>>>>> g.V().as('a').out('created').
>>>>>> where(__.in('created').count().is(gt(1))).values('name')
>>>>>> ==>lop
>>>>>> ==>lop
>>>>>> ==>lop
>>>>>> g.V().as('a').out('created').
>>>>>> where(__.not(__.in('created').count().is(gt(1)))).values('name') // it sucks that groovy requires not and in to have __.
>>>>>> ==>ripple
>>>>>> 
>>>>>> And guess what, if you want to NOT a pattern in match(), do it via where()!
>>>>>> 
>>>>>> g.V().match('a',
>>>>>> __.as('a').out('created').as('b'),
>>>>>> __.as('b').where(__.in('created').count().is(gt(1)))).
>>>>>>  select().by('name')
>>>>>> ==>[a:marko, b:lop]
>>>>>> ==>[a:josh, b:lop]
>>>>>> ==>[a:peter, b:lop]
>>>>>> g.V().match('a',
>>>>>> __.as('a').out('created').as('b'),
>>>>>> __.as('b').where(__.not(__.in('created').count().is(gt(1))))).
>>>>>>  select().by('name')
>>>>>> ==>[a:josh, b:ripple]
>>>>>> 
>>>>>> And there we go. MatchPredicateStrategy can just throw where()-steps into
>>>>>> MatchStep as is and the issue of "predicate traversals" is no longer an
>>>>>> issue.
>>>>>> 
>>>>>> Thanks for reading,
>>>>>> Marko.
>>>>>> 
>>>>>> http://markorodriguez.com
>>>>>> 
>>>>>> On Jun 17, 2015, at 4:25 PM, Marko Rodriguez <[email protected]> wrote:
>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> To extend on Kuppitz' comment -- Yes, MatchWhereStrategy folds in
>>>>>>> where()-clauses. Note that with the recent work on XMatchStep (if we go
>>>>>>> with that for GA), where() clauses work natively in XMatchStep and we will
>>>>>>> also just fold any "right handed" where()-clauses into match() as well.
>>>>>>> 
>>>>>>> Marko.
>>>>>>> 
>>>>>>> http://markorodriguez.com
>>>>>>> 
>>>>>>> On Jun 17, 2015, at 3:04 PM, Daniel Kuppitz <[email protected]> wrote:
>>>>>>> 
>>>>>>>> After actually looking into the docs, I decided to keep the example, since
>>>>>>>> the description explicitly states that in such a case the where() clause
>>>>>>>> will automatically be folded into match():
>>>>>>>> 
>>>>>>>>> The where()-step can take either a BiPredicate (first example below) or a
>>>>>>>>> Traversal (second example below). Using MatchWhereStrategy, where()-clauses
>>>>>>>>> can be automatically folded into match() and thus, subject to match()-steps
>>>>>>>>> budget-match algorithm.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> The sample then shows that
>>>>>>>> 
>>>>>>>> g.V().match('a',
>>>>>>>> __.as('a').out('created').as('b'),
>>>>>>>> __.as('b').in('created').as('c')).
>>>>>>>> where(__.as('a').out('knows').as('c')).
>>>>>>>> select('a','c').by('name')
>>>>>>>> 
>>>>>>>> 
>>>>>>>> is - after the MatchWhereStrategy was applied (this is done automatically)
>>>>>>>> - in fact the same thing as:
>>>>>>>> 
>>>>>>>> g.V().match('a',
>>>>>>>> __.as('a').out('created').as('b'),
>>>>>>>> __.as('a').out('knows').as('c'),
>>>>>>>> __.as('b').in('created').as('c')).
>>>>>>>> select('a','c').by('name')
>>>>>>>> 
>>>>>>>> ....
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Daniel
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Jun 17, 2015 at 10:47 PM, Daniel Kuppitz <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> You're right. It's actually a pretty good example for where(), but not
>>>>>>>>> for match()/where(). I will remove it and make sure that we have
>>>>>>>>> something similar in the where() sample section. Something like:
>>>>>>>>> 
>>>>>>>>> g.V().as("a").out("created").as("b").in("created").as("c").
>>>>>>>>> where(__.as("a").out("knows").as("c")).select().by("name")
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> Daniel
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Jun 17, 2015 at 8:43 PM, Matthias Broecheler <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi guys,
>>>>>>>>>> 
>>>>>>>>>> looking at the second example in the following section of the docs I
>>>>>>>>>> noticed a semantic overlap between match and where:
>>>>>>>>>> 
>>>>>>>>>> http://www.tinkerpop.com/docs/3.0.0-SNAPSHOT/#using-where-with-match
>>>>>>>>>> 
>>>>>>>>>> traversal = g.V().match('a',
>>>>>>>>>>   __.as('a').out('created').as('b'),
>>>>>>>>>>   __.as('b').in('created').as('c')).
>>>>>>>>>>   where(__.as('a').out('knows').as('c')).
>>>>>>>>>>   select('a','c').by('name');
>>>>>>>>>> 
>>>>>>>>>> The provided where clause could also have been folded into the actual
>>>>>>>>>> traversal to yield the same result.
>>>>>>>>>> I wonder:
>>>>>>>>>> 1) Is there a way to avoid this ambiguity?
>>>>>>>>>> 2) Or should we simply not promote it in the docs? As the docs are
>>>>>>>>>> currently written I am worried that users might get confused as to how
>>>>>>>>>> match steps are supposed to be written.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Matthias
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>>
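
The difference described in the quoted Jun 19 message, between a match() pattern that passes traversers through and one wrapped in where(), can be sketched outside Gremlin. The following is a plain-Python emulation, not TinkerPop code. The edge list is assumed to be the TinkerPop "modern" toy graph that the quoted console output appears to use, and the helper names (out, two_hops, match_pass_through, match_with_where) are invented for this sketch.

```python
# Assumed edges of the TinkerPop "modern" toy graph: (source, label, target).
EDGES = [(1, "knows", 2), (1, "knows", 4), (1, "created", 3),
         (4, "created", 5), (4, "created", 3), (6, "created", 3)]
VERTICES = [1, 2, 3, 4, 5, 6]


def out(v, label=None):
    """Targets of v's outgoing edges, optionally restricted to one label."""
    return [dst for src, lbl, dst in EDGES
            if src == v and (label is None or lbl == label)]


def two_hops(v):
    """Emulates repeat(out()).times(2): yields one item per two-step path."""
    for u in out(v):
        yield from out(u)


def match_pass_through():
    """Second pattern is not wrapped: every two-step path yields a row."""
    rows = []
    for a in VERTICES:
        for b in out(a, "created"):
            for _ in two_hops(a):
                rows.append({"a": a, "b": b})
    return rows


def match_with_where():
    """Second pattern wrapped in where(): only 'is there a next?' is asked."""
    rows = []
    for a in VERTICES:
        for b in out(a, "created"):
            # Pull at most one item, like hasNext(); vertex ids are never None.
            if next(two_hops(a), None) is not None:
                rows.append({"a": a, "b": b})
    return rows


if __name__ == "__main__":
    print(match_pass_through())
    print(match_with_where())
```

Under these assumptions the pass-through version returns the a=1, b=3 row twice, because vertex 1 has two two-step paths, while the where() version returns it once and stops after the first path it finds.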
