Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Taewoo Kim Thu, 07 Jan 2016 22:08:53 -0800

Interesting. Can you be more specific?

Best,
Taewoo


On Thu, Jan 7, 2016 at 6:36 PM, Yingyi Bu <[email protected]> wrote:

> Sorry, to be more precise:
> Left-outer hash join cannot preserve all local data properties for its
> probe branch (because spilling can happen) but can preserve (or downgrade)
> some when certain conditions meet.
>
> On Thu, Jan 7, 2016 at 6:28 PM, Yingyi Bu <[email protected]> wrote:
>
> > In the logical/physical query plan, I think it is statically determined.
> > However, that doesn't mean the execution is faithful to that probe/build
> > decision because we have the "role reversal" optimization for inner hash
> > joins:-)
> > (That's also why our inner hash join cannot maintain any local data
> > property from its probe branch, but left-outer hash can preserve that.)
> >
> > Best,
> > Yingyi
> >
> >
> > On Thu, Jan 7, 2016 at 5:34 PM, Mike Carey <[email protected]> wrote:
> >
> >> Also, do we (separately want to make sure that our hash join behavior is
> >> comparable - i.e., that the initial build/probe decision is statically
> >> determined from the query?  (I think we do want that, and I think it is
> in
> >> fact that way - but I'm not 100% sure, as its been awhile since that was
> >> discussed, and it's not in-cache for me. :-))
> >>
> >>
> >> On 1/7/16 3:22 PM, Taewoo Kim wrote:
> >>
> >>> Thanks Yingyi. Yes. If there is an equality condition and if we can't
> >>> transform a join into an index-nested loop join, then a hybrid hash
> join
> >>> will be used.
> >>>
> >>> Best,
> >>> Taewoo
> >>>
> >>> On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <[email protected]> wrote:
> >>>
> >>> +1!
> >>>>
> >>>>> 3. We only try to use applicable indexes from the inner branch. So,
> if
> >>>>>> there are no applicable indexes from the inner branch, we abort
> >>>>>> transforming a join into an index-nested-loop join.
> >>>>>>
> >>>>> "index-nested-loop join" -> "hybrid hash join"?
> >>>>
> >>>> Thanks!
> >>>>
> >>>> Yingyi
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <[email protected]>
> wrote:
> >>>>
> >>>> Hello dev,
> >>>>>
> >>>>> Regarding this issue (ASTERIXDB-1249), I would like to make an
> >>>>> index-join
> >>>>> hint clarification. Let's start with an example query other than a
> >>>>> self-join query in this issue.
> >>>>>
> >>>>> for $c in dataset('Customers')
> >>>>>
> >>>>> for $o in dataset('Orders')
> >>>>>
> >>>>> where $c.cid /*+ indexnl */ = $o.cid
> >>>>>
> >>>>> order by $c.cid, $o.oid
> >>>>>
> >>>>> return {"cid":$c.cid, "oid": $o.oid}
> >>>>>
> >>>>> Right now, in the master branch, the first dataset (e.g., Customers)
> >>>>> becomes the outer branch and the second dataset (e.g., Orders)
> becomes
> >>>>>
> >>>> the
> >>>>
> >>>>> inner branch. And, when the optimizer tries to honor the given
> indexnl
> >>>>>
> >>>> hint
> >>>>
> >>>>> (transforming a join into an index-nested-loop join), if there are
> >>>>> applicable indexes from the inner branch (e.g., Orders), then it is
> >>>>> going
> >>>>> to use one of those indexes. If there are no applicable indexes from
> >>>>> the
> >>>>> inner branch, it tries to use indexes from the outer branch (e.g.,
> >>>>> Customers). We are going to change the last part; we will not use
> >>>>> indexes
> >>>>> from the outer branch. So, the following are refined rules for
> >>>>>
> >>>> transforming
> >>>>
> >>>>> a join into an index-nested-loop join.
> >>>>>
> >>>>> 1. The first dataset in a join (the first parameter of the given
> join)
> >>>>> becomes the outer branch.
> >>>>> 2. The second dataset in a join (the second parameter of the given
> >>>>> join)
> >>>>> becomes the inner branch.
> >>>>> 3. We only try to use applicable indexes from the inner branch. So,
> if
> >>>>> there are no applicable indexes from the inner branch, we abort
> >>>>> transforming a join into an index-nested-loop join.
> >>>>> 4. Variable order in the given join predicate is not important. It
> can
> >>>>> be
> >>>>> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.
> >>>>>
> >>>>> So, for the left-outer join and inner join altogether, the left
> subtree
> >>>>>
> >>>> is
> >>>>
> >>>>> the probing side and the right subtree is the index side. So, this
> can
> >>>>> be
> >>>>> applied to the self-join case, too just like the following query in
> >>>>> this
> >>>>> issue. In the following query, $t1 becomes the outer and $t2 becomes
> >>>>> the
> >>>>> inner.
> >>>>>
> >>>>> for $t1 in dataset('TweetMessages')
> >>>>>
> >>>>> for $t2 in dataset('TweetMessages')
> >>>>>
> >>>>> let $c := $t1.countA + 20
> >>>>>
> >>>>> where $c /* +indexnl */= $t2.countB
> >>>>>
> >>>>> order by $t2.tweetid
> >>>>>
> >>>>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
> >>>>>
> >>>>> Thank you. Any opinion would be appreciated before I finalize this
> fix.
> >>>>>
> >>>>> Best,
> >>>>> Taewoo
> >>>>>
> >>>>> ---------- Forwarded message ----------
> >>>>> From: Yingyi Bu (JIRA) <[email protected]>
> >>>>> Date: Wed, Jan 6, 2016 at 10:23 AM
> >>>>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses
> >>>>>
> >>>> wrong
> >>>>
> >>>>> probe/index branch
> >>>>> To: [email protected]
> >>>>>
> >>>>>
> >>>>>
> >>>>>      [
> >>>>>
> >>>>>
> >>>>>
> >>>>
> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
> >>>>
> >>>>> ]
> >>>>>
> >>>>> Yingyi Bu commented on ASTERIXDB-1249:
> >>>>> --------------------------------------
> >>>>>
> >>>>> That's right. Basically in AcessMethod implementations, e.g.:
> >>>>>
> >>>>>
> >>>>>
> >>>>
> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
> >>>>
> >>>>> Choosing probe/index branch is only based on dataset names, instead
> of
> >>>>> being based on the join condition.
> >>>>>
> >>>>> Self index join chooses wrong probe/index branch
> >>>>>> ------------------------------------------------
> >>>>>>
> >>>>>>                  Key: ASTERIXDB-1249
> >>>>>>                  URL:
> >>>>>>
> >>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249
> >>>>>
> >>>>>>              Project: Apache AsterixDB
> >>>>>>           Issue Type: Bug
> >>>>>>           Components: Optimizer
> >>>>>>             Reporter: Yingyi Bu
> >>>>>>             Assignee: Taewoo Kim
> >>>>>>
> >>>>>> DDLs:
> >>>>>> {noformat}
> >>>>>> drop dataverse test if exists;
> >>>>>> create dataverse test;
> >>>>>> use dataverse test;
> >>>>>> create type TwitterUserType as closed {
> >>>>>>      screen-name: string,
> >>>>>>      lang: string,
> >>>>>>      friends-count: int64,
> >>>>>>      statuses-count: int64,
> >>>>>>      name: string,
> >>>>>>      followers-count: int64
> >>>>>> }
> >>>>>> create type TweetMessageType as closed {
> >>>>>>      tweetid: int64,
> >>>>>>          user: TwitterUserType,
> >>>>>>          sender-location: point,
> >>>>>>      send-time: datetime,
> >>>>>>          referred-topics: {{ string }},
> >>>>>>      message-text: string,
> >>>>>>      countA: int64,
> >>>>>>      countB: int64
> >>>>>> }
> >>>>>> create dataset TweetMessages(TweetMessageType)
> >>>>>> primary key tweetid;
> >>>>>> create index twmSndLocIx on TweetMessages(sender-location) type
> rtree;
> >>>>>> create index msgCountAIx on TweetMessages(countA) type btree;
> >>>>>> create index msgCountBIx on TweetMessages(countB) type btree;
> >>>>>> create index msgTextIx on TweetMessages(message-text) type keyword;
> >>>>>> {noformat}
> >>>>>> Query:
> >>>>>> {noformat}
> >>>>>> for $t1 in dataset('TweetMessages')
> >>>>>> for $t2 in dataset('TweetMessages')
> >>>>>> let $n :=  create-circle($t1.sender-location, 0.5)
> >>>>>> where spatial-intersect($t2.sender-location, $n)
> >>>>>> order by $t2.tweetid
> >>>>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
> >>>>>> {noformat}
> >>>>>> Optimized plan:
> >>>>>> {noformat}
> >>>>>> distribute result [%0->$$10]
> >>>>>> -- DISTRIBUTE_RESULT  |PARTITIONED|
> >>>>>>    exchange
> >>>>>>    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>      project ([$$10])
> >>>>>>      -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>        assign [$$10] <- [function-call:
> >>>>>>
> >>>>> asterix:closed-record-constructor,
> >>>>
> >>>>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
> >>>>>
> >>>>>>        -- ASSIGN  |PARTITIONED|
> >>>>>>          exchange
> >>>>>>          -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
> >>>>>>            order (ASC, %0->$$15)
> >>>>>>            -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
> >>>>>>              exchange
> >>>>>>              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>                project ([$$13, $$15])
> >>>>>>                -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>                  select (function-call: asterix:spatial-intersect,
> >>>>>>
> >>>>> Args:[%0->$$13, function-call: asterix:create-circle,
> >>>>>
> >>>> Args:[function-call:
> >>>>
> >>>>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}], ADouble:
> >>>>> {0.5}]])
> >>>>>
> >>>>>>                  -- STREAM_SELECT  |PARTITIONED|
> >>>>>>                    project ([$$0, $$13, $$15])
> >>>>>>                    -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>                      exchange
> >>>>>>                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>                        unnest-map [$$14, $$0] <- function-call:
> >>>>>>
> >>>>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0},
> >>>>>
> >>>> AString:
> >>>>
> >>>>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean:
> {false},
> >>>>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27,
> TRUE,
> >>>>> TRUE, TRUE]
> >>>>>
> >>>>>>                        -- BTREE_SEARCH  |PARTITIONED|
> >>>>>>                          exchange
> >>>>>>                          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>                            order (ASC, %0->$$27)
> >>>>>>                            -- STABLE_SORT [$$27(ASC)]  |PARTITIONED|
> >>>>>>                              exchange
> >>>>>>                              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>                                project ([$$27, $$13, $$15])
> >>>>>>                                -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>                                  exchange
> >>>>>>                                  -- ONE_TO_ONE_EXCHANGE
> |PARTITIONED|
> >>>>>>                                    unnest-map [$$23, $$24, $$25,
> $$26,
> >>>>>>
> >>>>> $$27] <- function-call: asterix:index-search, Args:[AString:
> >>>>>
> >>>> {twmSndLocIx},
> >>>>
> >>>>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean:
> >>>>> {true},
> >>>>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19, %0->$$20,
> >>>>> %0->$$21, %0->$$22]
> >>>>>
> >>>>>>                                    -- RTREE_SEARCH  |PARTITIONED|
> >>>>>>                                      exchange
> >>>>>>                                      -- BROADCAST_EXCHANGE
> >>>>>>
> >>>>> |PARTITIONED|
> >>>>
> >>>>>                                        assign [$$19, $$20, $$21,
> $$22]
> >>>>>>
> >>>>> <-
> >>>>
> >>>>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
> >>>>> AInt32:
> >>>>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
> >>>>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13,
> >>>>> AInt32:
> >>>>> {2}, AInt32: {2}], function-call: asterix:create-mbr, Args:[%0->$$13,
> >>>>> AInt32: {2}, AInt32: {3}]]
> >>>>>
> >>>>>>                                        -- ASSIGN  |PARTITIONED|
> >>>>>>                                          project ([$$13, $$15])
> >>>>>>                                          -- STREAM_PROJECT
> >>>>>>
> >>>>> |PARTITIONED|
> >>>>
> >>>>>                                            assign [$$13] <-
> >>>>>>
> >>>>> [function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32:
> >>>>>
> >>>> {2}]]
> >>>>
> >>>>>                                            -- ASSIGN  |PARTITIONED|
> >>>>>>                                              exchange
> >>>>>>                                              -- ONE_TO_ONE_EXCHANGE
> >>>>>>
> >>>>> |PARTITIONED|
> >>>>>
> >>>>>>                                                data-scan []<-[$$15,
> >>>>>> $$1]
> >>>>>>
> >>>>> <- test:TweetMessages
> >>>>>
> >>>>>>                                                -- DATASOURCE_SCAN
> >>>>>>
> >>>>> |PARTITIONED|
> >>>>>
> >>>>>>                                                  exchange
> >>>>>>                                                  --
> >>>>>> ONE_TO_ONE_EXCHANGE
> >>>>>>
> >>>>> |PARTITIONED|
> >>>>>
> >>>>>>
> empty-tuple-source
> >>>>>>                                                    --
> >>>>>> EMPTY_TUPLE_SOURCE
> >>>>>>
> >>>>> |PARTITIONED|
> >>>>>
> >>>>>> {noformat}
> >>>>>> The optimized plan is incorrect --- the index search doesn't use the
> >>>>>>
> >>>>> right join condition and hence the result is different from expected.
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> This message was sent by Atlassian JIRA
> >>>>> (v6.3.4#6332)
> >>>>>
> >>>>>
> >>
> >
>

Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Reply via email to