Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Taewoo Kim Mon, 11 Jan 2016 09:43:11 -0800

Yes. As Yingyi said in a previous e-mail, during the logical/physical plan,
probe (left branch of join) / build (right branch of join) is
systematically decided and it will be kept if "role reversal" will not
happen.


Best,
Taewoo

On Sun, Jan 10, 2016 at 9:16 PM, Yingyi Bu <[email protected]> wrote:

> >> the role reversal happens locally in later iterations of the hash join,
> right, based on the relative sizes of on-disk hash partitions?
> That's right.
>
> >>I think the key thing that we need to be consistent about is which is the
> "basic" outer and inner
> >> - which in the case of a parallel hash join would be which is the
> initial build dataset and which is the initial probe dataset.
> Got it!
> Thanks!
>
> Best,
> Yingyi
>
>
> On Sun, Jan 10, 2016 at 3:59 PM, Mike Carey <[email protected]> wrote:
>
> > Just to be (or get) clear - the role reversal happens locally in later
> > iterations of the hash join, right, based on the relative sizes of
> on-disk
> > hash partitions?  I think the key thing that we need to be consistent
> about
> > is which is the "basic" outer and inner - which in the case of a parallel
> > hash join would be which is the initial build dataset and which is the
> > initial probe dataset.
> >
> >
> > On 1/8/16 9:08 AM, Yingyi Bu wrote:
> >
> >> In left-outer hash join, if the the probe branch is locally clustered
> (or
> >> sorted) by a column superset of the join key,
> >> the output will still be locally clustered.
> >> Inner hash join couldn't maintain that because of the "role reversal"
> >> optimization in the runtime.
> >>
> >> Best,
> >> Yingyi
> >>
> >> On Thu, Jan 7, 2016 at 10:07 PM, Taewoo Kim <[email protected]> wrote:
> >>
> >> Interesting. Can you be more specific?
> >>>
> >>> Best,
> >>> Taewoo
> >>>
> >>> On Thu, Jan 7, 2016 at 6:36 PM, Yingyi Bu <[email protected]> wrote:
> >>>
> >>> Sorry, to be more precise:
> >>>> Left-outer hash join cannot preserve all local data properties for its
> >>>> probe branch (because spilling can happen) but can preserve (or
> >>>>
> >>> downgrade)
> >>>
> >>>> some when certain conditions meet.
> >>>>
> >>>> On Thu, Jan 7, 2016 at 6:28 PM, Yingyi Bu <[email protected]> wrote:
> >>>>
> >>>> In the logical/physical query plan, I think it is statically
> >>>>>
> >>>> determined.
> >>>
> >>>> However, that doesn't mean the execution is faithful to that
> >>>>>
> >>>> probe/build
> >>>
> >>>> decision because we have the "role reversal" optimization for inner
> >>>>>
> >>>> hash
> >>>
> >>>> joins:-)
> >>>>> (That's also why our inner hash join cannot maintain any local data
> >>>>> property from its probe branch, but left-outer hash can preserve
> that.)
> >>>>>
> >>>>> Best,
> >>>>> Yingyi
> >>>>>
> >>>>>
> >>>>> On Thu, Jan 7, 2016 at 5:34 PM, Mike Carey <[email protected]>
> wrote:
> >>>>>
> >>>>> Also, do we (separately want to make sure that our hash join behavior
> >>>>>>
> >>>>> is
> >>>
> >>>> comparable - i.e., that the initial build/probe decision is statically
> >>>>>> determined from the query?  (I think we do want that, and I think it
> >>>>>>
> >>>>> is
> >>>
> >>>> in
> >>>>
> >>>>> fact that way - but I'm not 100% sure, as its been awhile since that
> >>>>>>
> >>>>> was
> >>>
> >>>> discussed, and it's not in-cache for me. :-))
> >>>>>>
> >>>>>>
> >>>>>> On 1/7/16 3:22 PM, Taewoo Kim wrote:
> >>>>>>
> >>>>>> Thanks Yingyi. Yes. If there is an equality condition and if we
> can't
> >>>>>>> transform a join into an index-nested loop join, then a hybrid hash
> >>>>>>>
> >>>>>> join
> >>>>
> >>>>> will be used.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Taewoo
> >>>>>>>
> >>>>>>> On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <[email protected]>
> >>>>>>>
> >>>>>> wrote:
> >>>
> >>>> +1!
> >>>>>>>
> >>>>>>>> 3. We only try to use applicable indexes from the inner branch.
> So,
> >>>>>>>>>
> >>>>>>>> if
> >>>>
> >>>>> there are no applicable indexes from the inner branch, we abort
> >>>>>>>>>> transforming a join into an index-nested-loop join.
> >>>>>>>>>>
> >>>>>>>>>> "index-nested-loop join" -> "hybrid hash join"?
> >>>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>> Yingyi
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <[email protected]>
> >>>>>>>>
> >>>>>>> wrote:
> >>>>
> >>>>> Hello dev,
> >>>>>>>>
> >>>>>>>>> Regarding this issue (ASTERIXDB-1249), I would like to make an
> >>>>>>>>> index-join
> >>>>>>>>> hint clarification. Let's start with an example query other than
> a
> >>>>>>>>> self-join query in this issue.
> >>>>>>>>>
> >>>>>>>>> for $c in dataset('Customers')
> >>>>>>>>>
> >>>>>>>>> for $o in dataset('Orders')
> >>>>>>>>>
> >>>>>>>>> where $c.cid /*+ indexnl */ = $o.cid
> >>>>>>>>>
> >>>>>>>>> order by $c.cid, $o.oid
> >>>>>>>>>
> >>>>>>>>> return {"cid":$c.cid, "oid": $o.oid}
> >>>>>>>>>
> >>>>>>>>> Right now, in the master branch, the first dataset (e.g.,
> >>>>>>>>>
> >>>>>>>> Customers)
> >>>
> >>>> becomes the outer branch and the second dataset (e.g., Orders)
> >>>>>>>>>
> >>>>>>>> becomes
> >>>>
> >>>>> the
> >>>>>>>>
> >>>>>>>> inner branch. And, when the optimizer tries to honor the given
> >>>>>>>>>
> >>>>>>>> indexnl
> >>>>
> >>>>> hint
> >>>>>>>>
> >>>>>>>> (transforming a join into an index-nested-loop join), if there are
> >>>>>>>>> applicable indexes from the inner branch (e.g., Orders), then it
> is
> >>>>>>>>> going
> >>>>>>>>> to use one of those indexes. If there are no applicable indexes
> >>>>>>>>>
> >>>>>>>> from
> >>>
> >>>> the
> >>>>>>>>> inner branch, it tries to use indexes from the outer branch
> (e.g.,
> >>>>>>>>> Customers). We are going to change the last part; we will not use
> >>>>>>>>> indexes
> >>>>>>>>> from the outer branch. So, the following are refined rules for
> >>>>>>>>>
> >>>>>>>>> transforming
> >>>>>>>>
> >>>>>>>> a join into an index-nested-loop join.
> >>>>>>>>>
> >>>>>>>>> 1. The first dataset in a join (the first parameter of the given
> >>>>>>>>>
> >>>>>>>> join)
> >>>>
> >>>>> becomes the outer branch.
> >>>>>>>>> 2. The second dataset in a join (the second parameter of the
> given
> >>>>>>>>> join)
> >>>>>>>>> becomes the inner branch.
> >>>>>>>>> 3. We only try to use applicable indexes from the inner branch.
> So,
> >>>>>>>>>
> >>>>>>>> if
> >>>>
> >>>>> there are no applicable indexes from the inner branch, we abort
> >>>>>>>>> transforming a join into an index-nested-loop join.
> >>>>>>>>> 4. Variable order in the given join predicate is not important.
> It
> >>>>>>>>>
> >>>>>>>> can
> >>>>
> >>>>> be
> >>>>>>>>> either outer.fieldA = inner.fieldB or inner.fieldB =
> outer.fieldA.
> >>>>>>>>>
> >>>>>>>>> So, for the left-outer join and inner join altogether, the left
> >>>>>>>>>
> >>>>>>>> subtree
> >>>>
> >>>>> is
> >>>>>>>>
> >>>>>>>> the probing side and the right subtree is the index side. So, this
> >>>>>>>>>
> >>>>>>>> can
> >>>>
> >>>>> be
> >>>>>>>>> applied to the self-join case, too just like the following query
> in
> >>>>>>>>> this
> >>>>>>>>> issue. In the following query, $t1 becomes the outer and $t2
> >>>>>>>>>
> >>>>>>>> becomes
> >>>
> >>>> the
> >>>>>>>>> inner.
> >>>>>>>>>
> >>>>>>>>> for $t1 in dataset('TweetMessages')
> >>>>>>>>>
> >>>>>>>>> for $t2 in dataset('TweetMessages')
> >>>>>>>>>
> >>>>>>>>> let $c := $t1.countA + 20
> >>>>>>>>>
> >>>>>>>>> where $c /* +indexnl */= $t2.countB
> >>>>>>>>>
> >>>>>>>>> order by $t2.tweetid
> >>>>>>>>>
> >>>>>>>>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
> >>>>>>>>>
> >>>>>>>>> Thank you. Any opinion would be appreciated before I finalize
> this
> >>>>>>>>>
> >>>>>>>> fix.
> >>>>
> >>>>> Best,
> >>>>>>>>> Taewoo
> >>>>>>>>>
> >>>>>>>>> ---------- Forwarded message ----------
> >>>>>>>>> From: Yingyi Bu (JIRA) <[email protected]>
> >>>>>>>>> Date: Wed, Jan 6, 2016 at 10:23 AM
> >>>>>>>>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join
> >>>>>>>>>
> >>>>>>>> chooses
> >>>
> >>>> wrong
> >>>>>>>>
> >>>>>>>> probe/index branch
> >>>>>>>>> To: [email protected]
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>       [
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>
> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
> >>>
> >>>> ]
> >>>>>>>>>
> >>>>>>>>> Yingyi Bu commented on ASTERIXDB-1249:
> >>>>>>>>> --------------------------------------
> >>>>>>>>>
> >>>>>>>>> That's right. Basically in AcessMethod implementations, e.g.:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>
> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
> >>>
> >>>> Choosing probe/index branch is only based on dataset names, instead
> >>>>>>>>>
> >>>>>>>> of
> >>>>
> >>>>> being based on the join condition.
> >>>>>>>>>
> >>>>>>>>> Self index join chooses wrong probe/index branch
> >>>>>>>>>
> >>>>>>>>>> ------------------------------------------------
> >>>>>>>>>>
> >>>>>>>>>>                   Key: ASTERIXDB-1249
> >>>>>>>>>>                   URL:
> >>>>>>>>>>
> >>>>>>>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249
> >>>>>>>>>
> >>>>>>>>>               Project: Apache AsterixDB
> >>>>>>>>>>            Issue Type: Bug
> >>>>>>>>>>            Components: Optimizer
> >>>>>>>>>>              Reporter: Yingyi Bu
> >>>>>>>>>>              Assignee: Taewoo Kim
> >>>>>>>>>>
> >>>>>>>>>> DDLs:
> >>>>>>>>>> {noformat}
> >>>>>>>>>> drop dataverse test if exists;
> >>>>>>>>>> create dataverse test;
> >>>>>>>>>> use dataverse test;
> >>>>>>>>>> create type TwitterUserType as closed {
> >>>>>>>>>>       screen-name: string,
> >>>>>>>>>>       lang: string,
> >>>>>>>>>>       friends-count: int64,
> >>>>>>>>>>       statuses-count: int64,
> >>>>>>>>>>       name: string,
> >>>>>>>>>>       followers-count: int64
> >>>>>>>>>> }
> >>>>>>>>>> create type TweetMessageType as closed {
> >>>>>>>>>>       tweetid: int64,
> >>>>>>>>>>           user: TwitterUserType,
> >>>>>>>>>>           sender-location: point,
> >>>>>>>>>>       send-time: datetime,
> >>>>>>>>>>           referred-topics: {{ string }},
> >>>>>>>>>>       message-text: string,
> >>>>>>>>>>       countA: int64,
> >>>>>>>>>>       countB: int64
> >>>>>>>>>> }
> >>>>>>>>>> create dataset TweetMessages(TweetMessageType)
> >>>>>>>>>> primary key tweetid;
> >>>>>>>>>> create index twmSndLocIx on TweetMessages(sender-location) type
> >>>>>>>>>>
> >>>>>>>>> rtree;
> >>>>
> >>>>> create index msgCountAIx on TweetMessages(countA) type btree;
> >>>>>>>>>> create index msgCountBIx on TweetMessages(countB) type btree;
> >>>>>>>>>> create index msgTextIx on TweetMessages(message-text) type
> >>>>>>>>>>
> >>>>>>>>> keyword;
> >>>
> >>>> {noformat}
> >>>>>>>>>> Query:
> >>>>>>>>>> {noformat}
> >>>>>>>>>> for $t1 in dataset('TweetMessages')
> >>>>>>>>>> for $t2 in dataset('TweetMessages')
> >>>>>>>>>> let $n :=  create-circle($t1.sender-location, 0.5)
> >>>>>>>>>> where spatial-intersect($t2.sender-location, $n)
> >>>>>>>>>> order by $t2.tweetid
> >>>>>>>>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
> >>>>>>>>>> {noformat}
> >>>>>>>>>> Optimized plan:
> >>>>>>>>>> {noformat}
> >>>>>>>>>> distribute result [%0->$$10]
> >>>>>>>>>> -- DISTRIBUTE_RESULT  |PARTITIONED|
> >>>>>>>>>>     exchange
> >>>>>>>>>>     -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>>>>>       project ([$$10])
> >>>>>>>>>>       -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>>>>>         assign [$$10] <- [function-call:
> >>>>>>>>>>
> >>>>>>>>>> asterix:closed-record-constructor,
> >>>>>>>>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
> >>>>>>>>>
> >>>>>>>>>         -- ASSIGN  |PARTITIONED|
> >>>>>>>>>>           exchange
> >>>>>>>>>>           -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
> >>>>>>>>>>             order (ASC, %0->$$15)
> >>>>>>>>>>             -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
> >>>>>>>>>>               exchange
> >>>>>>>>>>               -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>>>>>                 project ([$$13, $$15])
> >>>>>>>>>>                 -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>>>>>                   select (function-call:
> >>>>>>>>>> asterix:spatial-intersect,
> >>>>>>>>>>
> >>>>>>>>>> Args:[%0->$$13, function-call: asterix:create-circle,
> >>>>>>>>>
> >>>>>>>>> Args:[function-call:
> >>>>>>>>
> >>>>>>>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}],
> >>>>>>>>>
> >>>>>>>> ADouble:
> >>>
> >>>> {0.5}]])
> >>>>>>>>>
> >>>>>>>>>                   -- STREAM_SELECT  |PARTITIONED|
> >>>>>>>>>>                     project ([$$0, $$13, $$15])
> >>>>>>>>>>                     -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>>>>>                       exchange
> >>>>>>>>>>                       -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>>>>>                         unnest-map [$$14, $$0] <- function-call:
> >>>>>>>>>>
> >>>>>>>>>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32:
> {0},
> >>>>>>>>>
> >>>>>>>>> AString:
> >>>>>>>>
> >>>>>>>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean:
> >>>>>>>>>
> >>>>>>>> {false},
> >>>>
> >>>>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27,
> >>>>>>>>>
> >>>>>>>> TRUE,
> >>>>
> >>>>> TRUE, TRUE]
> >>>>>>>>>
> >>>>>>>>>                         -- BTREE_SEARCH  |PARTITIONED|
> >>>>>>>>>>                           exchange
> >>>>>>>>>>                           -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >>>>>>>>>>                             order (ASC, %0->$$27)
> >>>>>>>>>>                             -- STABLE_SORT [$$27(ASC)]
> >>>>>>>>>>
> >>>>>>>>> |PARTITIONED|
> >>>
> >>>>                               exchange
> >>>>>>>>>>                               -- ONE_TO_ONE_EXCHANGE
> >>>>>>>>>> |PARTITIONED|
> >>>>>>>>>>                                 project ([$$27, $$13, $$15])
> >>>>>>>>>>                                 -- STREAM_PROJECT  |PARTITIONED|
> >>>>>>>>>>                                   exchange
> >>>>>>>>>>                                   -- ONE_TO_ONE_EXCHANGE
> >>>>>>>>>>
> >>>>>>>>> |PARTITIONED|
> >>>>
> >>>>>                                     unnest-map [$$23, $$24, $$25,
> >>>>>>>>>>
> >>>>>>>>> $$26,
> >>>>
> >>>>> $$27] <- function-call: asterix:index-search, Args:[AString:
> >>>>>>>>>
> >>>>>>>>> {twmSndLocIx},
> >>>>>>>>
> >>>>>>>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean:
> >>>>>>>>> {true},
> >>>>>>>>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19,
> >>>>>>>>>
> >>>>>>>> %0->$$20,
> >>>
> >>>> %0->$$21, %0->$$22]
> >>>>>>>>>
> >>>>>>>>>                                     -- RTREE_SEARCH
> |PARTITIONED|
> >>>>>>>>>>                                       exchange
> >>>>>>>>>>                                       -- BROADCAST_EXCHANGE
> >>>>>>>>>>
> >>>>>>>>>> |PARTITIONED|
> >>>>>>>>>                                         assign [$$19, $$20, $$21,
> >>>>>>>>>
> >>>>>>>> $$22]
> >>>>
> >>>>> <-
> >>>>>>>>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
> >>>>>>>>> AInt32:
> >>>>>>>>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32:
> >>>>>>>>>
> >>>>>>>> {2},
> >>>
> >>>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13,
> >>>>>>>>> AInt32:
> >>>>>>>>> {2}, AInt32: {2}], function-call: asterix:create-mbr,
> >>>>>>>>>
> >>>>>>>> Args:[%0->$$13,
> >>>
> >>>> AInt32: {2}, AInt32: {3}]]
> >>>>>>>>>
> >>>>>>>>>                                         -- ASSIGN  |PARTITIONED|
> >>>>>>>>>>                                           project ([$$13, $$15])
> >>>>>>>>>>                                           -- STREAM_PROJECT
> >>>>>>>>>>
> >>>>>>>>>> |PARTITIONED|
> >>>>>>>>>                                             assign [$$13] <-
> >>>>>>>>> [function-call: asterix:field-access-by-index, Args:[%0->$$1,
> >>>>>>>>>
> >>>>>>>> AInt32:
> >>>
> >>>> {2}]]
> >>>>>>>>
> >>>>>>>>                                             -- ASSIGN
> |PARTITIONED|
> >>>>>>>>>
> >>>>>>>>>>                                               exchange
> >>>>>>>>>>                                               --
> >>>>>>>>>>
> >>>>>>>>> ONE_TO_ONE_EXCHANGE
> >>>
> >>>> |PARTITIONED|
> >>>>>>>>>
> >>>>>>>>>                                                 data-scan
> >>>>>>>>>>
> >>>>>>>>> []<-[$$15,
> >>>
> >>>> $$1]
> >>>>>>>>>>
> >>>>>>>>>> <- test:TweetMessages
> >>>>>>>>>
> >>>>>>>>>                                                 --
> DATASOURCE_SCAN
> >>>>>>>>>>
> >>>>>>>>>> |PARTITIONED|
> >>>>>>>>>
> >>>>>>>>>                                                   exchange
> >>>>>>>>>>                                                   --
> >>>>>>>>>> ONE_TO_ONE_EXCHANGE
> >>>>>>>>>>
> >>>>>>>>>> |PARTITIONED|
> >>>>>>>>>
> >>>>>>>>> empty-tuple-source
> >>>>
> >>>>>                                                     --
> >>>>>>>>>> EMPTY_TUPLE_SOURCE
> >>>>>>>>>>
> >>>>>>>>>> |PARTITIONED|
> >>>>>>>>>
> >>>>>>>>> {noformat}
> >>>>>>>>>> The optimized plan is incorrect --- the index search doesn't use
> >>>>>>>>>>
> >>>>>>>>> the
> >>>
> >>>> right join condition and hence the result is different from
> >>>>>>>>>
> >>>>>>>> expected.
> >>>
> >>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> This message was sent by Atlassian JIRA
> >>>>>>>>> (v6.3.4#6332)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >
>

Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Reply via email to