Interesting. Can you be more specific? Best, Taewoo
On Thu, Jan 7, 2016 at 6:36 PM, Yingyi Bu <[email protected]> wrote: > Sorry, to be more precise: > Left-outer hash join cannot preserve all local data properties for its > probe branch (because spilling can happen) but can preserve (or downgrade) > some when certain conditions meet. > > On Thu, Jan 7, 2016 at 6:28 PM, Yingyi Bu <[email protected]> wrote: > > > In the logical/physical query plan, I think it is statically determined. > > However, that doesn't mean the execution is faithful to that probe/build > > decision because we have the "role reversal" optimization for inner hash > > joins:-) > > (That's also why our inner hash join cannot maintain any local data > > property from its probe branch, but left-outer hash can preserve that.) > > > > Best, > > Yingyi > > > > > > On Thu, Jan 7, 2016 at 5:34 PM, Mike Carey <[email protected]> wrote: > > > >> Also, do we (separately want to make sure that our hash join behavior is > >> comparable - i.e., that the initial build/probe decision is statically > >> determined from the query? (I think we do want that, and I think it is > in > >> fact that way - but I'm not 100% sure, as its been awhile since that was > >> discussed, and it's not in-cache for me. :-)) > >> > >> > >> On 1/7/16 3:22 PM, Taewoo Kim wrote: > >> > >>> Thanks Yingyi. Yes. If there is an equality condition and if we can't > >>> transform a join into an index-nested loop join, then a hybrid hash > join > >>> will be used. > >>> > >>> Best, > >>> Taewoo > >>> > >>> On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <[email protected]> wrote: > >>> > >>> +1! > >>>> > >>>>> 3. We only try to use applicable indexes from the inner branch. So, > if > >>>>>> there are no applicable indexes from the inner branch, we abort > >>>>>> transforming a join into an index-nested-loop join. > >>>>>> > >>>>> "index-nested-loop join" -> "hybrid hash join"? > >>>> > >>>> Thanks! > >>>> > >>>> Yingyi > >>>> > >>>> > >>>> > >>>> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <[email protected]> > wrote: > >>>> > >>>> Hello dev, > >>>>> > >>>>> Regarding this issue (ASTERIXDB-1249), I would like to make an > >>>>> index-join > >>>>> hint clarification. Let's start with an example query other than a > >>>>> self-join query in this issue. > >>>>> > >>>>> for $c in dataset('Customers') > >>>>> > >>>>> for $o in dataset('Orders') > >>>>> > >>>>> where $c.cid /*+ indexnl */ = $o.cid > >>>>> > >>>>> order by $c.cid, $o.oid > >>>>> > >>>>> return {"cid":$c.cid, "oid": $o.oid} > >>>>> > >>>>> Right now, in the master branch, the first dataset (e.g., Customers) > >>>>> becomes the outer branch and the second dataset (e.g., Orders) > becomes > >>>>> > >>>> the > >>>> > >>>>> inner branch. And, when the optimizer tries to honor the given > indexnl > >>>>> > >>>> hint > >>>> > >>>>> (transforming a join into an index-nested-loop join), if there are > >>>>> applicable indexes from the inner branch (e.g., Orders), then it is > >>>>> going > >>>>> to use one of those indexes. If there are no applicable indexes from > >>>>> the > >>>>> inner branch, it tries to use indexes from the outer branch (e.g., > >>>>> Customers). We are going to change the last part; we will not use > >>>>> indexes > >>>>> from the outer branch. So, the following are refined rules for > >>>>> > >>>> transforming > >>>> > >>>>> a join into an index-nested-loop join. > >>>>> > >>>>> 1. The first dataset in a join (the first parameter of the given > join) > >>>>> becomes the outer branch. > >>>>> 2. The second dataset in a join (the second parameter of the given > >>>>> join) > >>>>> becomes the inner branch. > >>>>> 3. We only try to use applicable indexes from the inner branch. So, > if > >>>>> there are no applicable indexes from the inner branch, we abort > >>>>> transforming a join into an index-nested-loop join. > >>>>> 4. Variable order in the given join predicate is not important. It > can > >>>>> be > >>>>> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA. > >>>>> > >>>>> So, for the left-outer join and inner join altogether, the left > subtree > >>>>> > >>>> is > >>>> > >>>>> the probing side and the right subtree is the index side. So, this > can > >>>>> be > >>>>> applied to the self-join case, too just like the following query in > >>>>> this > >>>>> issue. In the following query, $t1 becomes the outer and $t2 becomes > >>>>> the > >>>>> inner. > >>>>> > >>>>> for $t1 in dataset('TweetMessages') > >>>>> > >>>>> for $t2 in dataset('TweetMessages') > >>>>> > >>>>> let $c := $t1.countA + 20 > >>>>> > >>>>> where $c /* +indexnl */= $t2.countB > >>>>> > >>>>> order by $t2.tweetid > >>>>> > >>>>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB}; > >>>>> > >>>>> Thank you. Any opinion would be appreciated before I finalize this > fix. > >>>>> > >>>>> Best, > >>>>> Taewoo > >>>>> > >>>>> ---------- Forwarded message ---------- > >>>>> From: Yingyi Bu (JIRA) <[email protected]> > >>>>> Date: Wed, Jan 6, 2016 at 10:23 AM > >>>>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses > >>>>> > >>>> wrong > >>>> > >>>>> probe/index branch > >>>>> To: [email protected] > >>>>> > >>>>> > >>>>> > >>>>> [ > >>>>> > >>>>> > >>>>> > >>>> > https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992 > >>>> > >>>>> ] > >>>>> > >>>>> Yingyi Bu commented on ASTERIXDB-1249: > >>>>> -------------------------------------- > >>>>> > >>>>> That's right. Basically in AcessMethod implementations, e.g.: > >>>>> > >>>>> > >>>>> > >>>> > https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124 > >>>> > >>>>> Choosing probe/index branch is only based on dataset names, instead > of > >>>>> being based on the join condition. > >>>>> > >>>>> Self index join chooses wrong probe/index branch > >>>>>> ------------------------------------------------ > >>>>>> > >>>>>> Key: ASTERIXDB-1249 > >>>>>> URL: > >>>>>> > >>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249 > >>>>> > >>>>>> Project: Apache AsterixDB > >>>>>> Issue Type: Bug > >>>>>> Components: Optimizer > >>>>>> Reporter: Yingyi Bu > >>>>>> Assignee: Taewoo Kim > >>>>>> > >>>>>> DDLs: > >>>>>> {noformat} > >>>>>> drop dataverse test if exists; > >>>>>> create dataverse test; > >>>>>> use dataverse test; > >>>>>> create type TwitterUserType as closed { > >>>>>> screen-name: string, > >>>>>> lang: string, > >>>>>> friends-count: int64, > >>>>>> statuses-count: int64, > >>>>>> name: string, > >>>>>> followers-count: int64 > >>>>>> } > >>>>>> create type TweetMessageType as closed { > >>>>>> tweetid: int64, > >>>>>> user: TwitterUserType, > >>>>>> sender-location: point, > >>>>>> send-time: datetime, > >>>>>> referred-topics: {{ string }}, > >>>>>> message-text: string, > >>>>>> countA: int64, > >>>>>> countB: int64 > >>>>>> } > >>>>>> create dataset TweetMessages(TweetMessageType) > >>>>>> primary key tweetid; > >>>>>> create index twmSndLocIx on TweetMessages(sender-location) type > rtree; > >>>>>> create index msgCountAIx on TweetMessages(countA) type btree; > >>>>>> create index msgCountBIx on TweetMessages(countB) type btree; > >>>>>> create index msgTextIx on TweetMessages(message-text) type keyword; > >>>>>> {noformat} > >>>>>> Query: > >>>>>> {noformat} > >>>>>> for $t1 in dataset('TweetMessages') > >>>>>> for $t2 in dataset('TweetMessages') > >>>>>> let $n := create-circle($t1.sender-location, 0.5) > >>>>>> where spatial-intersect($t2.sender-location, $n) > >>>>>> order by $t2.tweetid > >>>>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location}; > >>>>>> {noformat} > >>>>>> Optimized plan: > >>>>>> {noformat} > >>>>>> distribute result [%0->$$10] > >>>>>> -- DISTRIBUTE_RESULT |PARTITIONED| > >>>>>> exchange > >>>>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| > >>>>>> project ([$$10]) > >>>>>> -- STREAM_PROJECT |PARTITIONED| > >>>>>> assign [$$10] <- [function-call: > >>>>>> > >>>>> asterix:closed-record-constructor, > >>>> > >>>>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]] > >>>>> > >>>>>> -- ASSIGN |PARTITIONED| > >>>>>> exchange > >>>>>> -- SORT_MERGE_EXCHANGE [$$15(ASC) ] |PARTITIONED| > >>>>>> order (ASC, %0->$$15) > >>>>>> -- STABLE_SORT [$$15(ASC)] |PARTITIONED| > >>>>>> exchange > >>>>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| > >>>>>> project ([$$13, $$15]) > >>>>>> -- STREAM_PROJECT |PARTITIONED| > >>>>>> select (function-call: asterix:spatial-intersect, > >>>>>> > >>>>> Args:[%0->$$13, function-call: asterix:create-circle, > >>>>> > >>>> Args:[function-call: > >>>> > >>>>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}], ADouble: > >>>>> {0.5}]]) > >>>>> > >>>>>> -- STREAM_SELECT |PARTITIONED| > >>>>>> project ([$$0, $$13, $$15]) > >>>>>> -- STREAM_PROJECT |PARTITIONED| > >>>>>> exchange > >>>>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| > >>>>>> unnest-map [$$14, $$0] <- function-call: > >>>>>> > >>>>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0}, > >>>>> > >>>> AString: > >>>> > >>>>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean: > {false}, > >>>>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27, > TRUE, > >>>>> TRUE, TRUE] > >>>>> > >>>>>> -- BTREE_SEARCH |PARTITIONED| > >>>>>> exchange > >>>>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| > >>>>>> order (ASC, %0->$$27) > >>>>>> -- STABLE_SORT [$$27(ASC)] |PARTITIONED| > >>>>>> exchange > >>>>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| > >>>>>> project ([$$27, $$13, $$15]) > >>>>>> -- STREAM_PROJECT |PARTITIONED| > >>>>>> exchange > >>>>>> -- ONE_TO_ONE_EXCHANGE > |PARTITIONED| > >>>>>> unnest-map [$$23, $$24, $$25, > $$26, > >>>>>> > >>>>> $$27] <- function-call: asterix:index-search, Args:[AString: > >>>>> > >>>> {twmSndLocIx}, > >>>> > >>>>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean: > >>>>> {true}, > >>>>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19, %0->$$20, > >>>>> %0->$$21, %0->$$22] > >>>>> > >>>>>> -- RTREE_SEARCH |PARTITIONED| > >>>>>> exchange > >>>>>> -- BROADCAST_EXCHANGE > >>>>>> > >>>>> |PARTITIONED| > >>>> > >>>>> assign [$$19, $$20, $$21, > $$22] > >>>>>> > >>>>> <- > >>>> > >>>>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2}, > >>>>> AInt32: > >>>>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2}, > >>>>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13, > >>>>> AInt32: > >>>>> {2}, AInt32: {2}], function-call: asterix:create-mbr, Args:[%0->$$13, > >>>>> AInt32: {2}, AInt32: {3}]] > >>>>> > >>>>>> -- ASSIGN |PARTITIONED| > >>>>>> project ([$$13, $$15]) > >>>>>> -- STREAM_PROJECT > >>>>>> > >>>>> |PARTITIONED| > >>>> > >>>>> assign [$$13] <- > >>>>>> > >>>>> [function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32: > >>>>> > >>>> {2}]] > >>>> > >>>>> -- ASSIGN |PARTITIONED| > >>>>>> exchange > >>>>>> -- ONE_TO_ONE_EXCHANGE > >>>>>> > >>>>> |PARTITIONED| > >>>>> > >>>>>> data-scan []<-[$$15, > >>>>>> $$1] > >>>>>> > >>>>> <- test:TweetMessages > >>>>> > >>>>>> -- DATASOURCE_SCAN > >>>>>> > >>>>> |PARTITIONED| > >>>>> > >>>>>> exchange > >>>>>> -- > >>>>>> ONE_TO_ONE_EXCHANGE > >>>>>> > >>>>> |PARTITIONED| > >>>>> > >>>>>> > empty-tuple-source > >>>>>> -- > >>>>>> EMPTY_TUPLE_SOURCE > >>>>>> > >>>>> |PARTITIONED| > >>>>> > >>>>>> {noformat} > >>>>>> The optimized plan is incorrect --- the index search doesn't use the > >>>>>> > >>>>> right join condition and hence the result is different from expected. > >>>>> > >>>>> > >>>>> > >>>>> -- > >>>>> This message was sent by Atlassian JIRA > >>>>> (v6.3.4#6332) > >>>>> > >>>>> > >> > > >
