Tried to check it. I think so. There is JoinsUtil.setJoinAlgorithmAndExchangeAlgo and isHashJoinCondition() method checks the condition. @Pouria: please correct me if I'm wrong here.
Best, Taewoo On Thu, Jan 7, 2016 at 5:34 PM, Mike Carey <[email protected]> wrote: > Also, do we (separately want to make sure that our hash join behavior is > comparable - i.e., that the initial build/probe decision is statically > determined from the query? (I think we do want that, and I think it is in > fact that way - but I'm not 100% sure, as its been awhile since that was > discussed, and it's not in-cache for me. :-)) > > > On 1/7/16 3:22 PM, Taewoo Kim wrote: > >> Thanks Yingyi. Yes. If there is an equality condition and if we can't >> transform a join into an index-nested loop join, then a hybrid hash join >> will be used. >> >> Best, >> Taewoo >> >> On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <[email protected]> wrote: >> >> +1! >>> >>>> 3. We only try to use applicable indexes from the inner branch. So, if >>>>> there are no applicable indexes from the inner branch, we abort >>>>> transforming a join into an index-nested-loop join. >>>>> >>>> "index-nested-loop join" -> "hybrid hash join"? >>> >>> Thanks! >>> >>> Yingyi >>> >>> >>> >>> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <[email protected]> wrote: >>> >>> Hello dev, >>>> >>>> Regarding this issue (ASTERIXDB-1249), I would like to make an >>>> index-join >>>> hint clarification. Let's start with an example query other than a >>>> self-join query in this issue. >>>> >>>> for $c in dataset('Customers') >>>> >>>> for $o in dataset('Orders') >>>> >>>> where $c.cid /*+ indexnl */ = $o.cid >>>> >>>> order by $c.cid, $o.oid >>>> >>>> return {"cid":$c.cid, "oid": $o.oid} >>>> >>>> Right now, in the master branch, the first dataset (e.g., Customers) >>>> becomes the outer branch and the second dataset (e.g., Orders) becomes >>>> >>> the >>> >>>> inner branch. And, when the optimizer tries to honor the given indexnl >>>> >>> hint >>> >>>> (transforming a join into an index-nested-loop join), if there are >>>> applicable indexes from the inner branch (e.g., Orders), then it is >>>> going >>>> to use one of those indexes. If there are no applicable indexes from the >>>> inner branch, it tries to use indexes from the outer branch (e.g., >>>> Customers). We are going to change the last part; we will not use >>>> indexes >>>> from the outer branch. So, the following are refined rules for >>>> >>> transforming >>> >>>> a join into an index-nested-loop join. >>>> >>>> 1. The first dataset in a join (the first parameter of the given join) >>>> becomes the outer branch. >>>> 2. The second dataset in a join (the second parameter of the given join) >>>> becomes the inner branch. >>>> 3. We only try to use applicable indexes from the inner branch. So, if >>>> there are no applicable indexes from the inner branch, we abort >>>> transforming a join into an index-nested-loop join. >>>> 4. Variable order in the given join predicate is not important. It can >>>> be >>>> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA. >>>> >>>> So, for the left-outer join and inner join altogether, the left subtree >>>> >>> is >>> >>>> the probing side and the right subtree is the index side. So, this can >>>> be >>>> applied to the self-join case, too just like the following query in this >>>> issue. In the following query, $t1 becomes the outer and $t2 becomes the >>>> inner. >>>> >>>> for $t1 in dataset('TweetMessages') >>>> >>>> for $t2 in dataset('TweetMessages') >>>> >>>> let $c := $t1.countA + 20 >>>> >>>> where $c /* +indexnl */= $t2.countB >>>> >>>> order by $t2.tweetid >>>> >>>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB}; >>>> >>>> Thank you. Any opinion would be appreciated before I finalize this fix. >>>> >>>> Best, >>>> Taewoo >>>> >>>> ---------- Forwarded message ---------- >>>> From: Yingyi Bu (JIRA) <[email protected]> >>>> Date: Wed, Jan 6, 2016 at 10:23 AM >>>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses >>>> >>> wrong >>> >>>> probe/index branch >>>> To: [email protected] >>>> >>>> >>>> >>>> [ >>>> >>>> >>>> >>> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992 >>> >>>> ] >>>> >>>> Yingyi Bu commented on ASTERIXDB-1249: >>>> -------------------------------------- >>>> >>>> That's right. Basically in AcessMethod implementations, e.g.: >>>> >>>> >>>> >>> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124 >>> >>>> Choosing probe/index branch is only based on dataset names, instead of >>>> being based on the join condition. >>>> >>>> Self index join chooses wrong probe/index branch >>>>> ------------------------------------------------ >>>>> >>>>> Key: ASTERIXDB-1249 >>>>> URL: >>>>> >>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249 >>>> >>>>> Project: Apache AsterixDB >>>>> Issue Type: Bug >>>>> Components: Optimizer >>>>> Reporter: Yingyi Bu >>>>> Assignee: Taewoo Kim >>>>> >>>>> DDLs: >>>>> {noformat} >>>>> drop dataverse test if exists; >>>>> create dataverse test; >>>>> use dataverse test; >>>>> create type TwitterUserType as closed { >>>>> screen-name: string, >>>>> lang: string, >>>>> friends-count: int64, >>>>> statuses-count: int64, >>>>> name: string, >>>>> followers-count: int64 >>>>> } >>>>> create type TweetMessageType as closed { >>>>> tweetid: int64, >>>>> user: TwitterUserType, >>>>> sender-location: point, >>>>> send-time: datetime, >>>>> referred-topics: {{ string }}, >>>>> message-text: string, >>>>> countA: int64, >>>>> countB: int64 >>>>> } >>>>> create dataset TweetMessages(TweetMessageType) >>>>> primary key tweetid; >>>>> create index twmSndLocIx on TweetMessages(sender-location) type rtree; >>>>> create index msgCountAIx on TweetMessages(countA) type btree; >>>>> create index msgCountBIx on TweetMessages(countB) type btree; >>>>> create index msgTextIx on TweetMessages(message-text) type keyword; >>>>> {noformat} >>>>> Query: >>>>> {noformat} >>>>> for $t1 in dataset('TweetMessages') >>>>> for $t2 in dataset('TweetMessages') >>>>> let $n := create-circle($t1.sender-location, 0.5) >>>>> where spatial-intersect($t2.sender-location, $n) >>>>> order by $t2.tweetid >>>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location}; >>>>> {noformat} >>>>> Optimized plan: >>>>> {noformat} >>>>> distribute result [%0->$$10] >>>>> -- DISTRIBUTE_RESULT |PARTITIONED| >>>>> exchange >>>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| >>>>> project ([$$10]) >>>>> -- STREAM_PROJECT |PARTITIONED| >>>>> assign [$$10] <- [function-call: >>>>> >>>> asterix:closed-record-constructor, >>> >>>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]] >>>> >>>>> -- ASSIGN |PARTITIONED| >>>>> exchange >>>>> -- SORT_MERGE_EXCHANGE [$$15(ASC) ] |PARTITIONED| >>>>> order (ASC, %0->$$15) >>>>> -- STABLE_SORT [$$15(ASC)] |PARTITIONED| >>>>> exchange >>>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| >>>>> project ([$$13, $$15]) >>>>> -- STREAM_PROJECT |PARTITIONED| >>>>> select (function-call: asterix:spatial-intersect, >>>>> >>>> Args:[%0->$$13, function-call: asterix:create-circle, >>>> >>> Args:[function-call: >>> >>>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}], ADouble: >>>> {0.5}]]) >>>> >>>>> -- STREAM_SELECT |PARTITIONED| >>>>> project ([$$0, $$13, $$15]) >>>>> -- STREAM_PROJECT |PARTITIONED| >>>>> exchange >>>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| >>>>> unnest-map [$$14, $$0] <- function-call: >>>>> >>>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0}, >>>> >>> AString: >>> >>>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean: {false}, >>>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27, TRUE, >>>> TRUE, TRUE] >>>> >>>>> -- BTREE_SEARCH |PARTITIONED| >>>>> exchange >>>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| >>>>> order (ASC, %0->$$27) >>>>> -- STABLE_SORT [$$27(ASC)] |PARTITIONED| >>>>> exchange >>>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| >>>>> project ([$$27, $$13, $$15]) >>>>> -- STREAM_PROJECT |PARTITIONED| >>>>> exchange >>>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| >>>>> unnest-map [$$23, $$24, $$25, $$26, >>>>> >>>> $$27] <- function-call: asterix:index-search, Args:[AString: >>>> >>> {twmSndLocIx}, >>> >>>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean: >>>> {true}, >>>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19, %0->$$20, >>>> %0->$$21, %0->$$22] >>>> >>>>> -- RTREE_SEARCH |PARTITIONED| >>>>> exchange >>>>> -- BROADCAST_EXCHANGE >>>>> >>>> |PARTITIONED| >>> >>>> assign [$$19, $$20, $$21, $$22] >>>>> >>>> <- >>> >>>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2}, AInt32: >>>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2}, >>>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: >>>> {2}, AInt32: {2}], function-call: asterix:create-mbr, Args:[%0->$$13, >>>> AInt32: {2}, AInt32: {3}]] >>>> >>>>> -- ASSIGN |PARTITIONED| >>>>> project ([$$13, $$15]) >>>>> -- STREAM_PROJECT >>>>> >>>> |PARTITIONED| >>> >>>> assign [$$13] <- >>>>> >>>> [function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32: >>>> >>> {2}]] >>> >>>> -- ASSIGN |PARTITIONED| >>>>> exchange >>>>> -- ONE_TO_ONE_EXCHANGE >>>>> >>>> |PARTITIONED| >>>> >>>>> data-scan []<-[$$15, >>>>> $$1] >>>>> >>>> <- test:TweetMessages >>>> >>>>> -- DATASOURCE_SCAN >>>>> >>>> |PARTITIONED| >>>> >>>>> exchange >>>>> -- ONE_TO_ONE_EXCHANGE >>>>> >>>> |PARTITIONED| >>>> >>>>> empty-tuple-source >>>>> -- >>>>> EMPTY_TUPLE_SOURCE >>>>> >>>> |PARTITIONED| >>>> >>>>> {noformat} >>>>> The optimized plan is incorrect --- the index search doesn't use the >>>>> >>>> right join condition and hence the result is different from expected. >>>> >>>> >>>> >>>> -- >>>> This message was sent by Atlassian JIRA >>>> (v6.3.4#6332) >>>> >>>> >
