Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Taewoo Kim Thu, 07 Jan 2016 17:59:23 -0800

Tried to check it. I think so. There
is JoinsUtil.setJoinAlgorithmAndExchangeAlgo and isHashJoinCondition()
method checks the condition. @Pouria: please correct me if I'm wrong here.


Best,
Taewoo

On Thu, Jan 7, 2016 at 5:34 PM, Mike Carey <[email protected]> wrote:

> Also, do we (separately want to make sure that our hash join behavior is
> comparable - i.e., that the initial build/probe decision is statically
> determined from the query?  (I think we do want that, and I think it is in
> fact that way - but I'm not 100% sure, as its been awhile since that was
> discussed, and it's not in-cache for me. :-))
>
>
> On 1/7/16 3:22 PM, Taewoo Kim wrote:
>
>> Thanks Yingyi. Yes. If there is an equality condition and if we can't
>> transform a join into an index-nested loop join, then a hybrid hash join
>> will be used.
>>
>> Best,
>> Taewoo
>>
>> On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <[email protected]> wrote:
>>
>> +1!
>>>
>>>> 3. We only try to use applicable indexes from the inner branch. So, if
>>>>> there are no applicable indexes from the inner branch, we abort
>>>>> transforming a join into an index-nested-loop join.
>>>>>
>>>> "index-nested-loop join" -> "hybrid hash join"?
>>>
>>> Thanks!
>>>
>>> Yingyi
>>>
>>>
>>>
>>> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <[email protected]> wrote:
>>>
>>> Hello dev,
>>>>
>>>> Regarding this issue (ASTERIXDB-1249), I would like to make an
>>>> index-join
>>>> hint clarification. Let's start with an example query other than a
>>>> self-join query in this issue.
>>>>
>>>> for $c in dataset('Customers')
>>>>
>>>> for $o in dataset('Orders')
>>>>
>>>> where $c.cid /*+ indexnl */ = $o.cid
>>>>
>>>> order by $c.cid, $o.oid
>>>>
>>>> return {"cid":$c.cid, "oid": $o.oid}
>>>>
>>>> Right now, in the master branch, the first dataset (e.g., Customers)
>>>> becomes the outer branch and the second dataset (e.g., Orders) becomes
>>>>
>>> the
>>>
>>>> inner branch. And, when the optimizer tries to honor the given indexnl
>>>>
>>> hint
>>>
>>>> (transforming a join into an index-nested-loop join), if there are
>>>> applicable indexes from the inner branch (e.g., Orders), then it is
>>>> going
>>>> to use one of those indexes. If there are no applicable indexes from the
>>>> inner branch, it tries to use indexes from the outer branch (e.g.,
>>>> Customers). We are going to change the last part; we will not use
>>>> indexes
>>>> from the outer branch. So, the following are refined rules for
>>>>
>>> transforming
>>>
>>>> a join into an index-nested-loop join.
>>>>
>>>> 1. The first dataset in a join (the first parameter of the given join)
>>>> becomes the outer branch.
>>>> 2. The second dataset in a join (the second parameter of the given join)
>>>> becomes the inner branch.
>>>> 3. We only try to use applicable indexes from the inner branch. So, if
>>>> there are no applicable indexes from the inner branch, we abort
>>>> transforming a join into an index-nested-loop join.
>>>> 4. Variable order in the given join predicate is not important. It can
>>>> be
>>>> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.
>>>>
>>>> So, for the left-outer join and inner join altogether, the left subtree
>>>>
>>> is
>>>
>>>> the probing side and the right subtree is the index side. So, this can
>>>> be
>>>> applied to the self-join case, too just like the following query in this
>>>> issue. In the following query, $t1 becomes the outer and $t2 becomes the
>>>> inner.
>>>>
>>>> for $t1 in dataset('TweetMessages')
>>>>
>>>> for $t2 in dataset('TweetMessages')
>>>>
>>>> let $c := $t1.countA + 20
>>>>
>>>> where $c /* +indexnl */= $t2.countB
>>>>
>>>> order by $t2.tweetid
>>>>
>>>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
>>>>
>>>> Thank you. Any opinion would be appreciated before I finalize this fix.
>>>>
>>>> Best,
>>>> Taewoo
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: Yingyi Bu (JIRA) <[email protected]>
>>>> Date: Wed, Jan 6, 2016 at 10:23 AM
>>>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses
>>>>
>>> wrong
>>>
>>>> probe/index branch
>>>> To: [email protected]
>>>>
>>>>
>>>>
>>>>      [
>>>>
>>>>
>>>>
>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
>>>
>>>> ]
>>>>
>>>> Yingyi Bu commented on ASTERIXDB-1249:
>>>> --------------------------------------
>>>>
>>>> That's right. Basically in AcessMethod implementations, e.g.:
>>>>
>>>>
>>>>
>>> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
>>>
>>>> Choosing probe/index branch is only based on dataset names, instead of
>>>> being based on the join condition.
>>>>
>>>> Self index join chooses wrong probe/index branch
>>>>> ------------------------------------------------
>>>>>
>>>>>                  Key: ASTERIXDB-1249
>>>>>                  URL:
>>>>>
>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249
>>>>
>>>>>              Project: Apache AsterixDB
>>>>>           Issue Type: Bug
>>>>>           Components: Optimizer
>>>>>             Reporter: Yingyi Bu
>>>>>             Assignee: Taewoo Kim
>>>>>
>>>>> DDLs:
>>>>> {noformat}
>>>>> drop dataverse test if exists;
>>>>> create dataverse test;
>>>>> use dataverse test;
>>>>> create type TwitterUserType as closed {
>>>>>      screen-name: string,
>>>>>      lang: string,
>>>>>      friends-count: int64,
>>>>>      statuses-count: int64,
>>>>>      name: string,
>>>>>      followers-count: int64
>>>>> }
>>>>> create type TweetMessageType as closed {
>>>>>      tweetid: int64,
>>>>>          user: TwitterUserType,
>>>>>          sender-location: point,
>>>>>      send-time: datetime,
>>>>>          referred-topics: {{ string }},
>>>>>      message-text: string,
>>>>>      countA: int64,
>>>>>      countB: int64
>>>>> }
>>>>> create dataset TweetMessages(TweetMessageType)
>>>>> primary key tweetid;
>>>>> create index twmSndLocIx on TweetMessages(sender-location) type rtree;
>>>>> create index msgCountAIx on TweetMessages(countA) type btree;
>>>>> create index msgCountBIx on TweetMessages(countB) type btree;
>>>>> create index msgTextIx on TweetMessages(message-text) type keyword;
>>>>> {noformat}
>>>>> Query:
>>>>> {noformat}
>>>>> for $t1 in dataset('TweetMessages')
>>>>> for $t2 in dataset('TweetMessages')
>>>>> let $n :=  create-circle($t1.sender-location, 0.5)
>>>>> where spatial-intersect($t2.sender-location, $n)
>>>>> order by $t2.tweetid
>>>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
>>>>> {noformat}
>>>>> Optimized plan:
>>>>> {noformat}
>>>>> distribute result [%0->$$10]
>>>>> -- DISTRIBUTE_RESULT  |PARTITIONED|
>>>>>    exchange
>>>>>    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>      project ([$$10])
>>>>>      -- STREAM_PROJECT  |PARTITIONED|
>>>>>        assign [$$10] <- [function-call:
>>>>>
>>>> asterix:closed-record-constructor,
>>>
>>>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
>>>>
>>>>>        -- ASSIGN  |PARTITIONED|
>>>>>          exchange
>>>>>          -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
>>>>>            order (ASC, %0->$$15)
>>>>>            -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
>>>>>              exchange
>>>>>              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>                project ([$$13, $$15])
>>>>>                -- STREAM_PROJECT  |PARTITIONED|
>>>>>                  select (function-call: asterix:spatial-intersect,
>>>>>
>>>> Args:[%0->$$13, function-call: asterix:create-circle,
>>>>
>>> Args:[function-call:
>>>
>>>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}], ADouble:
>>>> {0.5}]])
>>>>
>>>>>                  -- STREAM_SELECT  |PARTITIONED|
>>>>>                    project ([$$0, $$13, $$15])
>>>>>                    -- STREAM_PROJECT  |PARTITIONED|
>>>>>                      exchange
>>>>>                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>                        unnest-map [$$14, $$0] <- function-call:
>>>>>
>>>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0},
>>>>
>>> AString:
>>>
>>>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean: {false},
>>>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27, TRUE,
>>>> TRUE, TRUE]
>>>>
>>>>>                        -- BTREE_SEARCH  |PARTITIONED|
>>>>>                          exchange
>>>>>                          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>                            order (ASC, %0->$$27)
>>>>>                            -- STABLE_SORT [$$27(ASC)]  |PARTITIONED|
>>>>>                              exchange
>>>>>                              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>                                project ([$$27, $$13, $$15])
>>>>>                                -- STREAM_PROJECT  |PARTITIONED|
>>>>>                                  exchange
>>>>>                                  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>>                                    unnest-map [$$23, $$24, $$25, $$26,
>>>>>
>>>> $$27] <- function-call: asterix:index-search, Args:[AString:
>>>>
>>> {twmSndLocIx},
>>>
>>>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean:
>>>> {true},
>>>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19, %0->$$20,
>>>> %0->$$21, %0->$$22]
>>>>
>>>>>                                    -- RTREE_SEARCH  |PARTITIONED|
>>>>>                                      exchange
>>>>>                                      -- BROADCAST_EXCHANGE
>>>>>
>>>> |PARTITIONED|
>>>
>>>>                                        assign [$$19, $$20, $$21, $$22]
>>>>>
>>>> <-
>>>
>>>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2}, AInt32:
>>>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
>>>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32:
>>>> {2}, AInt32: {2}], function-call: asterix:create-mbr, Args:[%0->$$13,
>>>> AInt32: {2}, AInt32: {3}]]
>>>>
>>>>>                                        -- ASSIGN  |PARTITIONED|
>>>>>                                          project ([$$13, $$15])
>>>>>                                          -- STREAM_PROJECT
>>>>>
>>>> |PARTITIONED|
>>>
>>>>                                            assign [$$13] <-
>>>>>
>>>> [function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32:
>>>>
>>> {2}]]
>>>
>>>>                                            -- ASSIGN  |PARTITIONED|
>>>>>                                              exchange
>>>>>                                              -- ONE_TO_ONE_EXCHANGE
>>>>>
>>>> |PARTITIONED|
>>>>
>>>>>                                                data-scan []<-[$$15,
>>>>> $$1]
>>>>>
>>>> <- test:TweetMessages
>>>>
>>>>>                                                -- DATASOURCE_SCAN
>>>>>
>>>> |PARTITIONED|
>>>>
>>>>>                                                  exchange
>>>>>                                                  -- ONE_TO_ONE_EXCHANGE
>>>>>
>>>> |PARTITIONED|
>>>>
>>>>>                                                    empty-tuple-source
>>>>>                                                    --
>>>>> EMPTY_TUPLE_SOURCE
>>>>>
>>>> |PARTITIONED|
>>>>
>>>>> {noformat}
>>>>> The optimized plan is incorrect --- the index search doesn't use the
>>>>>
>>>> right join condition and hence the result is different from expected.
>>>>
>>>>
>>>>
>>>> --
>>>> This message was sent by Atlassian JIRA
>>>> (v6.3.4#6332)
>>>>
>>>>
>

Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Reply via email to