Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Mike Carey Sun, 10 Jan 2016 16:00:17 -0800

Just to be (or get) clear - the role reversal happens locally in lateriterations of the hash join, right, based on the relative sizes ofon-disk hash partitions? I think the key thing that we need to beconsistent about is which is the "basic" outer and inner - which in thecase of a parallel hash join would be which is the initial build datasetand which is the initial probe dataset.


On 1/8/16 9:08 AM, Yingyi Bu wrote:

In left-outer hash join, if the the probe branch is locally clustered (or
sorted) by a column superset of the join key,
the output will still be locally clustered.
Inner hash join couldn't maintain that because of the "role reversal"
optimization in the runtime.


Best,
Yingyi

On Thu, Jan 7, 2016 at 10:07 PM, Taewoo Kim <[email protected]> wrote:

Interesting. Can you be more specific?

Best,
Taewoo

On Thu, Jan 7, 2016 at 6:36 PM, Yingyi Bu <[email protected]> wrote:

Sorry, to be more precise:
Left-outer hash join cannot preserve all local data properties for its
probe branch (because spilling can happen) but can preserve (or

downgrade)

some when certain conditions meet.

On Thu, Jan 7, 2016 at 6:28 PM, Yingyi Bu <[email protected]> wrote:

In the logical/physical query plan, I think it is statically

determined.

However, that doesn't mean the execution is faithful to that

probe/build

decision because we have the "role reversal" optimization for inner

hash

joins:-)
(That's also why our inner hash join cannot maintain any local data
property from its probe branch, but left-outer hash can preserve that.)

Best,
Yingyi


On Thu, Jan 7, 2016 at 5:34 PM, Mike Carey <[email protected]> wrote:

Also, do we (separately want to make sure that our hash join behavior

is

comparable - i.e., that the initial build/probe decision is statically
determined from the query?  (I think we do want that, and I think it

is

in

fact that way - but I'm not 100% sure, as its been awhile since that

was

discussed, and it's not in-cache for me. :-))


On 1/7/16 3:22 PM, Taewoo Kim wrote:

Thanks Yingyi. Yes. If there is an equality condition and if we can't
transform a join into an index-nested loop join, then a hybrid hash

join

will be used.

Best,
Taewoo

On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <[email protected]>

wrote:

+1!

3. We only try to use applicable indexes from the inner branch. So,

if

there are no applicable indexes from the inner branch, we abort
transforming a join into an index-nested-loop join.

"index-nested-loop join" -> "hybrid hash join"?

Thanks!

Yingyi



On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <[email protected]>

wrote:

Hello dev,

Regarding this issue (ASTERIXDB-1249), I would like to make an
index-join
hint clarification. Let's start with an example query other than a
self-join query in this issue.

for $c in dataset('Customers')

for $o in dataset('Orders')

where $c.cid /*+ indexnl */ = $o.cid

order by $c.cid, $o.oid

return {"cid":$c.cid, "oid": $o.oid}

Right now, in the master branch, the first dataset (e.g.,

Customers)

becomes the outer branch and the second dataset (e.g., Orders)

becomes

the

inner branch. And, when the optimizer tries to honor the given

indexnl

hint

(transforming a join into an index-nested-loop join), if there are
applicable indexes from the inner branch (e.g., Orders), then it is
going
to use one of those indexes. If there are no applicable indexes

from

the
inner branch, it tries to use indexes from the outer branch (e.g.,
Customers). We are going to change the last part; we will not use
indexes
from the outer branch. So, the following are refined rules for

transforming

a join into an index-nested-loop join.

1. The first dataset in a join (the first parameter of the given

join)

becomes the outer branch.
2. The second dataset in a join (the second parameter of the given
join)
becomes the inner branch.
3. We only try to use applicable indexes from the inner branch. So,

if

there are no applicable indexes from the inner branch, we abort
transforming a join into an index-nested-loop join.
4. Variable order in the given join predicate is not important. It

can

be
either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.

So, for the left-outer join and inner join altogether, the left

subtree

is

the probing side and the right subtree is the index side. So, this

can

be
applied to the self-join case, too just like the following query in
this
issue. In the following query, $t1 becomes the outer and $t2

becomes

the
inner.

for $t1 in dataset('TweetMessages')

for $t2 in dataset('TweetMessages')

let $c := $t1.countA + 20

where $c /* +indexnl */= $t2.countB

order by $t2.tweetid

return {"tweetid2": $t2.tweetid, "count2":$t2.countB};

Thank you. Any opinion would be appreciated before I finalize this

fix.

Best,
Taewoo

---------- Forwarded message ----------
From: Yingyi Bu (JIRA) <[email protected]>
Date: Wed, Jan 6, 2016 at 10:23 AM
Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join

chooses

wrong

probe/index branch
To: [email protected]



      [

https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992

]

Yingyi Bu commented on ASTERIXDB-1249:
--------------------------------------

That's right. Basically in AcessMethod implementations, e.g.:

https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124

Choosing probe/index branch is only based on dataset names, instead

of

being based on the join condition.

Self index join chooses wrong probe/index branch

------------------------------------------------

                  Key: ASTERIXDB-1249
                  URL:

https://issues.apache.org/jira/browse/ASTERIXDB-1249

              Project: Apache AsterixDB
           Issue Type: Bug
           Components: Optimizer
             Reporter: Yingyi Bu
             Assignee: Taewoo Kim

DDLs:
{noformat}
drop dataverse test if exists;
create dataverse test;
use dataverse test;
create type TwitterUserType as closed {
      screen-name: string,
      lang: string,
      friends-count: int64,
      statuses-count: int64,
      name: string,
      followers-count: int64
}
create type TweetMessageType as closed {
      tweetid: int64,
          user: TwitterUserType,
          sender-location: point,
      send-time: datetime,
          referred-topics: {{ string }},
      message-text: string,
      countA: int64,
      countB: int64
}
create dataset TweetMessages(TweetMessageType)
primary key tweetid;
create index twmSndLocIx on TweetMessages(sender-location) type

rtree;

create index msgCountAIx on TweetMessages(countA) type btree;
create index msgCountBIx on TweetMessages(countB) type btree;
create index msgTextIx on TweetMessages(message-text) type

keyword;

{noformat}
Query:
{noformat}
for $t1 in dataset('TweetMessages')
for $t2 in dataset('TweetMessages')
let $n :=  create-circle($t1.sender-location, 0.5)
where spatial-intersect($t2.sender-location, $n)
order by $t2.tweetid
return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
{noformat}
Optimized plan:
{noformat}
distribute result [%0->$$10]
-- DISTRIBUTE_RESULT  |PARTITIONED|
    exchange
    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
      project ([$$10])
      -- STREAM_PROJECT  |PARTITIONED|
        assign [$$10] <- [function-call:

asterix:closed-record-constructor,
Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]

        -- ASSIGN  |PARTITIONED|
          exchange
          -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
            order (ASC, %0->$$15)
            -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
              exchange
              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                project ([$$13, $$15])
                -- STREAM_PROJECT  |PARTITIONED|
                  select (function-call: asterix:spatial-intersect,

Args:[%0->$$13, function-call: asterix:create-circle,

Args:[function-call:

asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}],

ADouble:

{0.5}]])

                  -- STREAM_SELECT  |PARTITIONED|
                    project ([$$0, $$13, $$15])
                    -- STREAM_PROJECT  |PARTITIONED|
                      exchange
                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                        unnest-map [$$14, $$0] <- function-call:

asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0},

AString:

{test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean:

{false},

ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27,

TRUE,

TRUE, TRUE]

                        -- BTREE_SEARCH  |PARTITIONED|
                          exchange
                          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                            order (ASC, %0->$$27)
                            -- STABLE_SORT [$$27(ASC)]

|PARTITIONED|

                              exchange
                              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                project ([$$27, $$13, $$15])
                                -- STREAM_PROJECT  |PARTITIONED|
                                  exchange
                                  -- ONE_TO_ONE_EXCHANGE

|PARTITIONED|

                                    unnest-map [$$23, $$24, $$25,

$$26,

$$27] <- function-call: asterix:index-search, Args:[AString:

{twmSndLocIx},

AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean:
{true},
ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19,

%0->$$20,

%0->$$21, %0->$$22]

                                    -- RTREE_SEARCH  |PARTITIONED|
                                      exchange
                                      -- BROADCAST_EXCHANGE

|PARTITIONED|
                                        assign [$$19, $$20, $$21,

$$22]

<-
[function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
AInt32:
{0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32:

{2},

AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13,
AInt32:
{2}, AInt32: {2}], function-call: asterix:create-mbr,

Args:[%0->$$13,

AInt32: {2}, AInt32: {3}]]

                                        -- ASSIGN  |PARTITIONED|
                                          project ([$$13, $$15])
                                          -- STREAM_PROJECT

|PARTITIONED|
                                            assign [$$13] <-
[function-call: asterix:field-access-by-index, Args:[%0->$$1,

AInt32:

{2}]]

                                            -- ASSIGN  |PARTITIONED|

                                              exchange
                                              --

ONE_TO_ONE_EXCHANGE

|PARTITIONED|

                                                data-scan

[]<-[$$15,

$$1]

<- test:TweetMessages

                                                -- DATASOURCE_SCAN

|PARTITIONED|

                                                  exchange
                                                  --
ONE_TO_ONE_EXCHANGE

|PARTITIONED|

empty-tuple-source

                                                    --
EMPTY_TUPLE_SOURCE

|PARTITIONED|

{noformat}
The optimized plan is incorrect --- the index search doesn't use

the

right join condition and hence the result is different from

expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch

Reply via email to