[ 
https://issues.apache.org/jira/browse/IMPALA-8040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned IMPALA-8040:
-----------------------------------

    Assignee:     (was: Paul Rogers)

> Incorrect double-accounting for WHERE on join column
> ----------------------------------------------------
>
>                 Key: IMPALA-8040
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8040
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> When a column appears as both a WHERE clause entry and a join condition, the 
> planner incorrectly reduces estimated join cardinality twice for the same 
> column.
> To see this, start with the simple case, join and WHERE on different columns.
> {code:sql}
> select m.c_custkey, d.o_custkey
> from tpch.customer m,
>      tpch.orders d
> where m.c_custkey = d.o_custkey
>   and m.c_name = 'foo'
> {code}
> Here, {{c_name}} is unique, so only one row matches the filter. There are, on 
> average, 10 rows per customer, so joining the one customer with its orders 
> should return 10 rows.
> Plan:
> {noformat}
> 02:HASH JOIN [INNER JOIN]
> |  hash predicates: d.o_custkey = m.c_custkey
> |  fk/pk conjuncts: d.o_custkey = m.c_custkey
> |  tuple-ids=1,0 row-size=46B cardinality=16
> |
> |--00:SCAN HDFS [tpch.customer m]
> |     partitions=1/1 files=1 size=23.08MB
> |     predicates: m.c_name = 'foo'
> |     tuple-ids=0 row-size=38B cardinality=1
> |
> 01:SCAN HDFS [tpch.orders d]
>    partitions=1/1 files=1 size=162.56MB
>    runtime filters: RF000[bloom] -> d.o_custkey
>    tuple-ids=1 row-size=8B cardinality=1500000
> {noformat}
> Notice the join cardinality of 16. (Should be 10, see IMPALA-8014).
> Now, do basically the same thing, but select that one customer base on ID, 
> not name.
> {noformat}
> select m.c_custkey, d.o_custkey
> from tpch.customer m,
>      tpch.orders d
> where m.c_custkey = d.o_custkey
>   and m.c_custkey = 10
> {noformat}
> Logically, we still get one customer and that customer still joins with its 
> 10 orders. But this time, we get the following plan:
> {noformat}
> 02:HASH JOIN [INNER JOIN]
> |  hash predicates: d.o_custkey = m.c_custkey
> |  fk/pk conjuncts: d.o_custkey = m.c_custkey
> |  tuple-ids=1,0 row-size=16B cardinality=1
> |
> |--00:SCAN HDFS [tpch.customer m]
> |     partitions=1/1 files=1 size=23.08MB
> |     predicates: m.c_custkey = CAST(10 AS BIGINT)
> |     tuple-ids=0 row-size=8B cardinality=1
> |
> 01:SCAN HDFS [tpch.orders d]
>    partitions=1/1 files=1 size=162.56MB
>    predicates: d.o_custkey = CAST(10 AS BIGINT)
>    tuple-ids=1 row-size=8B cardinality=15
> {noformat}
> Notice that, in the orders scan, the planner has applied the {{m.c_custkey = 
> 10}} to reduce the scan count, which is good. (Though, the cardinality 
> estimate is wrong, it should be 10 for TPC-H.)
> Then, we use the normal join calcs to assume that those 15 rows are matched 
> against the set of 150K customers. Since there is a very low probability of a 
> match, we guess about 1 row.
> The problem is, we are using the same column twice. If we already reduced the 
> orders scan by the one ID we are looking for, we can't then assume, in the 
> join, that we have the full range of IDs.
> The correct join cardinality would still be 10 if we handled this case 
> correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to