Paul Rogers created IMPALA-8040:
-----------------------------------
Summary: Incorrect double-accounting for WHERE on join column
Key: IMPALA-8040
URL: https://issues.apache.org/jira/browse/IMPALA-8040
Project: IMPALA
Issue Type: Bug
Components: Frontend
Affects Versions: Impala 3.1.0
Reporter: Paul Rogers
Assignee: Paul Rogers
When a column appears as both a WHERE clause entry and a join condition, the
planner incorrectly reduces estimated join cardinality twice for the same
column.
To see this, start with the simple case, join and WHERE on different columns.
{code:sql}
select m.c_custkey, d.o_custkey
from tpch.customer m,
tpch.orders d
where m.c_custkey = d.o_custkey
and m.c_name = 'foo'
{code}
Here, {{c_name}} is unique, so only one row matches the filter. There are, on
average, 10 rows per customer, so joining the one customer with its orders
should return 10 rows.
Plan:
{noformat}
02:HASH JOIN [INNER JOIN]
| hash predicates: d.o_custkey = m.c_custkey
| fk/pk conjuncts: d.o_custkey = m.c_custkey
| tuple-ids=1,0 row-size=46B cardinality=16
|
|--00:SCAN HDFS [tpch.customer m]
| partitions=1/1 files=1 size=23.08MB
| predicates: m.c_name = 'foo'
| tuple-ids=0 row-size=38B cardinality=1
|
01:SCAN HDFS [tpch.orders d]
partitions=1/1 files=1 size=162.56MB
runtime filters: RF000[bloom] -> d.o_custkey
tuple-ids=1 row-size=8B cardinality=1500000
{noformat}
Notice the join cardinality of 16. (Should be 10, see IMPALA-8014).
Now, do basically the same thing, but select that one customer base on ID, not
name.
{noformat}
select m.c_custkey, d.o_custkey
from tpch.customer m,
tpch.orders d
where m.c_custkey = d.o_custkey
and m.c_custkey = 10
{noformat}
Logically, we still get one customer and that customer still joins with its 10
orders. But this time, we get the following plan:
{noformat}
02:HASH JOIN [INNER JOIN]
| hash predicates: d.o_custkey = m.c_custkey
| fk/pk conjuncts: d.o_custkey = m.c_custkey
| tuple-ids=1,0 row-size=16B cardinality=1
|
|--00:SCAN HDFS [tpch.customer m]
| partitions=1/1 files=1 size=23.08MB
| predicates: m.c_custkey = CAST(10 AS BIGINT)
| tuple-ids=0 row-size=8B cardinality=1
|
01:SCAN HDFS [tpch.orders d]
partitions=1/1 files=1 size=162.56MB
predicates: d.o_custkey = CAST(10 AS BIGINT)
tuple-ids=1 row-size=8B cardinality=15
{noformat}
Notice that, in the orders scan, the planner has applied the {{m.c_custkey =
10}} to reduce the scan count, which is good. (Though, the cardinality estimate
is wrong, it should be 10 for TPC-H.)
Then, we use the normal join calcs to assume that those 15 rows are matched
against the set of 150K customers. Since there is a very low probability of a
match, we guess about 1 row.
The problem is, we are using the same column twice. If we already reduced the
orders scan by the one ID we are looking for, we can't then assume, in the
join, that we have the full range of IDs.
The correct join cardinality would still be 10 if we handled this case
correctly.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]