Paul Rogers created IMPALA-8040:
-----------------------------------

             Summary: Incorrect double-accounting for WHERE on join column
                 Key: IMPALA-8040
                 URL: https://issues.apache.org/jira/browse/IMPALA-8040
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 3.1.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers


When a column appears as both a WHERE clause entry and a join condition, the 
planner incorrectly reduces estimated join cardinality twice for the same 
column.

To see this, start with the simple case, join and WHERE on different columns.

{code:sql}
select m.c_custkey, d.o_custkey
from tpch.customer m,
     tpch.orders d
where m.c_custkey = d.o_custkey
  and m.c_name = 'foo'
{code}

Here, {{c_name}} is unique, so only one row matches the filter. There are, on 
average, 10 rows per customer, so joining the one customer with its orders 
should return 10 rows.

Plan:

{noformat}
02:HASH JOIN [INNER JOIN]
|  hash predicates: d.o_custkey = m.c_custkey
|  fk/pk conjuncts: d.o_custkey = m.c_custkey
|  tuple-ids=1,0 row-size=46B cardinality=16
|
|--00:SCAN HDFS [tpch.customer m]
|     partitions=1/1 files=1 size=23.08MB
|     predicates: m.c_name = 'foo'
|     tuple-ids=0 row-size=38B cardinality=1
|
01:SCAN HDFS [tpch.orders d]
   partitions=1/1 files=1 size=162.56MB
   runtime filters: RF000[bloom] -> d.o_custkey
   tuple-ids=1 row-size=8B cardinality=1500000
{noformat}

Notice the join cardinality of 16. (Should be 10, see IMPALA-8014).

Now, do basically the same thing, but select that one customer base on ID, not 
name.

{noformat}
select m.c_custkey, d.o_custkey
from tpch.customer m,
     tpch.orders d
where m.c_custkey = d.o_custkey
  and m.c_custkey = 10
{noformat}

Logically, we still get one customer and that customer still joins with its 10 
orders. But this time, we get the following plan:

{noformat}
02:HASH JOIN [INNER JOIN]
|  hash predicates: d.o_custkey = m.c_custkey
|  fk/pk conjuncts: d.o_custkey = m.c_custkey
|  tuple-ids=1,0 row-size=16B cardinality=1
|
|--00:SCAN HDFS [tpch.customer m]
|     partitions=1/1 files=1 size=23.08MB
|     predicates: m.c_custkey = CAST(10 AS BIGINT)
|     tuple-ids=0 row-size=8B cardinality=1
|
01:SCAN HDFS [tpch.orders d]
   partitions=1/1 files=1 size=162.56MB
   predicates: d.o_custkey = CAST(10 AS BIGINT)
   tuple-ids=1 row-size=8B cardinality=15
{noformat}

Notice that, in the orders scan, the planner has applied the {{m.c_custkey = 
10}} to reduce the scan count, which is good. (Though, the cardinality estimate 
is wrong, it should be 10 for TPC-H.)

Then, we use the normal join calcs to assume that those 15 rows are matched 
against the set of 150K customers. Since there is a very low probability of a 
match, we guess about 1 row.

The problem is, we are using the same column twice. If we already reduced the 
orders scan by the one ID we are looking for, we can't then assume, in the 
join, that we have the full range of IDs.

The correct join cardinality would still be 10 if we handled this case 
correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to