Paul Rogers created IMPALA-8213:
-----------------------------------
Summary: Planner does not adjust join cardinality for correlated
filters
Key: IMPALA-8213
URL: https://issues.apache.org/jira/browse/IMPALA-8213
Project: IMPALA
Issue Type: Bug
Components: Frontend
Affects Versions: Impala 3.1.0
Reporter: Paul Rogers
Assignee: Paul Rogers
Consider a query such as the following:
{noformat}
SELECT *
FROM t1, t2
WHERE t1.k1 = t2.k2
AND t1.k1 = 10
{noformat}
The planner correctly identifies the join equivalence between columns {{t1.k1}}
and {{t2.k2}}, and pushes the {{t1.k1 = 10}} filter to the scans for both
{{t1}} and {{t2}}.
However, when computing the join cardinality, the planner uses the filtered
scan cardinality for both tables. As a result, the filter's selectivity is
applied twice:
{noformat}
|T1'| = |T1| * sel(t1.k1 = 10)
|T2'| = |T2| * sel(t2.k2 = 10)
sel(t1.k1 = 10) = sel(t2.k2 = 10)

                   |T1'| * |T2'|
|T1' ⋈ T2'| = ------------------------
              max(|T1.k1'|, |T2.k2'|)

              |T1| * |T2| * sel(t1.k1 = 10) ^ 2
|T1' ⋈ T2'| = ---------------------------------
                   max(|T1.k1'|, |T2.k2'|)
{noformat}
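Where:
* {{|T|}} is the cardinality of table T, and {{|T'|}} is its cardinality after the scan filter is applied
* {{|T.c'|}} is the number of distinct values (NDV) of column c after filtering
* {{sel(expr)}} is the selectivity of an expression
* The first join equation is the standard join cardinality relation; the second substitutes the scan cardinalities into it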
In relational theory, filtering before the join and filtering after the join
produce the same result, so both orders should yield the same cardinality
estimate. If we join first and then filter, the filter is applied once. The
form used by Impala, which applies the filter twice, is therefore incorrect.
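To make the discrepancy concrete, here is a small Python sketch that evaluates both orders. It is an illustration only, not Impala code: the function names and all statistics are hypothetical, and it assumes the same NDV denominator in both forms, as the equations in this issue are written.

```python
from fractions import Fraction


def filter_then_join(card_t1, card_t2, sel, ndv):
    """Filter at both scans, then join: the selectivity enters twice."""
    scan_t1 = card_t1 * sel
    scan_t2 = card_t2 * sel
    return scan_t1 * scan_t2 / ndv


def join_then_filter(card_t1, card_t2, sel, ndv):
    """Join the unfiltered tables, then filter: the selectivity enters once."""
    return card_t1 * card_t2 * sel / ndv


# Hypothetical statistics, chosen only for illustration.
card_t1 = 1_000_000
card_t2 = 10_000
sel = Fraction(1, 100)  # assumed selectivity of the correlated filter
ndv = 100               # assumed max NDV of the join keys

twice = filter_then_join(card_t1, card_t2, sel, ndv)
once = join_then_filter(card_t1, card_t2, sel, ndv)

print(twice)         # 10000
print(once)          # 1000000
print(twice / once)  # 1/100
```

Under these assumptions the two estimates differ by exactly one extra factor of the filter selectivity.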
The code must be modified to detect correlated filters and to back the
correlated selectivity out of one side of the join.
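One possible shape for the adjustment, sketched in Python rather than in the planner's own code: divide the correlated selectivity out of one input before applying the join relation. The function name, its parameters, and the statistics are hypothetical, and how the planner would detect the correlated filter is not shown.

```python
from fractions import Fraction


def join_cardinality_backing_out(scan_t1, scan_t2, correlated_sel, ndv):
    """Join estimate with the shared filter counted once.

    scan_t1 and scan_t2 are scan cardinalities that already include the
    correlated filter. Dividing one side by correlated_sel undoes the filter
    on that side, so the selectivity appears only once in the result.
    """
    if correlated_sel <= 0:
        # A zero selectivity cannot be backed out; the filtered join is empty.
        return 0
    unfiltered_t1 = scan_t1 / correlated_sel
    return unfiltered_t1 * scan_t2 / ndv


# Hypothetical statistics, chosen only for illustration.
sel = Fraction(1, 100)
scan_t1 = 1_000_000 * sel  # filtered scan cardinality of t1
scan_t2 = 10_000 * sel     # filtered scan cardinality of t2

estimate = join_cardinality_backing_out(scan_t1, scan_t2, sel, 100)
print(estimate)  # 1000000
```

With these inputs the result equals the join-then-filter estimate, in which the filter is applied once.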