Paul Rogers created IMPALA-8213:
-----------------------------------

             Summary: Planner does not adjust join cardinality for correlated 
filters
                 Key: IMPALA-8213
                 URL: https://issues.apache.org/jira/browse/IMPALA-8213
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 3.1.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers


Consider a query such as the following:

{noformat}
SELECT *
FROM t1, t2
WHERE t1.k1 = t2.k2
  AND t1.k1 = 10
{noformat}

The planner correctly identifies the join equivalence between columns {{t1.k1}} 
and {{t2.k2}}, and pushes the {{t1.k1 = 10}} filter to the scans for both 
{{t1}} and {{t2}}.

However, when computing the join cardinality, the planner uses the 
already-filtered scan cardinality for both tables. This means the filter's 
selectivity is applied twice:

{noformat}
|T1'| = |T1| * sel(t1.k1 = 10)

|T2'| = |T2| * sel(t2.k2 = 10)

sel(t1.k1 = 10) = sel(t2.k2 = 10)


                  |T1'| * |T2'| 
|T1' ⋈ T2'| = ------------------------
               max(|T1.k1'|, |T2.k2'|)


              |T1| * |T2| * sel(t1.k1 = 10) ^ 2
|T1' ⋈ T2'| = ---------------------------------
                   max(|T1.k1'|, |T2.k2'|)
{noformat}

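Where:

* {{|T|}} is the cardinality of table T
* {{|T.k|}} is the number of distinct values (NDV) of column k
* A prime (as in {{T1'}}) marks a value after the filter has been applied
* {{sel(expr)}} is the selectivity of an expression
* The equation is the standard join cardinality relation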

In relational theory, joining first and then filtering produces the same 
result as filtering first and then joining, so both orders should yield the 
same cardinality estimate. If we join first and then filter, the filter is 
applied once. The form used by Impala, which applies the filter twice, is 
therefore incorrect.
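
To make the discrepancy concrete, here is a small numeric sketch in Python. It is an illustration, not Impala code. The table sizes and NDV are made-up statistics, and the join-key NDV in the denominator is assumed to be the same in both computations so that only the effect of the filter is visible.

```python
def join_cardinality(left_rows, right_rows, left_ndv, right_ndv):
    """Standard equi-join estimate: |L| * |R| / max(ndv(L.k), ndv(R.k))."""
    return left_rows * right_rows / max(left_ndv, right_ndv)


# Hypothetical statistics, chosen only for illustration.
t1_rows = 1_000
t2_rows = 10_000
key_ndv = 100        # assumed NDV of both t1.k1 and t2.k2, held fixed below
sel = 1 / key_ndv    # selectivity of "k = 10" under a uniform assumption

# Filter first, then join: both scan cardinalities already include sel,
# so the join estimate contains sel twice.
filter_first = join_cardinality(t1_rows * sel, t2_rows * sel, key_ndv, key_ndv)

# Join first, then filter: sel is applied once to the join output.
join_first = join_cardinality(t1_rows, t2_rows, key_ndv, key_ndv) * sel

# The two estimates differ by exactly one factor of sel.
ratio = filter_first / join_first
print(filter_first, join_first, ratio)
```

Under these assumptions the filter-first estimate is smaller than the join-first estimate by a factor of {{sel}}, which is the extra application of the filter described above.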

The code must be modified to detect correlated filters and to back the 
correlated selectivity out of one side of the join.
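
One possible shape for that correction is sketched below in Python. This is not the planner's actual code: the function names and parameters are hypothetical, and the sketch assumes the correlated filter has already been detected and its selectivity is known. It divides the selectivity back out of one input before applying the standard join formula.

```python
def join_cardinality(left_rows, right_rows, left_ndv, right_ndv):
    """Standard equi-join estimate: |L| * |R| / max(ndv(L.k), ndv(R.k))."""
    return left_rows * right_rows / max(left_ndv, right_ndv)


def corrected_join_cardinality(left_scan_rows, right_scan_rows,
                               left_ndv, right_ndv, correlated_sel):
    """Hypothetical helper: remove a correlated filter from one join input.

    left_scan_rows and right_scan_rows are post-filter scan cardinalities.
    correlated_sel is the selectivity of the filter that was pushed to
    both scans through the join equivalence.
    """
    if not 0.0 < correlated_sel <= 1.0:
        raise ValueError("correlated_sel must be in (0, 1]")
    # Undo the filter on the right side so it is counted only once.
    right_unfiltered = right_scan_rows / correlated_sel
    return join_cardinality(left_scan_rows, right_unfiltered,
                            left_ndv, right_ndv)


# Made-up inputs: scans of 10 and 100 rows after a filter with sel = 0.01,
# and an assumed join-key NDV of 100 on both sides.
naive = join_cardinality(10, 100, 100, 100)
corrected = corrected_join_cardinality(10, 100, 100, 100, 0.01)
print(naive, corrected)
```

With these made-up inputs the naive estimate is 10 rows and the corrected estimate is about 1,000 rows, which matches what joining first and filtering once would give for the same statistics.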



