[ 
https://issues.apache.org/jira/browse/IMPALA-8262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated IMPALA-8262:
--------------------------------
    Description: 
Consider a subset of the plan for TPC-H query 7. (See {{tpch-all.test}} for 
details.)

{noformat}
11:AGGREGATE [FINALIZE]
|  output: sum(l_extendedprice * (1 - l_discount))
|  group by: n1.n_name, n2.n_name, year(l_shipdate)
|  row-size=58B cardinality=575.77K
|
10:HASH JOIN [INNER JOIN]
|  hash predicates: c_nationkey = n2.n_nationkey
|  other predicates: ((n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY') OR 
(n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE'))
|  row-size=132B cardinality=575.77K
|
|--05:SCAN HDFS [tpch.nation n2]
|     row-size=21B cardinality=25
|
09:HASH JOIN [INNER JOIN]
|  hash predicates: s_nationkey = n1.n_nationkey
|  row-size=111B cardinality=575.77K
{noformat}

Here, we have join 09 feeding 576K rows into join 10. All 576K rows pass along 
to the aggregate 11. Notice, however, that join 10 has a that picks out 2 of 
the 25 countries in each of two paths. The selectivity of the filters should be 
something like 2 * 2/25 = 0.16. Thus, the output cardinality of the 10 join 
should be 577K * 0.16 = 92K.

The problem is that the join cardinality calculations don't consider join 
filter selectivity.

It may be that this was done to handle the outer join case, in which filters 
applied in the outer-side scan must be re-applied on the join. Omitting the 
filters avoids duplicate accounting for the selectivity.

But, that case is special and should be handled specially as part of 
IMPALA-8213. Except for correlated filters, the planner *should* apply join 
filter selectivity to the join output cardinality calculations.

This error has consequences. The filter should reduce the number of rows though 
the join. Because it does so, it should come early in the join tree to reduce 
the set of rows processed. But, because selectivity is ignored, the planner 
does not see the join as a filter, and ends up putting the join 10 at the top 
of the join tree. (See the test file for the full plan.) The result is that 
Impala schleps around many more rows than necessary, only to discard them near 
the top of the DAG.

  was:
Consider a subset of the plan for TPC-H query 7. (See {{tpch-all.test}} for 
details.)

{noformat}
11:AGGREGATE [FINALIZE]
|  output: sum(l_extendedprice * (1 - l_discount))
|  group by: n1.n_name, n2.n_name, year(l_shipdate)
|  row-size=58B cardinality=575.77K
|
10:HASH JOIN [INNER JOIN]
|  hash predicates: c_nationkey = n2.n_nationkey
|  other predicates: ((n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY') OR 
(n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE'))
|  row-size=132B cardinality=575.77K
|
|--05:SCAN HDFS [tpch.nation n2]
|     row-size=21B cardinality=25
|
09:HASH JOIN [INNER JOIN]
|  hash predicates: s_nationkey = n1.n_nationkey
|  row-size=111B cardinality=575.77K
{noformat}

Here, we have join 09 feeding 576K rows into join 10. All 576K rows pass along 
to the aggregate 11. Notice, however, that join 10 has a that picks out 2 of 
the 25 countries in each of two paths. The selectivity of the filters should be 
something like 2 * 2/25 = 0.16. Thus, the output cardinality of the 10 join 
should be 577K * 0.16 = 92K.

The problem is that the join cardinality calculations don't consider join 
filter selectivity.

It may be that this was done to handle the outer join case, in which filters 
applied in the outer-side scan must be re-applied on the join. Omitting the 
filters avoids duplicate accounting for the selectivity.

But, that case is special and should be handled specially as part of 
IMPALA-8213. Except for correlated filters, the planner *should* apply join 
filter selectivity to the join output cardinality calculations.


> Join cardinality not decreased by join filter selectivity
> ---------------------------------------------------------
>
>                 Key: IMPALA-8262
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8262
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>
> Consider a subset of the plan for TPC-H query 7. (See {{tpch-all.test}} for 
> details.)
> {noformat}
> 11:AGGREGATE [FINALIZE]
> |  output: sum(l_extendedprice * (1 - l_discount))
> |  group by: n1.n_name, n2.n_name, year(l_shipdate)
> |  row-size=58B cardinality=575.77K
> |
> 10:HASH JOIN [INNER JOIN]
> |  hash predicates: c_nationkey = n2.n_nationkey
> |  other predicates: ((n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY') OR 
> (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE'))
> |  row-size=132B cardinality=575.77K
> |
> |--05:SCAN HDFS [tpch.nation n2]
> |     row-size=21B cardinality=25
> |
> 09:HASH JOIN [INNER JOIN]
> |  hash predicates: s_nationkey = n1.n_nationkey
> |  row-size=111B cardinality=575.77K
> {noformat}
> Here, we have join 09 feeding 576K rows into join 10. All 576K rows pass 
> along to the aggregate 11. Notice, however, that join 10 has a that picks out 
> 2 of the 25 countries in each of two paths. The selectivity of the filters 
> should be something like 2 * 2/25 = 0.16. Thus, the output cardinality of the 
> 10 join should be 577K * 0.16 = 92K.
> The problem is that the join cardinality calculations don't consider join 
> filter selectivity.
> It may be that this was done to handle the outer join case, in which filters 
> applied in the outer-side scan must be re-applied on the join. Omitting the 
> filters avoids duplicate accounting for the selectivity.
> But, that case is special and should be handled specially as part of 
> IMPALA-8213. Except for correlated filters, the planner *should* apply join 
> filter selectivity to the join output cardinality calculations.
> This error has consequences. The filter should reduce the number of rows 
> though the join. Because it does so, it should come early in the join tree to 
> reduce the set of rows processed. But, because selectivity is ignored, the 
> planner does not see the join as a filter, and ends up putting the join 10 at 
> the top of the join tree. (See the test file for the full plan.) The result 
> is that Impala schleps around many more rows than necessary, only to discard 
> them near the top of the DAG.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to