[
https://issues.apache.org/jira/browse/IMPALA-8262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers updated IMPALA-8262:
--------------------------------
Description:
Consider a subset of the plan for TPC-H query 7. (See {{tpch-all.test}} for
details.)
{noformat}
11:AGGREGATE [FINALIZE]
| output: sum(l_extendedprice * (1 - l_discount))
| group by: n1.n_name, n2.n_name, year(l_shipdate)
| row-size=58B cardinality=575.77K
|
10:HASH JOIN [INNER JOIN]
| hash predicates: c_nationkey = n2.n_nationkey
| other predicates: ((n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY') OR
(n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE'))
| row-size=132B cardinality=575.77K
|
|--05:SCAN HDFS [tpch.nation n2]
| row-size=21B cardinality=25
|
09:HASH JOIN [INNER JOIN]
| hash predicates: s_nationkey = n1.n_nationkey
| row-size=111B cardinality=575.77K
{noformat}
Here, we have join 09 feeding 576K rows into join 10. All 576K rows pass along
to the aggregate 11. Notice, however, that join 10 has a that picks out 2 of
the 25 countries in each of two paths. The selectivity of the filters should be
something like 2 * 2/25 = 0.16. Thus, the output cardinality of the 10 join
should be 577K * 0.16 = 92K.
The problem is that the join cardinality calculations don't consider join
filter selectivity.
It may be that this was done to handle the outer join case, in which filters
applied in the outer-side scan must be re-applied on the join. Omitting the
filters avoids duplicate accounting for the selectivity.
But, that case is special and should be handled specially as part of
IMPALA-8213. Except for correlated filters, the planner *should* apply join
filter selectivity to the join output cardinality calculations.
This error has consequences. The filter should reduce the number of rows though
the join. Because it does so, it should come early in the join tree to reduce
the set of rows processed. But, because selectivity is ignored, the planner
does not see the join as a filter, and ends up putting the join 10 at the top
of the join tree. (See the test file for the full plan.) The result is that
Impala schleps around many more rows than necessary, only to discard them near
the top of the DAG.
was:
Consider a subset of the plan for TPC-H query 7. (See {{tpch-all.test}} for
details.)
{noformat}
11:AGGREGATE [FINALIZE]
| output: sum(l_extendedprice * (1 - l_discount))
| group by: n1.n_name, n2.n_name, year(l_shipdate)
| row-size=58B cardinality=575.77K
|
10:HASH JOIN [INNER JOIN]
| hash predicates: c_nationkey = n2.n_nationkey
| other predicates: ((n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY') OR
(n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE'))
| row-size=132B cardinality=575.77K
|
|--05:SCAN HDFS [tpch.nation n2]
| row-size=21B cardinality=25
|
09:HASH JOIN [INNER JOIN]
| hash predicates: s_nationkey = n1.n_nationkey
| row-size=111B cardinality=575.77K
{noformat}
Here, we have join 09 feeding 576K rows into join 10. All 576K rows pass along
to the aggregate 11. Notice, however, that join 10 has a that picks out 2 of
the 25 countries in each of two paths. The selectivity of the filters should be
something like 2 * 2/25 = 0.16. Thus, the output cardinality of the 10 join
should be 577K * 0.16 = 92K.
The problem is that the join cardinality calculations don't consider join
filter selectivity.
It may be that this was done to handle the outer join case, in which filters
applied in the outer-side scan must be re-applied on the join. Omitting the
filters avoids duplicate accounting for the selectivity.
But, that case is special and should be handled specially as part of
IMPALA-8213. Except for correlated filters, the planner *should* apply join
filter selectivity to the join output cardinality calculations.
> Join cardinality not decreased by join filter selectivity
> ---------------------------------------------------------
>
> Key: IMPALA-8262
> URL: https://issues.apache.org/jira/browse/IMPALA-8262
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
>
> Consider a subset of the plan for TPC-H query 7. (See {{tpch-all.test}} for
> details.)
> {noformat}
> 11:AGGREGATE [FINALIZE]
> | output: sum(l_extendedprice * (1 - l_discount))
> | group by: n1.n_name, n2.n_name, year(l_shipdate)
> | row-size=58B cardinality=575.77K
> |
> 10:HASH JOIN [INNER JOIN]
> | hash predicates: c_nationkey = n2.n_nationkey
> | other predicates: ((n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY') OR
> (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE'))
> | row-size=132B cardinality=575.77K
> |
> |--05:SCAN HDFS [tpch.nation n2]
> | row-size=21B cardinality=25
> |
> 09:HASH JOIN [INNER JOIN]
> | hash predicates: s_nationkey = n1.n_nationkey
> | row-size=111B cardinality=575.77K
> {noformat}
> Here, we have join 09 feeding 576K rows into join 10. All 576K rows pass
> along to the aggregate 11. Notice, however, that join 10 has a that picks out
> 2 of the 25 countries in each of two paths. The selectivity of the filters
> should be something like 2 * 2/25 = 0.16. Thus, the output cardinality of the
> 10 join should be 577K * 0.16 = 92K.
> The problem is that the join cardinality calculations don't consider join
> filter selectivity.
> It may be that this was done to handle the outer join case, in which filters
> applied in the outer-side scan must be re-applied on the join. Omitting the
> filters avoids duplicate accounting for the selectivity.
> But, that case is special and should be handled specially as part of
> IMPALA-8213. Except for correlated filters, the planner *should* apply join
> filter selectivity to the join output cardinality calculations.
> This error has consequences. The filter should reduce the number of rows
> though the join. Because it does so, it should come early in the join tree to
> reduce the set of rows processed. But, because selectivity is ignored, the
> planner does not see the join as a filter, and ends up putting the join 10 at
> the top of the join tree. (See the test file for the full plan.) The result
> is that Impala schleps around many more rows than necessary, only to discard
> them near the top of the DAG.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]