[
https://issues.apache.org/jira/browse/IMPALA-8262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zoltán Borók-Nagy updated IMPALA-8262:
--------------------------------------
Target Version: Impala 5.0.0
> Join cardinality not decreased by join filter selectivity
> ---------------------------------------------------------
>
> Key: IMPALA-8262
> URL: https://issues.apache.org/jira/browse/IMPALA-8262
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Priority: Major
>
> Consider a subset of the plan for TPC-H query 7. (See {{tpch-all.test}} for
> details.)
> {noformat}
> 11:AGGREGATE [FINALIZE]
> | output: sum(l_extendedprice * (1 - l_discount))
> | group by: n1.n_name, n2.n_name, year(l_shipdate)
> | row-size=58B cardinality=575.77K
> |
> 10:HASH JOIN [INNER JOIN]
> | hash predicates: c_nationkey = n2.n_nationkey
> | other predicates: ((n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY') OR (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE'))
> | row-size=132B cardinality=575.77K
> |
> |--05:SCAN HDFS [tpch.nation n2]
> | row-size=21B cardinality=25
> |
> 09:HASH JOIN [INNER JOIN]
> | hash predicates: s_nationkey = n1.n_nationkey
> | row-size=111B cardinality=575.77K
> {noformat}
> Here, join 09 feeds 576K rows into join 10, and all 576K rows pass along to
> aggregate 11. Notice, however, that join 10 has a filter (its "other
> predicates") that picks out 2 of the 25 countries in each of two paths. The
> selectivity of the filter should be something like 2 * 2/25 = 0.16. Thus, the
> output cardinality of join 10 should be about 576K * 0.16 = 92K.
> The problem is that the join cardinality calculations don't consider join
> filter selectivity.
> It may be that this was done to handle the outer join case, in which filters
> applied in the outer-side scan must be re-applied on the join. Omitting the
> filters avoids duplicate accounting for the selectivity.
> But, that case is special and should be handled specially as part of
> IMPALA-8213. Except for correlated filters, the planner *should* apply join
> filter selectivity to the join output cardinality calculations.
> This error has consequences. The filter should reduce the number of rows
> flowing through the join. Because it does so, it should come early in the join tree to
> reduce the set of rows processed. But, because selectivity is ignored, the
> planner does not see the join as a filter, and ends up putting the join 10 at
> the top of the join tree. (See the test file for the full plan.) The result
> is that Impala schleps around many more rows than necessary, only to discard
> them near the top of the DAG.
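
The rough estimate in the description can be checked with a few lines of Python. This is only a sketch of the description's arithmetic, not Impala's planner code: the function name `estimate_join_output` and the constants are hypothetical, and the input row count 575,770 is read off the plan's `cardinality=575.77K`.

```python
def estimate_join_output(input_cardinality: float, selectivity: float) -> int:
    """Scale a join's input cardinality by the selectivity of its filter.

    Hypothetical helper for illustration; not part of Impala.
    """
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must be between 0 and 1")
    return round(input_cardinality * selectivity)


# Figures taken from the description: 2 of 25 nations match, in each of two paths.
NATION_ROWS = 25
MATCHING_NATIONS = 2
PATHS = 2

# Rough selectivity as estimated in the description: 2 * 2/25.
selectivity = PATHS * MATCHING_NATIONS / NATION_ROWS

# Rows feeding join 10, from cardinality=575.77K on join 09.
input_rows = 575_770

output_rows = estimate_join_output(input_rows, selectivity)
print(f"selectivity={selectivity:.2f}, estimated join 10 output={output_rows}")
```

Under these assumptions the estimate lands at roughly 92K rows, versus the 575.77K the plan currently reports for join 10.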
--
This message was sent by Atlassian Jira
(v8.20.10#820010)