[
https://issues.apache.org/jira/browse/IMPALA-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers reassigned IMPALA-8015:
-----------------------------------
Assignee: (was: Paul Rogers)
> Incorrect cardinality calculation for the generic case
> ------------------------------------------------------
>
> Key: IMPALA-8015
> URL: https://issues.apache.org/jira/browse/IMPALA-8015
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Priority: Major
>
> The expression used to calculate the cardinality for a M:N (“generic”) join
> is incorrect. Scroll to the end for the proposed fix.
> The standard calculation for joins is explained in Swami & Schiefer, [On the
> Estimation of Join Result
> Sizes|https://pdfs.semanticscholar.org/2735/672262940a23cfce8d4cc1e25c0191de7da7.pdf]
> (S&S). Note especially section 3, Background. Also see [How Good Are Query
> Optimizers, Really?|http://www.vldb.org/pvldb/vol9/p204-leis.pdf] by Leis et
> al.
> h4. Current Implementation
> The code uses the following:
> {code:java}
> long result = -1;
> for (EqJoinConjunctScanSlots slots: eqJoinConjunctSlots) {
> double lhsAdjNdv = slots.lhsNdv();
> if (slots.lhsNumRows() > lhsCard) lhsAdjNdv *= lhsCard /
> slots.lhsNumRows();
> double rhsAdjNdv = slots.rhsNdv();
> if (slots.rhsNumRows() > rhsCard) rhsAdjNdv *= rhsCard /
> slots.rhsNumRows();
> long joinCard = Math.round((lhsCard / Math.max(1, Math.max(lhsAdjNdv,
> rhsAdjNdv))) *
> rhsCard);
> if (result == -1) {
> result = joinCard;
> } else {
> result = Math.min(result, joinCard);
> }
> }
> {code}
> The above is hard to follow, which is part of the problem. Basically, the
> math for each key is:
> {noformat}
> |L.ki'| = min( |L.ki| * |L'| / |L|, |L.ki| )
> = |L.ki| * min( |L'| / |L|, 1 )
> {noformat}
> This says that we reduce the left-hand key by the selectivity of the
> left-hand filter, but we always have at least one key. The logic for the
> right side is the same.
> There is a subtle error: the compound key as a whole is reduced by the
> selectivity of the filter. But, we can't reduce each key by that factor or we
> end up reducing the compound key by the selectivity to the power of the key
> count. This is not a bug here because of the wrong way we compute the overall
> join cardinaly shown below.
> The last line is correct as it reduces to:
> {noformat}
> |join| = |L'| * |R'| / max( |L.ki'|, |R.ki'| )
> {noformat}
> Where:
> * {{L}} is the relation on the left side of the join, {{R}} the right.
> * {{L'}} is the result of applying scan filters to table {{L}}. Similarly for
> {{R'}}.
> * {{|T|}} is the cardinality (number of rows) in table {{T}}.
> * {{L.k}} is the (possibly compound) key column for the left table. Similarly
> for {{R.k}}.
> * {{L.ki}} is the ith column within a compound key.
> The code is also wrong in how it combines the keys to get a final answer. The
> code produces correct answers for a single key, but incorrect answers for
> compound keys. The code says that the final join cardinality is:
> {noformat}
> |join| = min( |L'| * |R'| / max( |L.ki'|, |R.ki'| ) )
> = |L'| * |R'| / max i=1 ...( max( |L.ki'|, |R.ki'| ) )
> {noformat}
> That is, the join cardinality is given by the largest component of the
> compound key. The result of this bug that the code estimate will use NDVs
> that are too small, producing estimates which are too large, which throws off
> join selection and could degrade query performance if it leads to an
> inefficient plan.
> h4. Proposed Solution
> The correct math, worked out in IMPALA-8014, is:
> {noformat}
> ∏ |T.ki|
> |T.k'| = |T'| * min( ----------, 1 )
> |T|
> |L’| * |R’|
> |L’ ⋈ R’| = -----------------------
> max( |L.k'|, |R.k'| )
> {noformat}
> The above says that the key cardinality determined by the cardinality of the
> input relation after filtering. This is either used as is (if the compound
> key cardinality is greater than the table cardinality), or scaled by the
> ratio of compound key cardinality to table cardinality.
> Notice the subtle differences between the proposed and current solutions.
> The solution can be improved.
> * IMPALA-8018 observes that the resulting code is the same for the "FK/PK"
> and "generic" cases and proposes to unify the two.
> * IMPALA-8213 proposes a solution when expressions are correlated between the
> left and right hand inputs.
> * IMPALA-8218 says we should use a "simple urn model" to calculate the key
> estimate, not the simple linear model used above and in the code.
> * IMPALA-XXXX notes the complexity of computing the filtered table
> cardinality {{T'}} on the left side when the left side is join.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]