[
https://issues.apache.org/jira/browse/IMPALA-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers reassigned IMPALA-8217:
-----------------------------------
Assignee: (was: Paul Rogers)
> Remove scan operator exponential backoff in cardinality calcs
> -------------------------------------------------------------
>
> Key: IMPALA-8217
> URL: https://issues.apache.org/jira/browse/IMPALA-8217
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Priority: Major
>
> The planner's scan nodes use an exponential back-off to compute the joint
> filter selectivity:
> {code:java}
> double result = 1.0;
> for (int i = 0; i < selectivities.size(); ++i) {
> // Exponential backoff for each selectivity multiplied into the final
> result.
> result *= Math.pow(selectivities.get(i), 1.0 / (double) (i + 1));
> }
> {code}
> The reasoning is:
> {code:java}
> /*
> * 2. Two selectivities, whether known or unknown, could be correlated.
> Assuming
> * independence can lead to significant underestimation.
> *
> * The second issue is addressed by an exponential backoff when multiplying
> each
> * additional selectivity into the final result.
> */
> {code}
> The logic makes some kind of sense, especially since Impala calculates
> selectivity only for one kind of predicate: {{col = const}}, and ignores
> selectivity for all others. (Actually, replaces all others with a single
> combined selectivity estimate of 0.1.}}
> The problem, however, is that this violates the commutative and linear models
> that underly relational theory. In general, relational theory says it does
> not matter if we join then filter, or filter, then join. If we have two
> filters, f1 and f2, it does not matter if we apply f1 in the scan and f2
> after the join: the resulting result set is the same. (Performance differs,
> but we ignore that when computing cardinality.)
> With exponential back-off, this no longer applies. It becomes difficult, say,
> to address the issue in IMPALA-XXXX in which we need to apply filter
> selectivity at the scan level, then remove it again at the join level. The
> reason is that the cardinality applied at the scan level may have been backed
> off so simply using the expression's report of its cardinality may not be the
> value used by the scan.
> Also, oddly, the same expression could apply different selectivity in
> different scans. The code sorts selectivities in ascending order. So, in scan
> 1, filter f1 might apply its full selectivity, while in scan 2 (assuming a
> correlated filter), it might apply the square or cube of the selectivity.
> h4. Solution
> The solution here is simply to:
> * Compute the selectivity for all predicates so we don't have to fudge.
> * Remove correlated predicates from join calcs (See IMPALA-XXXX).
> * Remove exponential back-off calculations so that filter selectivity can be
> properly decomposed for the above task.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]