Paul Rogers created IMPALA-8217:
-----------------------------------
Summary: Remove scan operator exponential backoff in cardinality
calcs
Key: IMPALA-8217
URL: https://issues.apache.org/jira/browse/IMPALA-8217
Project: IMPALA
Issue Type: Improvement
Components: Frontend
Affects Versions: Impala 3.1.0
Reporter: Paul Rogers
Assignee: Paul Rogers
The planner's scan nodes use an exponential back-off to compute the joint
filter selectivity:
{code:java}
double result = 1.0;
for (int i = 0; i < selectivities.size(); ++i) {
// Exponential backoff for each selectivity multiplied into the final
result.
result *= Math.pow(selectivities.get(i), 1.0 / (double) (i + 1));
}
{code}
The reasoning is:
{code:java}
/*
* 2. Two selectivities, whether known or unknown, could be correlated.
Assuming
* independence can lead to significant underestimation.
*
* The second issue is addressed by an exponential backoff when multiplying
each
* additional selectivity into the final result.
*/
{code}
The logic makes some kind of sense, especially since Impala calculates
selectivity only for one kind of predicate: {{col = const}}, and ignores
selectivity for all others. (Actually, replaces all others with a single
combined selectivity estimate of 0.1.}}
The problem, however, is that this violates the commutative and linear models
that underly relational theory. In general, relational theory says it does not
matter if we join then filter, or filter, then join. If we have two filters, f1
and f2, it does not matter if we apply f1 in the scan and f2 after the join:
the resulting result set is the same. (Performance differs, but we ignore that
when computing cardinality.)
With exponential back-off, this no longer applies. It becomes difficult, say,
to address the issue in IMPALA-XXXX in which we need to apply filter
selectivity at the scan level, then remove it again at the join level. The
reason is that the cardinality applied at the scan level may have been backed
off so simply using the expression's report of its cardinality may not be the
value used by the scan.
Also, oddly, the same expression could apply different selectivity in different
scans. The code sorts selectivities in ascending order. So, in scan 1, filter
f1 might apply its full selectivity, while in scan 2 (assuming a correlated
filter), it might apply the square or cube of the selectivity.
h4. Solution
The solution here is simply to:
* Compute the selectivity for all predicates so we don't have to fudge.
* Remove correlated predicates from join calcs (See IMPALA-XXXX).
* Remove exponential back-off calculations so that filter selectivity can be
properly decomposed for the above task.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)