Paul Rogers created IMPALA-8217:
-----------------------------------

             Summary: Remove scan operator exponential backoff in cardinality 
calcs
                 Key: IMPALA-8217
                 URL: https://issues.apache.org/jira/browse/IMPALA-8217
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 3.1.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers


The planner's scan nodes use an exponential back-off to compute the joint 
filter selectivity:

{code:java}
    double result = 1.0;
    for (int i = 0; i < selectivities.size(); ++i) {
      // Exponential backoff for each selectivity multiplied into the final 
result.
      result *= Math.pow(selectivities.get(i), 1.0 / (double) (i + 1));
    }
{code}

The reasoning is:

{code:java}
  /*
   * 2. Two selectivities, whether known or unknown, could be correlated. 
Assuming
   *    independence can lead to significant underestimation.
   *
   * The second issue is addressed by an exponential backoff when multiplying 
each
   * additional selectivity into the final result.
   */
{code}

The logic makes some kind of sense, especially since Impala calculates 
selectivity only for one kind of predicate: {{col = const}}, and ignores 
selectivity for all others. (Actually, replaces all others with a single 
combined selectivity estimate of 0.1.}}

The problem, however, is that this violates the commutative and linear models 
that underly relational theory. In general, relational theory says it does not 
matter if we join then filter, or filter, then join. If we have two filters, f1 
and f2, it does not matter if we apply f1 in the scan and f2 after the join: 
the resulting result set is the same. (Performance differs, but we ignore that 
when computing cardinality.)

With exponential back-off, this no longer applies. It becomes difficult, say, 
to address the issue in IMPALA-XXXX in which we need to apply filter 
selectivity at the scan level, then remove it again at the join level. The 
reason is that the cardinality applied at the scan level may have been backed 
off so simply using the expression's report of its cardinality may not be the 
value used by the scan.

Also, oddly, the same expression could apply different selectivity in different 
scans. The code sorts selectivities in ascending order. So, in scan 1, filter 
f1 might apply its full selectivity, while in scan 2 (assuming a correlated 
filter), it might apply the square or cube of the selectivity.

h4. Solution

The solution here is simply to:

* Compute the selectivity for all predicates so we don't have to fudge.
* Remove correlated predicates from join calcs (See IMPALA-XXXX).
* Remove exponential back-off calculations so that filter selectivity can be 
properly decomposed for the above task.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to