Paul Rogers created IMPALA-8015:
-----------------------------------

             Summary: Incorrect cardinality calculation for non-FK/PK case
                 Key: IMPALA-8015
                 URL: https://issues.apache.org/jira/browse/IMPALA-8015
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 3.1.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers


Please see the background information in IMPALA-8014, including notation. The 
material here is a version of that in [these 
note|https://github.com/paul-rogers/impala/wiki/Planner].

h4. Background

Impala uses two distinct ways to estimate join cardinality: FK/PK and the 
"generic" case. Both are in {{JoinNode.getJoinCardinality()}}.

The generic case handles the M:N case, multiple rows in the left table join 
with multiple rows in the right table. The {{getGenericJoinCardinality()}} 
works out the estimated cardinality.

h4. Deriving the Cardinality

Assume a join of two tables, left ({{l}}) and right ({{r}}), with no predicate. 
We have a Cartesian product:

{noformat}
|l >< r| = |l| * |r|
{noformat}

Suppose we have an equi-join predicate: {{l.a = r.b}} and that we have the NDV 
values for both columns.

Lets assume the left is the probe side. How many rows will it match? It will 
match all the rows with a given value, or {{|r 𝜎 b=x|}}, (which we'll 
abbreviate to {{|r.b=x|}} which is:

{noformat}
|r.b=x| = |r| / ndv(r.b)
{noformat}

Intuitively, the {{r.b}} keys are uniformly distributed, each left-hand row 
matches a group of right-hand rows the size of which is given by the equation 
above. If all values in r are unique, then {{|r| / ndv(r.b)}} is one and we 
match one row. If all values are the same, {{|r| / ndv(r.b) = |r|}}, so we 
match all rows (or none).

The total number of output rows, to a first approximation, is:

{noformat}
|l >< r| = |l| * |r| / ndv(r.b)
{noformat}

Next, how many rows will appear on the left (probe) side? It is the output of 
the previous join, or the scan of the table in that node, thus {{scan(l)}}.

How many rows on the right? Impala requires that the right side be a table, so 
the number is {{|scan(r)|}}. (Actually, either side can be a table scan, but 
the reasoning is symmetrical.) If we randomly sample the {{r}} table to get 
{{scan(r)}}, then the probability of getting any one row from r is:

{noformat}
p(scan(r) : row = x) = |scan(r)| / |r|
{noformat}

(Read this as the probability or a row x appearing in the scan result of r.)

Let's check the math. Using the above, the total number of scanned rows is:

{noformat}
|scan(r)| = |r| * p(scan(r) : row = x)
          = |r| * |scan(r)| / |r|
          = |scan(r)|
{noformat}

The total number of keys in each group in the scanned result is the group size 
adjusted for the probability that any one group member appears:

{noformat}
|scan(r.b)| = |r| / ndv(r.b) * p(r.b)
            = |r| / ndv(r.b) * |scan(r)| / |r|
            = |scan(r)| / ndv(r.b)
{noformat}

Intuitively, however may rows are scanned, they are still divided into the same 
set of groups.

Next we can work out the total join cardinality given the number of rows on the 
left, and the number of matching rows on the right:

{noformat}
|l >< r| = |scan(l)| * |scan(r.b)|
         = |scan(l)| * |scan(r)| / ndv(r.b)
{noformat}

The above assumes that all rows from l match rows in r. But, if there are more 
key values in l than r, some can't match, and visa-versa. What is the 
probability that a row from l will find a match? We can reason it out like this:

* If there are fewer values in l.a than in r.b, we can assume all will match.
* If there are more values in l.a than in r.b, we can assume we'll match all 
values in l.a, then discard the extra values that don't match.

The details are already worked out in IMPALA-8014. The probability is:

{noformat}
p(match) = ndv(l.a) / ndv(r.b) if ndv(l.a) < ndv(r.b),
           ndv(r.b) / ndv(l.a) otherwise.
{noformat}

Intuitively, if there are more l rows than r, then a match is the ratio between 
them and visa-versa.

Let's check. If the ndv's are equal, the probability of a match is 1. If either 
table is empty, the probability is 0. If ndv(l.a) is half that of ndv(l.b) the 
probability is 0.5 and visa versa. All good.

Putting it all together:

{noformat}
|l >< r| = |scan(l)| * |scan(r)| / ndv(r.b) * p(match)
{noformat}

For the case that {{ndv(l.a) > ndv(r.b)}}:

{noformat}
|l >< r| = |scan(l)| * (|scan(r)| / ndv(r.b)) * (ndv(r.b) / ndv(l.a))
         = |scan(l)| * |scan(r)| / ndv(l.a)
         = (|scan(l)| / ndv(l.a)) * |scan(r)|
{noformat}

Think if this as the inverse of the above: every row in r matches the set of 
rows in l with the same key value.

h4. Code Bug

The code uses the following:

{code:java}
      double lhsAdjNdv = slots.lhsNdv();
      if (slots.lhsNumRows() > lhsCard) lhsAdjNdv *= lhsCard / 
slots.lhsNumRows();
      double rhsAdjNdv = slots.rhsNdv();
      if (slots.rhsNumRows() > rhsCard) rhsAdjNdv *= rhsCard / 
slots.rhsNumRows();
      long joinCard = Math.round((lhsCard / Math.max(1, Math.max(lhsAdjNdv, 
rhsAdjNdv))) *
          rhsCard);
{code}

Where:

* {{lhsCard}} is the output cardinality of the l (previous join) table or 
{{|l|}}
* {{rhsCard}} is the output cardinality of the r (scanned) table or 
{{|scan(r)|}}
* {slots.lhsNdv()}} is the NDV of the left key or {{ndv(l.a)}}
* {slots.lnsNumRows()}} is the cardinality of the LHS table, or {{|l|}}
* {slots.rhsNdv()}} is the NDV of the right key or {{ndv(r.a)}}
* {slots.rnsNumRows()}} is the cardinality of the LHS table, or {{|r|}}

Translated to our notation:

{noformat}
adjNdv(l.a) = ndv(l.a) * ( |scan(l)| / |l| if |l| > |scan(l)|, 1 otherwise )
adjNdv(r.b) = ndv(r.b) * ( |scan(r)| / |r| if |r| > |scan(r)|, 1 otherwise )
|join| = |scan(l)| * |scan(r)| / max( adjNdv(l.a), adjNdv(r.b) )
{noformat}

Since {{|scan\(x)| <= |x|}} we can reduce the above to:

{noformat}
adjNdv(l.a) = ndv(l.a) * |scan(l)| / |l|
adjNdv(r.b) = ndv(r.b) * |scan(r)| / |r|
|join| = |scan(l)| * |scan(r)| / max( adjNdv(l.a), adjNdv(r.b) )
{noformat}

For one case:

{noformat}
|join| = |scan(l)| * |scan(r)| / adjNdv(l.a)
       = |scan(l)| * |scan(r)| / (ndv(l.a) * |scan(l)| / |l|)
       = |scan(l)| * |scan(r)| * |l| / (ndv(l.a) * |scan(l)|)
       = |scan(l)| * |scan(r)| * (|l| / |scan(l)|) / ndv(l.a)
{noformat}

Comparing the two results

{noformat}
|l >< r|  = |scan(l)| * |scan(r)| / ndv(l.a)
         != |scan(l)| * |scan(r)| * (|l| / |scan(l)|) / ndv(l.a)
{noformat}

The only way we get the code's answer is if we assume {{|l| / |scan(r)| = 1}} 
which seems unlikely. That is, the code applies an unnecessary multiplication 
factor that increases as the ratio of scanned rows decreases.

So, a bug. 

h4. Worked Example

Let's check.

* {{|l| = 10000}}
* {{|r| = 100}}
* {{scan(l) = 3000}}
* {{scan(r) = 50}}
* {{ndv(l.a) = 40}}
* {{ndv(r.a) = 20}}

Let's work this out intuitively first. The join is the normal cross join with 
adjustments:

{noformat}
|l >< r| = |l| * |r| * adjustments
{noformat}

Adjustments:

* The left side scan returns only 30% of the total rows. 
* The left hand has twice the key values of the right, so we have to discard 
half of the left rows.
* Of the l keys with a match in r, it will match, on average, 100/20 = 5 rows.
* But half of r is filtered out, reducing the available matching rows.

And:

{noformat}
|l >< r| = (|l| / 2) * (|r| / 20 / 2)
         = 10000 * 30% / 2 * 100 * / 20 / 2
         = 1500 * 2.5
         = 3750
{noformat}

Using the formula derived above:

{noformat}
p(match) = ndv(r.b) / ndv(l.a) if ndv(r.b) < ndv(l.a),
           ndv(l.a) / ndv(r.b) otherwise
|l >< r| = |scan(l)| * |scan(r)| * (|r| / ndv(r.b)) * p(match)
         = |scan(l)| * |scan(r)| * (|r| / ndv(r.b)) * (ndv(r.b) / ndv(l.a))
         = |scan(l)| * |scan(r)| * |r| / ndv(l.a)
         = 3000 * 50 * 100 / 40
         = 3750
{noformat}

Which all works out.

Now let's check the code's formula:

{noformat}
|l >< r| = |scan(l)| * |scan(r)| * (|l| / |scan(l)|) / ndv(l.a)
         = (|scan(l)| / |scan(l)|) * |scan(r)| * |l| / ndv(l.a)
         = |scan(r)| * |l| / ndv(l.a)
         = 50 * 10,000 / 50
         = 10,000
{noformat}

Which does not quite work out, verifying the bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to