Paul Rogers created IMPALA-8015:
-----------------------------------
Summary: Incorrect cardinality calculation for non-FK/PK case
Key: IMPALA-8015
URL: https://issues.apache.org/jira/browse/IMPALA-8015
Project: IMPALA
Issue Type: Bug
Components: Frontend
Affects Versions: Impala 3.1.0
Reporter: Paul Rogers
Assignee: Paul Rogers
Please see the background information in IMPALA-8014, including notation. The
material here is a version of that in [these
note|https://github.com/paul-rogers/impala/wiki/Planner].
h4. Background
Impala uses two distinct ways to estimate join cardinality: FK/PK and the
"generic" case. Both are in {{JoinNode.getJoinCardinality()}}.
The generic case handles the M:N case, multiple rows in the left table join
with multiple rows in the right table. The {{getGenericJoinCardinality()}}
works out the estimated cardinality.
h4. Deriving the Cardinality
Assume a join of two tables, left ({{l}}) and right ({{r}}), with no predicate.
We have a Cartesian product:
{noformat}
|l >< r| = |l| * |r|
{noformat}
Suppose we have an equi-join predicate: {{l.a = r.b}} and that we have the NDV
values for both columns.
Lets assume the left is the probe side. How many rows will it match? It will
match all the rows with a given value, or {{|r 𝜎 b=x|}}, (which we'll
abbreviate to {{|r.b=x|}} which is:
{noformat}
|r.b=x| = |r| / ndv(r.b)
{noformat}
Intuitively, the {{r.b}} keys are uniformly distributed, each left-hand row
matches a group of right-hand rows the size of which is given by the equation
above. If all values in r are unique, then {{|r| / ndv(r.b)}} is one and we
match one row. If all values are the same, {{|r| / ndv(r.b) = |r|}}, so we
match all rows (or none).
The total number of output rows, to a first approximation, is:
{noformat}
|l >< r| = |l| * |r| / ndv(r.b)
{noformat}
Next, how many rows will appear on the left (probe) side? It is the output of
the previous join, or the scan of the table in that node, thus {{scan(l)}}.
How many rows on the right? Impala requires that the right side be a table, so
the number is {{|scan(r)|}}. (Actually, either side can be a table scan, but
the reasoning is symmetrical.) If we randomly sample the {{r}} table to get
{{scan(r)}}, then the probability of getting any one row from r is:
{noformat}
p(scan(r) : row = x) = |scan(r)| / |r|
{noformat}
(Read this as the probability or a row x appearing in the scan result of r.)
Let's check the math. Using the above, the total number of scanned rows is:
{noformat}
|scan(r)| = |r| * p(scan(r) : row = x)
= |r| * |scan(r)| / |r|
= |scan(r)|
{noformat}
The total number of keys in each group in the scanned result is the group size
adjusted for the probability that any one group member appears:
{noformat}
|scan(r.b)| = |r| / ndv(r.b) * p(r.b)
= |r| / ndv(r.b) * |scan(r)| / |r|
= |scan(r)| / ndv(r.b)
{noformat}
Intuitively, however may rows are scanned, they are still divided into the same
set of groups.
Next we can work out the total join cardinality given the number of rows on the
left, and the number of matching rows on the right:
{noformat}
|l >< r| = |scan(l)| * |scan(r.b)|
= |scan(l)| * |scan(r)| / ndv(r.b)
{noformat}
The above assumes that all rows from l match rows in r. But, if there are more
key values in l than r, some can't match, and visa-versa. What is the
probability that a row from l will find a match? We can reason it out like this:
* If there are fewer values in l.a than in r.b, we can assume all will match.
* If there are more values in l.a than in r.b, we can assume we'll match all
values in l.a, then discard the extra values that don't match.
The details are already worked out in IMPALA-8014. The probability is:
{noformat}
p(match) = ndv(l.a) / ndv(r.b) if ndv(l.a) < ndv(r.b),
ndv(r.b) / ndv(l.a) otherwise.
{noformat}
Intuitively, if there are more l rows than r, then a match is the ratio between
them and visa-versa.
Let's check. If the ndv's are equal, the probability of a match is 1. If either
table is empty, the probability is 0. If ndv(l.a) is half that of ndv(l.b) the
probability is 0.5 and visa versa. All good.
Putting it all together:
{noformat}
|l >< r| = |scan(l)| * |scan(r)| / ndv(r.b) * p(match)
{noformat}
For the case that {{ndv(l.a) > ndv(r.b)}}:
{noformat}
|l >< r| = |scan(l)| * (|scan(r)| / ndv(r.b)) * (ndv(r.b) / ndv(l.a))
= |scan(l)| * |scan(r)| / ndv(l.a)
= (|scan(l)| / ndv(l.a)) * |scan(r)|
{noformat}
Think if this as the inverse of the above: every row in r matches the set of
rows in l with the same key value.
h4. Code Bug
The code uses the following:
{code:java}
double lhsAdjNdv = slots.lhsNdv();
if (slots.lhsNumRows() > lhsCard) lhsAdjNdv *= lhsCard /
slots.lhsNumRows();
double rhsAdjNdv = slots.rhsNdv();
if (slots.rhsNumRows() > rhsCard) rhsAdjNdv *= rhsCard /
slots.rhsNumRows();
long joinCard = Math.round((lhsCard / Math.max(1, Math.max(lhsAdjNdv,
rhsAdjNdv))) *
rhsCard);
{code}
Where:
* {{lhsCard}} is the output cardinality of the l (previous join) table or
{{|l|}}
* {{rhsCard}} is the output cardinality of the r (scanned) table or
{{|scan(r)|}}
* {slots.lhsNdv()}} is the NDV of the left key or {{ndv(l.a)}}
* {slots.lnsNumRows()}} is the cardinality of the LHS table, or {{|l|}}
* {slots.rhsNdv()}} is the NDV of the right key or {{ndv(r.a)}}
* {slots.rnsNumRows()}} is the cardinality of the LHS table, or {{|r|}}
Translated to our notation:
{noformat}
adjNdv(l.a) = ndv(l.a) * ( |scan(l)| / |l| if |l| > |scan(l)|, 1 otherwise )
adjNdv(r.b) = ndv(r.b) * ( |scan(r)| / |r| if |r| > |scan(r)|, 1 otherwise )
|join| = |scan(l)| * |scan(r)| / max( adjNdv(l.a), adjNdv(r.b) )
{noformat}
Since {{|scan\(x)| <= |x|}} we can reduce the above to:
{noformat}
adjNdv(l.a) = ndv(l.a) * |scan(l)| / |l|
adjNdv(r.b) = ndv(r.b) * |scan(r)| / |r|
|join| = |scan(l)| * |scan(r)| / max( adjNdv(l.a), adjNdv(r.b) )
{noformat}
For one case:
{noformat}
|join| = |scan(l)| * |scan(r)| / adjNdv(l.a)
= |scan(l)| * |scan(r)| / (ndv(l.a) * |scan(l)| / |l|)
= |scan(l)| * |scan(r)| * |l| / (ndv(l.a) * |scan(l)|)
= |scan(l)| * |scan(r)| * (|l| / |scan(l)|) / ndv(l.a)
{noformat}
Comparing the two results
{noformat}
|l >< r| = |scan(l)| * |scan(r)| / ndv(l.a)
!= |scan(l)| * |scan(r)| * (|l| / |scan(l)|) / ndv(l.a)
{noformat}
The only way we get the code's answer is if we assume {{|l| / |scan(r)| = 1}}
which seems unlikely. That is, the code applies an unnecessary multiplication
factor that increases as the ratio of scanned rows decreases.
So, a bug.
h4. Worked Example
Let's check.
* {{|l| = 10000}}
* {{|r| = 100}}
* {{scan(l) = 3000}}
* {{scan(r) = 50}}
* {{ndv(l.a) = 40}}
* {{ndv(r.a) = 20}}
Let's work this out intuitively first. The join is the normal cross join with
adjustments:
{noformat}
|l >< r| = |l| * |r| * adjustments
{noformat}
Adjustments:
* The left side scan returns only 30% of the total rows.
* The left hand has twice the key values of the right, so we have to discard
half of the left rows.
* Of the l keys with a match in r, it will match, on average, 100/20 = 5 rows.
* But half of r is filtered out, reducing the available matching rows.
And:
{noformat}
|l >< r| = (|l| / 2) * (|r| / 20 / 2)
= 10000 * 30% / 2 * 100 * / 20 / 2
= 1500 * 2.5
= 3750
{noformat}
Using the formula derived above:
{noformat}
p(match) = ndv(r.b) / ndv(l.a) if ndv(r.b) < ndv(l.a),
ndv(l.a) / ndv(r.b) otherwise
|l >< r| = |scan(l)| * |scan(r)| * (|r| / ndv(r.b)) * p(match)
= |scan(l)| * |scan(r)| * (|r| / ndv(r.b)) * (ndv(r.b) / ndv(l.a))
= |scan(l)| * |scan(r)| * |r| / ndv(l.a)
= 3000 * 50 * 100 / 40
= 3750
{noformat}
Which all works out.
Now let's check the code's formula:
{noformat}
|l >< r| = |scan(l)| * |scan(r)| * (|l| / |scan(l)|) / ndv(l.a)
= (|scan(l)| / |scan(l)|) * |scan(r)| * |l| / ndv(l.a)
= |scan(r)| * |l| / ndv(l.a)
= 50 * 10,000 / 50
= 10,000
{noformat}
Which does not quite work out, verifying the bug.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]