[
https://issues.apache.org/jira/browse/IMPALA-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers updated IMPALA-8014:
--------------------------------
Description:
The math is wrong when we estimate the cardinality of a join we've labeled as
"FK/PK" (commonly known as many-to-one or M:1.)
h4. Join Estimation
The standard calculation for joins is explained in Swami & Schiefer, [On the
Estimation of Join Result
Sizes|https://pdfs.semanticscholar.org/2735/672262940a23cfce8d4cc1e25c0191de7da7.pdf]
(S&S). Note especially section 3, Background. Also see [How Good Are Query
Optimizers, Really?|http://www.vldb.org/pvldb/vol9/p204-leis.pdf] by Leis et al.
{noformat}
|L'| * |R'|
|L' ⋈ R'| = ------------------
max(|L.k'|, |R.k'|)
{noformat}
Here:
* {{|x|}} is the row cardinality if x is a relation, the NDV (domain
cardinality) if x is a column.
* {{L}}, {{R}} are left and right input relations.
* {{L.k}} and {{R.k}} are the (possibly compound) join keys.
* {{x’}} is the result of applying a filter to relation or column {{x}}.
Intuitively, the cardinality of the join is the Cartesian product reduced by
the largest column domain. Since the relations fed into the join are filtered,
we are concerned with the new, filtered relations created by the scan nodes.
Impala, like most planners, makes three basic assumptions (see the first paper
above):
* Uniformity: keys are evenly distributed.
* Independence: the filtering on the two tables is independent of keys.
* Containment: that all detail (FK) rows have a corresponding master (FK) row.
(These are traditional assumptions, but they turn out to be unrealistic in many
cases; see the second paper above.)
h4. M:1 Case
The planner (unnecessarily) divides join planning into two cases. (IMPALA-8018
describes how the two steps are unnecessary. But, to minimize code change, we
live with this process here.)
* M:1 (AKA "FK/PK", "detail/master", "fact/dimension"). A "foreign key" (FK) in
the detail table matches at most one "primary key" (PK) in the master table.
("At most one" because of filtering which may have removed master keys.) Impala
assumes that the detail ("FK") table is on the left (probe) side, and the
master ("PK") table is on the right ("build") side.
* M:N (AKA "generic", "many-to-many"). Every key on the left matches
potentially many keys on the right.
If we focus on just the M:1 case we can rename the relations for convenience
and observe a simplification:
{noformat}
L = D
R = M
L.k = M.pk
R.k = D.fk
{noformat}
By definition of M:1:
{noformat}
|M.pk'| = |M'|
{noformat}
Which gives a revised join expression:
{noformat}
|D'| * |M'|
|D' ⋈ M'| = ------------------
max(|D.fk'|, |M'|)
{noformat}
The expression above is handy: it shows that we only need three values to
compute the join cardinality:
* {{|D'|}} the output cardinality from the left plan node, which is already
available.
* {{|M'|}} the output cardinality from the right scan node, which is already
available. (In Impala, the right node will always be a table scan.)
* {{|D.fk'|}} the number of foreign key values in the filtered {{D'}} relation.
(In relational theory terms, the cardinality of the domain of the foreign key
after applying pushed-down selection operations.)
See IMPALA-XXXX for a necessary adjustment to {{|M'|}} to handle predicates
common to both sides.
We only need to estimate {{|D.fk'|}}. We will work up to this step by step
because the intermediate steps help explain the bug in the current code.
h4. Filtering of Master Table Only
Let's start with the simplest case, filtering on only the master table:
{noformat}
|D.fk'| = |D.fk| <= |M.pk|
max(|D.fk'|, |M'|) = max(|M.pk|, |M'|) = |M.pk| = |M|
{noformat}
So:
{noformat}
|D ⋈ M'| = |D| * |M'| / |M|
{noformat}
Intuitively: the probability of any detail row finding a match is simply the
selectivity of the master filter, or {{|M'| / |M|}}.
h4. Filtering on the Foreign Key Column
The next case is also fairly easy. Suppose we know that the left (detail) scan
applied a filter. Impala presently uses an incorrect, but simplified, model for
this case. (See IMPALA-XXXX for the correct model.)
The cardinality of the foreign key is simply the result of applying that filter:
{noformat}
|D.fk'| = |D.fk| * sel(f)
{noformat}
Let us assume we do not filter the master table in this case, so:
{noformat}
|D.fk'| < |D.fk| <= |M.pk|
max(|D.fk'|, |M|) = |M|
{noformat}
And:
{noformat}
|D' ⋈ M| = |D'| * |M| / |M| = |D'|
{noformat}
We assume "containment" from the S&S paper above: all the foreign keys have a
matching primary key if {{|F.fk| <= |M.pk|}}. So, if we filter only on the
detail table, all foreign keys find a match and the cardinality of the join is
just the cardinality of the left input relation.
h4. Combined Left and Right Filtering
Now, let's combine the master and detail filtering cases. In general, we will
have filtering on both sides. The {{max}} in the expression automatically
combines the cases:
{noformat}
|D.fk'| = |D.fk| * |D'| / |D|
|D'| * |M'|
|D' ⋈ M'| = ------------------
max(|D.fk'|, |M'|)
{noformat}
Intuitively, the number of rows is the Cartesian product divided by the larger
of the number of keys in either table after the scan. Said another way, each
detail row finds a master, unless filtering has removed so many masters that
some detail rows find no match, in which case the probability of a match is
{{|M'| / |D.fk'|}}.
h4. Compound Primary Keys
The final complexity is to consider a compound key. That is:
{noformat}
(D.fk1, Dfk2) --> (M.pk1, M.pk2)
{noformat}
The foreign key pair points to a matching primary key pair. Here we consider
only pairs, but the logic is the same for a compound key with any number of
columns. Obviously, by the definition of a join in SQL, the number of keys on
each side must be the same.
If we know (from HMS metadata) that the above pairs are, in fact, the keys,
then we can make a simplifying assumption:
{noformat}
|(M.pk1, M.pk2)| = |M|
|(D.fk1, Dfk2)| <= |(M.pk1, M.pk2)| = |M|
{noformat}
If we don't know a-priori that a pair (k1, k1) is a foreign or primary key,
then we can estimate its cardinality (assuming independence of values) as:
{noformat}
|(k1, k2)| = |k1| * |k2|
{noformat}
In the M:1 case, the primary key cardinality must be the same as table
cardinality. If the two columns are completely independent, then:
{noformat}
|(M.pk1, M.pk2)| = |M.pk1| * |M.pk2| = |M|
{noformat}
More typically, there is some correlation between the columns so:
{noformat}
|(M.pk1, M.pk2)| = |M| <= |M.pk1| * |M.pk2|
{noformat}
Indeed, Impala uses the above relation to decide we have the FK/PK case. Said
another way, if Impala follows the FK/PK logic, then we can simply assume that
{{|M.pk| = |M|}} even if the key is compound.
h4. Compound Foreign Keys
Foreign keys require a bit more thought. We could have a detail table with only
one row. If we have many detail rows, the assumption of containment says that
each detail record points to some master record, so:
{noformat}
|D.fk| <= |M.pk|
{noformat}
This lets us estimate the cardinality of a compound foreign key as:
{noformat}
|(D.fk1, D.fk2)| = min( |D.fk1| * |D.fk2|, |M| )
{noformat}
That is, if the detail table is small, the left term is a reasonable estimate
(ignoring the urn model issue). But, as the table gets larger, and begins to
include most primary keys, we know that the number of foreign keys can't be
larger than the number of primary keys, so the right term is the better
estimate.
h4. Filtering on Compound Keys
Suppose that the join inputs have filtering applied to the left (detail) table.
We discussed how to handle this fo a single column. For a compound key, we can
observe that the cardinality of the key is the product of the cardinality of
the columns, but (using the containment assumption), no larger than the
cardinality of the primary key (which is the cardinality of the master table.)
So:
{noformat}
|D.fk| = min( ∏ |D.fki|, |M| )
|D.fk'| = |D.fk| * |D'| / |D|
= min( ∏ |D.fki|, |M| ) * |D'| / |D|
{noformat}
So the final expression for join cardinality is:
{noformat}
|D.fk'| = min( ∏ |D.fki|, |M| ) * |D'| / |D|
|D'| * |M'|
|D' ⋈ M'| = -------------------
max( |D.fk'|, |M'|)
{noformat}
The first expression says that the compound foreign key cardinality is either
the product of the columns that make up the key, or the cardinality of the
primary key, whichever is less. We then adjust that amount (incorrectly) by the
percentage of the detail table that is scanned.
The second expression is just the Cartesian product divided by the largest key
cardinality: either foreign key or primary key (which is, by definition, equal
to the cardinality of the master table.)
h4. Complexity: Compound Joins
The above expression works well if the left input to a join is a base table row
which we know the original table cardinality that corresponds to the original
column NDV. The above can be used as-is in a M:N ("generic") join for a
right-side table. But, when when used on the left side, we must recall that
Impala builds left-deep join plans, so the left side may be a join. In this
case, there is no original left-side base table.
See IMPALA-XXXX for discussion of the complexities in this case. A quick and
dirty solution is to use the scan output cardinality in place of the left input
cardinality. That is:
* Determine the table that contains the join key.
* Search the left subtree for the scan for that table.
To do this, start with the left input:
* If the node is a scan node, and the scan is for the target table, return the
output cardinality of that scan.
* If the node is a join, then apply the search to *both* sides of the join.
This approach is cumbersome, and may run into complexities if the node is
something other than a scan or join. It also may underestimate if a filter is
applied at the join level.
A cleaner, tough more involved, solution is to track adjusted NDV for each
column through each operator as described in IMPALA-8220.
h4. Code Bug
The commit "IMPALA-5547: Rework FK/PK join detection", ID
[{{9f678a74269250bf5c7ae2c5e8afd93c5b3734de}}|https://github.com/apache/impala/commit/9f678a74269250bf5c7ae2c5e8afd93c5b3734de#diff-b10ccc2cdf68b236be400cde1e858a7c]
on Jun 6, 2017 reworked the FK/PK logic. It has one flaw: after determining
that we have an FK/PK (M:1) case, it then attempts to adjust the compound FK
and PK columns. This has two problems:
* It is unnecessary, as we saw above.
* The math is wrong and produces bogus estimates.
Here is the code in {{JoinNode.java}}:
{code:java}
long result = -1;
for (EqJoinConjunctScanSlots slots: eqJoinConjunctSlots) {
// Adjust the join selectivity based on the NDV ratio to avoid
underestimating
// the cardinality if the PK side has a higher NDV than the FK side.
double ndvRatio = 1.0;
if (slots.lhsNdv() > 0) ndvRatio = slots.rhsNdv() / slots.lhsNdv();
double rhsSelectivity = Double.MIN_VALUE;
if (slots.rhsNumRows() > 0) rhsSelectivity = rhsCard / slots.rhsNumRows();
long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
if (result == -1) {
result = joinCard;
} else {
result = Math.min(result, joinCard);
}
}
// FK/PK join cardinality must be <= the lhs cardinality.
result = Math.min(result, lhsCard);
{code}
The above is hard to follow, which may account for why the bug was not caught.
Ignoring some corner cases, the logic is essentially:
{noformat}
lhsCard = |D'|
ndvRatioi = |D.fki| / |M.pki|
rhsSelectivityi = |M'| / |M|
joinCardi = lhsCard * ndvRatioi * rhsSelectivityi
= |D'| * (|D.fki| / |M.pki|) * |M'| / |M|
|join| = min(joinCardi)
= (|D'| * |M'| / |M|) * min(|D.fki| / |M.pki|)
{noformat}
Though hard to see, the above is not equivalent to the logic worked out in the
previous section. Using the correct expression from earlier sections:
{noformat}
|D'| * |M'|
|D' ⋈ M'| = ----------------------------------------------
max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)
{noformat}
We can factor out the common {{|D'| * |M'|}} terms and compare:
{noformat}
min(|D.fki| / |M.pki|) 1
---------------------- != ----------------------------------------------
|M| max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)
1 1
---------------------------- != ----------------------------------------------
|M| * max(|M.pki| / |D.fki|) max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)
|M| * max(|M.pki| / |D.fki|) != max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)
{noformat}
There is no valid operation that can convert one side to the other, so they are
unequal.
It is likely that the code's version attempts to work around issues elsewhere
in the calculations (such as ignoring some predicates, using exponential
back-off for filters, not having a good estimate for {{|D|}}, etc.)
h4. Longer-Term Fix
The above simple fix is the target of this ticket. Longer term, the code should
evolve to use a single path for both the M:1 and M:N cases since as described
in IMPALA-8018. (Both cases start with HMS data. Currently we use two paths to
arrive at the same result. IMPALA-8018 suggests we need only one path.) We
should also adopt the simple urn model as described in IMPALA-8218.
was:
The math is wrong when we estimate the cardinality of a join we've labeled as
"FK/PK" (commonly known as many-to-one or M:1.)
TL;DR: skip to the end for the proposed change. See IMPALA-8018 for the
detailed explanation of the correct math.
h4. Join Estimation in Impala
Join cardinality is estimated in {{JoinNode.computeStats()}}, and especially in
{{getJoinCardinality()}}, which implements the infamous "FK/PK" heuristics.
"The FK/PK detection logic is based on the assumption that most joins are
FK/PK. ... In the absence of relevant stats, we assume FK/PK with a join
selectivity of 1." As noted above, M:1 is more common name for "FK/PK".
During execution, the code first ensures that the larger table is on the probe
(left) side. So, {{getJoinCardinality()}} assumes that the L is the detail
table with the foreign key (FK), while the smaller build-side table is the
master table with the primary key (PK). That is L:R is M:1, joined on {{L.<FK>
= R.<PK>}}.
Next, let's translate the code's naming to the terms defined in IMPALA-8018.
A comment in the code explains the cardinality estimation model:
{noformat}
cardinality = |child(0)| * (|child(1)| / |R|) * (NDV(R.d) / NDV(L.c))
{noformat}
We can check how the code implements the expression:
{code:java}
ndvRatio = slots.rhsNdv() / slots.lhsNdv();
double rhsSelectivity = rhsCard / slots.rhsNumRows();
long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
{code}
So, the comment accurately describes the code.
Let’s translate the code’s expression to the terms defined above:
{noformat}
|L'| |R'|
----^----- ----^-----
cardinality = |child(0)| * (|child(1)| / |R|) * (NDV(R.d) / NDV(L.c))
{noformat}
Using:
* {{cardinality}} here means the cardinality of the join, hence {{|𝜎D ⋈ 𝜎M|}}
* {{L}} = Left = Probe table (before scan) = Detail table, hence {{D}}
* {{R}} = Right = Build table (before scan) = Master table, hence {{M}}
* {{child(0)}} is the output of the scan the left table or {{D}}, hence {{𝜎D}}
* {{child(1)}} is the output of the scan the right table or {{M}}, hence {{𝜎M}}
* {{R.d}} is the join key of the R (master) table, hence {{M.pk}}
* {{L.c}} is the join key of the L (detail) table, hence {{D.fk}}
* NDV is, as we said, the cardinality of the key column, hence {{|k|}}.
Gives:
{noformat}
|D' ⋈ M'| = |D'| * (|M'| / |M|) * (|M.pk| / |D.fk|). [Sic]
{noformat}
The code expression is actually overly complex. Recall that, by definition,
{{|M| = |M.pk|}}:
{noformat}
|D' ⋈ M'| = |D'| * (|M'| / |M|) * (|M.pk| / |D.fk|)
= |D'| * (|M'| / |D.fk|) * (|M.pk| / |M|)
= |D'| * (|M'| / |D.fk|) * (|M| / |M|)
= |D'| * |M'| / |D.fk|
{noformat}
As we’ll see, this formula is *NOT* correct.
h4. Bug in the Code
Compare IMPALA-8018, Equation 5 to the one in the code:
{noformat}
|D' ⋈ M'| = |D'| * |M'| / |D.fk| — Code
|D' ⋈ M'| = |D'| * |M'| / max(|D.fk|, |M.pk|) — Derived
{noformat}
Notice the code contains a severe error: it divides by the foreign key
cardinality, even when that cardinality is *smaller* than that of the primary
key, causing the join cardinality to be larger than the probe cardinality,
which cannot occur if we assume a master/detail (PK/FK) relationship.
To ensure we are reading all this correctly, let' work out an example. Let's
assume that our detail table has 10K rows, our master table 100 rows, and that
only 25 of the master records are referenced. Let's assume the scans remove no
rows. That is:
* {{|D'| = |D| = 10,000}}
* {{|M'| = |M| = = |M.pk| = 100}}
* {{|D.fk| = 25}}
Using Equation 5:
{noformat}
|D' ⋈ M'| = |D'| * |M'| / max( |D.fk|, |M.pk| )
= |D| * |M| / max( |D.fk|, |M.pk| )
= 10,000 * 100 / max( 100, 25 )
= 10,000 * 100 / 100
= 10,000
{noformat}
Makes sense: all 10K detail rows match some master row, though 3/4 of the
master rows are unreferenced.
Using the formula from the code:
{noformat}
cardinality = |D| * |M| / |D.fk|
= 10,000 * 100 / 25
= 10,000 * 4
= 40,000
{noformat}
As a result, the fewer of the primary keys are referenced, the more we expect
the input rows to be replicated and the greater the (mis)estimated output
cardinality. This, of course, leads to poor join selection and thus a poor plan.
Let's check this empirically by printing out the `ndvRatio` (see code snippet
above) when running some queries:
{noformat}
ndv ratio = 2.0977582147609786
ndv ratio = 3.3183831043488135
ndv ratio = 2.0977582147609786
ndv ratio = 2.0977582147609786
{noformat}
So, this seems like a real bug.
h4. Compound Keys
IMPALA-8018 discusses the correct way to handle compound keys and correlated
filters. Impala has an implementation: in {{JoinNode.getFkPkEqJoinConjuncts()}}:
{code:java}
double jointNdv = 1.0;
for (EqJoinConjunctScanSlots slots: fkPkCandidate) jointNdv *=
slots.rhsNdv();
{code}
However, the calculations become muddled in the actual cardinality estimation
in {{JoinNode.getFkPkJoinCardinality()}}
{code:java}
long result = -1;
for (EqJoinConjunctScanSlots slots: eqJoinConjunctSlots) {
// Adjust the join selectivity based on the NDV ratio to avoid
underestimating
// the cardinality if the PK side has a higher NDV than the FK side.
double ndvRatio = 1.0;
if (slots.lhsNdv() > 0) ndvRatio = slots.rhsNdv() / slots.lhsNdv();
double rhsSelectivity = Double.MIN_VALUE;
if (slots.rhsNumRows() > 0) rhsSelectivity = rhsCard / slots.rhsNumRows();
long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
if (result == -1) {
result = joinCard;
} else {
result = Math.min(result, joinCard);
}
}
{code}
The above tries to work to the key cardinality after selection ({{|T.k'|}}),
but the math is off. Rather than using the actual selectivity from the input
node, the code attempts to use the standard formula to derive the selectivity.
Unfortunately, for join nodes, we cannot recover the original table
selectivity, so the above gives the wrong answers.
h4. Proposed Fix
All of this leads to the proposed fix, which is to implement the changes
proposed in IMPALA-8018.
> Incorrect M:1 (FK/PK) cardinality estimation
> --------------------------------------------
>
> Key: IMPALA-8014
> URL: https://issues.apache.org/jira/browse/IMPALA-8014
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
>
> The math is wrong when we estimate the cardinality of a join we've labeled as
> "FK/PK" (commonly known as many-to-one or M:1.)
> h4. Join Estimation
> The standard calculation for joins is explained in Swami & Schiefer, [On the
> Estimation of Join Result
> Sizes|https://pdfs.semanticscholar.org/2735/672262940a23cfce8d4cc1e25c0191de7da7.pdf]
> (S&S). Note especially section 3, Background. Also see [How Good Are Query
> Optimizers, Really?|http://www.vldb.org/pvldb/vol9/p204-leis.pdf] by Leis et
> al.
> {noformat}
> |L'| * |R'|
> |L' ⋈ R'| = ------------------
> max(|L.k'|, |R.k'|)
> {noformat}
> Here:
> * {{|x|}} is the row cardinality if x is a relation, the NDV (domain
> cardinality) if x is a column.
> * {{L}}, {{R}} are left and right input relations.
> * {{L.k}} and {{R.k}} are the (possibly compound) join keys.
> * {{x’}} is the result of applying a filter to relation or column {{x}}.
> Intuitively, the cardinality of the join is the Cartesian product reduced by
> the largest column domain. Since the relations fed into the join are
> filtered, we are concerned with the new, filtered relations created by the
> scan nodes.
> Impala, like most planners, makes three basic assumptions (see the first
> paper above):
> * Uniformity: keys are evenly distributed.
> * Independence: the filtering on the two tables is independent of keys.
> * Containment: that all detail (FK) rows have a corresponding master (FK) row.
> (These are traditional assumptions, but they turn out to be unrealistic in
> many cases; see the second paper above.)
> h4. M:1 Case
> The planner (unnecessarily) divides join planning into two cases.
> (IMPALA-8018 describes how the two steps are unnecessary. But, to minimize
> code change, we live with this process here.)
> * M:1 (AKA "FK/PK", "detail/master", "fact/dimension"). A "foreign key" (FK)
> in the detail table matches at most one "primary key" (PK) in the master
> table. ("At most one" because of filtering which may have removed master
> keys.) Impala assumes that the detail ("FK") table is on the left (probe)
> side, and the master ("PK") table is on the right ("build") side.
> * M:N (AKA "generic", "many-to-many"). Every key on the left matches
> potentially many keys on the right.
> If we focus on just the M:1 case we can rename the relations for convenience
> and observe a simplification:
> {noformat}
> L = D
> R = M
> L.k = M.pk
> R.k = D.fk
> {noformat}
> By definition of M:1:
> {noformat}
> |M.pk'| = |M'|
> {noformat}
> Which gives a revised join expression:
> {noformat}
> |D'| * |M'|
> |D' ⋈ M'| = ------------------
> max(|D.fk'|, |M'|)
> {noformat}
> The expression above is handy: it shows that we only need three values to
> compute the join cardinality:
> * {{|D'|}} the output cardinality from the left plan node, which is already
> available.
> * {{|M'|}} the output cardinality from the right scan node, which is already
> available. (In Impala, the right node will always be a table scan.)
> * {{|D.fk'|}} the number of foreign key values in the filtered {{D'}}
> relation. (In relational theory terms, the cardinality of the domain of the
> foreign key after applying pushed-down selection operations.)
> See IMPALA-XXXX for a necessary adjustment to {{|M'|}} to handle predicates
> common to both sides.
> We only need to estimate {{|D.fk'|}}. We will work up to this step by step
> because the intermediate steps help explain the bug in the current code.
> h4. Filtering of Master Table Only
> Let's start with the simplest case, filtering on only the master table:
> {noformat}
> |D.fk'| = |D.fk| <= |M.pk|
> max(|D.fk'|, |M'|) = max(|M.pk|, |M'|) = |M.pk| = |M|
> {noformat}
> So:
> {noformat}
> |D ⋈ M'| = |D| * |M'| / |M|
> {noformat}
> Intuitively: the probability of any detail row finding a match is simply the
> selectivity of the master filter, or {{|M'| / |M|}}.
> h4. Filtering on the Foreign Key Column
> The next case is also fairly easy. Suppose we know that the left (detail)
> scan applied a filter. Impala presently uses an incorrect, but simplified,
> model for this case. (See IMPALA-XXXX for the correct model.)
> The cardinality of the foreign key is simply the result of applying that
> filter:
> {noformat}
> |D.fk'| = |D.fk| * sel(f)
> {noformat}
> Let us assume we do not filter the master table in this case, so:
> {noformat}
> |D.fk'| < |D.fk| <= |M.pk|
> max(|D.fk'|, |M|) = |M|
> {noformat}
> And:
> {noformat}
> |D' ⋈ M| = |D'| * |M| / |M| = |D'|
> {noformat}
> We assume "containment" from the S&S paper above: all the foreign keys have a
> matching primary key if {{|F.fk| <= |M.pk|}}. So, if we filter only on the
> detail table, all foreign keys find a match and the cardinality of the join
> is just the cardinality of the left input relation.
> h4. Combined Left and Right Filtering
> Now, let's combine the master and detail filtering cases. In general, we will
> have filtering on both sides. The {{max}} in the expression automatically
> combines the cases:
> {noformat}
> |D.fk'| = |D.fk| * |D'| / |D|
> |D'| * |M'|
> |D' ⋈ M'| = ------------------
> max(|D.fk'|, |M'|)
> {noformat}
> Intuitively, the number of rows is the Cartesian product divided by the
> larger of the number of keys in either table after the scan. Said another
> way, each detail row finds a master, unless filtering has removed so many
> masters that some detail rows find no match, in which case the probability of
> a match is {{|M'| / |D.fk'|}}.
> h4. Compound Primary Keys
> The final complexity is to consider a compound key. That is:
> {noformat}
> (D.fk1, Dfk2) --> (M.pk1, M.pk2)
> {noformat}
> The foreign key pair points to a matching primary key pair. Here we consider
> only pairs, but the logic is the same for a compound key with any number of
> columns. Obviously, by the definition of a join in SQL, the number of keys on
> each side must be the same.
> If we know (from HMS metadata) that the above pairs are, in fact, the keys,
> then we can make a simplifying assumption:
> {noformat}
> |(M.pk1, M.pk2)| = |M|
> |(D.fk1, Dfk2)| <= |(M.pk1, M.pk2)| = |M|
> {noformat}
> If we don't know a-priori that a pair (k1, k1) is a foreign or primary key,
> then we can estimate its cardinality (assuming independence of values) as:
> {noformat}
> |(k1, k2)| = |k1| * |k2|
> {noformat}
> In the M:1 case, the primary key cardinality must be the same as table
> cardinality. If the two columns are completely independent, then:
> {noformat}
> |(M.pk1, M.pk2)| = |M.pk1| * |M.pk2| = |M|
> {noformat}
> More typically, there is some correlation between the columns so:
> {noformat}
> |(M.pk1, M.pk2)| = |M| <= |M.pk1| * |M.pk2|
> {noformat}
> Indeed, Impala uses the above relation to decide we have the FK/PK case. Said
> another way, if Impala follows the FK/PK logic, then we can simply assume
> that {{|M.pk| = |M|}} even if the key is compound.
> h4. Compound Foreign Keys
> Foreign keys require a bit more thought. We could have a detail table with
> only one row. If we have many detail rows, the assumption of containment says
> that each detail record points to some master record, so:
> {noformat}
> |D.fk| <= |M.pk|
> {noformat}
> This lets us estimate the cardinality of a compound foreign key as:
> {noformat}
> |(D.fk1, D.fk2)| = min( |D.fk1| * |D.fk2|, |M| )
> {noformat}
> That is, if the detail table is small, the left term is a reasonable estimate
> (ignoring the urn model issue). But, as the table gets larger, and begins to
> include most primary keys, we know that the number of foreign keys can't be
> larger than the number of primary keys, so the right term is the better
> estimate.
> h4. Filtering on Compound Keys
> Suppose that the join inputs have filtering applied to the left (detail)
> table. We discussed how to handle this fo a single column. For a compound
> key, we can observe that the cardinality of the key is the product of the
> cardinality of the columns, but (using the containment assumption), no larger
> than the cardinality of the primary key (which is the cardinality of the
> master table.)
> So:
> {noformat}
> |D.fk| = min( ∏ |D.fki|, |M| )
>
> |D.fk'| = |D.fk| * |D'| / |D|
> = min( ∏ |D.fki|, |M| ) * |D'| / |D|
> {noformat}
> So the final expression for join cardinality is:
> {noformat}
> |D.fk'| = min( ∏ |D.fki|, |M| ) * |D'| / |D|
> |D'| * |M'|
> |D' ⋈ M'| = -------------------
> max( |D.fk'|, |M'|)
> {noformat}
> The first expression says that the compound foreign key cardinality is either
> the product of the columns that make up the key, or the cardinality of the
> primary key, whichever is less. We then adjust that amount (incorrectly) by
> the percentage of the detail table that is scanned.
> The second expression is just the Cartesian product divided by the largest
> key cardinality: either foreign key or primary key (which is, by definition,
> equal to the cardinality of the master table.)
> h4. Complexity: Compound Joins
> The above expression works well if the left input to a join is a base table
> row which we know the original table cardinality that corresponds to the
> original column NDV. The above can be used as-is in a M:N ("generic") join
> for a right-side table. But, when when used on the left side, we must recall
> that Impala builds left-deep join plans, so the left side may be a join. In
> this case, there is no original left-side base table.
> See IMPALA-XXXX for discussion of the complexities in this case. A quick and
> dirty solution is to use the scan output cardinality in place of the left
> input cardinality. That is:
> * Determine the table that contains the join key.
> * Search the left subtree for the scan for that table.
> To do this, start with the left input:
> * If the node is a scan node, and the scan is for the target table, return
> the output cardinality of that scan.
> * If the node is a join, then apply the search to *both* sides of the join.
> This approach is cumbersome, and may run into complexities if the node is
> something other than a scan or join. It also may underestimate if a filter is
> applied at the join level.
> A cleaner, tough more involved, solution is to track adjusted NDV for each
> column through each operator as described in IMPALA-8220.
> h4. Code Bug
> The commit "IMPALA-5547: Rework FK/PK join detection", ID
> [{{9f678a74269250bf5c7ae2c5e8afd93c5b3734de}}|https://github.com/apache/impala/commit/9f678a74269250bf5c7ae2c5e8afd93c5b3734de#diff-b10ccc2cdf68b236be400cde1e858a7c]
> on Jun 6, 2017 reworked the FK/PK logic. It has one flaw: after determining
> that we have an FK/PK (M:1) case, it then attempts to adjust the compound FK
> and PK columns. This has two problems:
> * It is unnecessary, as we saw above.
> * The math is wrong and produces bogus estimates.
> Here is the code in {{JoinNode.java}}:
> {code:java}
> long result = -1;
> for (EqJoinConjunctScanSlots slots: eqJoinConjunctSlots) {
> // Adjust the join selectivity based on the NDV ratio to avoid
> underestimating
> // the cardinality if the PK side has a higher NDV than the FK side.
> double ndvRatio = 1.0;
> if (slots.lhsNdv() > 0) ndvRatio = slots.rhsNdv() / slots.lhsNdv();
> double rhsSelectivity = Double.MIN_VALUE;
> if (slots.rhsNumRows() > 0) rhsSelectivity = rhsCard /
> slots.rhsNumRows();
> long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
> if (result == -1) {
> result = joinCard;
> } else {
> result = Math.min(result, joinCard);
> }
> }
> // FK/PK join cardinality must be <= the lhs cardinality.
> result = Math.min(result, lhsCard);
> {code}
> The above is hard to follow, which may account for why the bug was not
> caught. Ignoring some corner cases, the logic is essentially:
> {noformat}
> lhsCard = |D'|
> ndvRatioi = |D.fki| / |M.pki|
> rhsSelectivityi = |M'| / |M|
> joinCardi = lhsCard * ndvRatioi * rhsSelectivityi
> = |D'| * (|D.fki| / |M.pki|) * |M'| / |M|
> |join| = min(joinCardi)
> = (|D'| * |M'| / |M|) * min(|D.fki| / |M.pki|)
> {noformat}
> Though hard to see, the above is not equivalent to the logic worked out in
> the previous section. Using the correct expression from earlier sections:
> {noformat}
> |D'| * |M'|
> |D' ⋈ M'| = ----------------------------------------------
> max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)
> {noformat}
> We can factor out the common {{|D'| * |M'|}} terms and compare:
> {noformat}
> min(|D.fki| / |M.pki|) 1
> ---------------------- != ----------------------------------------------
> |M| max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)
> 1 1
> ---------------------------- !=
> ----------------------------------------------
> |M| * max(|M.pki| / |D.fki|) max(min( ∏ |D.fki|, |M| ) * |D'| / |D|),
> |M'|)
> |M| * max(|M.pki| / |D.fki|) != max(min( ∏ |D.fki|, |M| ) * |D'| / |D|),
> |M'|)
> {noformat}
> There is no valid operation that can convert one side to the other, so they
> are unequal.
> It is likely that the code's version attempts to work around issues elsewhere
> in the calculations (such as ignoring some predicates, using exponential
> back-off for filters, not having a good estimate for {{|D|}}, etc.)
> h4. Longer-Term Fix
> The above simple fix is the target of this ticket. Longer term, the code
> should evolve to use a single path for both the M:1 and M:N cases since as
> described in IMPALA-8018. (Both cases start with HMS data. Currently we use
> two paths to arrive at the same result. IMPALA-8018 suggests we need only one
> path.) We should also adopt the simple urn model as described in IMPALA-8218.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]