[ 
https://issues.apache.org/jira/browse/IMPALA-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated IMPALA-8014:
--------------------------------
    Description: 
The math is wrong when we estimate the cardinality of a join we've labeled as 
"FK/PK" (commonly known as many-to-one or M:1.)


h4. Join Estimation

The standard calculation for joins is explained in Swami & Schiefer, [On the 
Estimation of Join Result 
Sizes|https://pdfs.semanticscholar.org/2735/672262940a23cfce8d4cc1e25c0191de7da7.pdf]
 (S&S). Note especially section 3, Background. Also see [How Good Are Query 
Optimizers, Really?|http://www.vldb.org/pvldb/vol9/p204-leis.pdf] by Leis et al.

{noformat}
                |L'| * |R'| 
|L' ⋈ R'| = ------------------
            max(|L.k'|, |R.k'|)
{noformat}

Here:

* {{|x|}} is the row cardinality if x is a relation, the NDV (domain 
cardinality) if x is a column.
* {{L}}, {{R}} are left and right input relations.
* {{L.k}} and {{R.k}} are the (possibly compound) join keys.
* {{x’}} is the result of applying a filter to relation or column {{x}}.

Intuitively, the cardinality of the join is the Cartesian product reduced by 
the largest column domain. Since the relations fed into the join are filtered, 
we are concerned with the new, filtered relations created by the scan nodes.

Impala, like most planners, makes three basic assumptions (see the first paper 
above):

* Uniformity: keys are evenly distributed.
* Independence: the filtering on the two tables is independent of keys.
* Containment: that all detail (FK) rows have a corresponding master (FK) row.

(These are traditional assumptions, but they turn out to be unrealistic in many 
cases; see the second paper above.)


h4. M:1 Case

The planner (unnecessarily) divides join planning into two cases. (IMPALA-8018 
describes how the two steps are unnecessary. But, to minimize code change, we 
live with this process here.)

* M:1 (AKA "FK/PK", "detail/master", "fact/dimension"). A "foreign key" (FK) in 
the detail table matches at most one "primary key" (PK) in the master table. 
("At most one" because of filtering which may have removed master keys.) Impala 
assumes that the detail ("FK") table is on the left (probe) side, and the 
master ("PK") table is on the right ("build") side.
* M:N (AKA "generic", "many-to-many"). Every key on the left matches 
potentially many keys on the right.

If we focus on just the M:1 case we can rename the relations for convenience 
and observe a simplification:

{noformat}
L = D
R = M
L.k = M.pk
R.k = D.fk
{noformat}

By definition of M:1:

{noformat}
|M.pk'| = |M'|
{noformat}

Which gives a revised join expression:

{noformat}
                |D'| * |M'| 
|D' ⋈ M'| = ------------------
            max(|D.fk'|, |M'|)
{noformat}

The expression above is handy: it shows that we only need three values to 
compute the join cardinality:

* {{|D'|}} the output cardinality from the left plan node, which is already 
available.
* {{|M'|}} the output cardinality from the right scan node, which is already 
available. (In Impala, the right node will always be a table scan.)
* {{|D.fk'|}} the number of foreign key values in the filtered {{D'}} relation. 
(In relational theory terms, the cardinality of the domain of the foreign key 
after applying pushed-down selection operations.)

See IMPALA-XXXX for a necessary adjustment to {{|M'|}} to handle predicates 
common to both sides.

We only need to estimate {{|D.fk'|}}. We will work up to this step by step 
because the intermediate steps help explain the bug in the current code.


h4. Filtering of Master Table Only

Let's start with the simplest case, filtering on only the master table:

{noformat}
|D.fk'| = |D.fk| <= |M.pk|

max(|D.fk'|, |M'|) = max(|M.pk|, |M'|) = |M.pk| = |M|
{noformat}

So:

{noformat}   
|D ⋈ M'| = |D| * |M'| / |M|
{noformat}

Intuitively: the probability of any detail row finding a match is simply the 
selectivity of the master filter, or {{|M'| / |M|}}.


h4. Filtering on the Foreign Key Column

The next case is also fairly easy. Suppose we know that the left (detail) scan 
applied a filter. Impala presently uses an incorrect, but simplified, model for 
this case. (See IMPALA-XXXX for the correct model.)

The cardinality of the foreign key is simply the result of applying that filter:

{noformat}
|D.fk'| = |D.fk| * sel(f)
{noformat}

Let us assume we do not filter the master table in this case, so:

{noformat}
|D.fk'| < |D.fk| <= |M.pk|
max(|D.fk'|, |M|) = |M|
{noformat}

And:

{noformat}
|D' ⋈ M| = |D'| * |M| / |M| = |D'|
{noformat}

We assume "containment" from the S&S paper above: all the foreign keys have a 
matching primary key if {{|F.fk| <= |M.pk|}}. So, if we filter only on the 
detail table, all foreign keys find a match and the cardinality of the join is 
just the cardinality of the left input relation.


h4. Combined Left and Right Filtering

Now, let's combine the master and detail filtering cases. In general, we will 
have filtering on both sides. The {{max}} in the expression automatically 
combines the cases:

{noformat}
|D.fk'| = |D.fk| * |D'| / |D|

                |D'| * |M'| 
|D' ⋈ M'| = ------------------
            max(|D.fk'|, |M'|)
{noformat}

Intuitively, the number of rows is the Cartesian product divided by the larger 
of the number of keys in either table after the scan. Said another way, each 
detail row finds a master, unless filtering has removed so many masters that 
some detail rows find no match, in which case the probability of a match is 
{{|M'| / |D.fk'|}}.


h4. Compound Primary Keys

The final complexity is to consider a compound key. That is:

{noformat}
(D.fk1, Dfk2) --> (M.pk1, M.pk2)
{noformat}

The foreign key pair points to a matching primary key pair. Here we consider 
only pairs, but the logic is the same for a compound key with any number of 
columns. Obviously, by the definition of a join in SQL, the number of keys on 
each side must be the same.

If we know (from HMS metadata) that the above pairs are, in fact, the keys, 
then we can make a simplifying assumption:

{noformat}
|(M.pk1, M.pk2)| = |M|

|(D.fk1, Dfk2)| <= |(M.pk1, M.pk2)| = |M|
{noformat}

If we don't know a-priori that a pair (k1, k1) is a foreign or primary key, 
then we can estimate its cardinality (assuming independence of values) as:

{noformat}
|(k1, k2)| = |k1| * |k2|
{noformat}

In the M:1 case, the primary key cardinality must be the same as table 
cardinality. If the two columns are completely independent, then:

{noformat}
|(M.pk1, M.pk2)| = |M.pk1| * |M.pk2| = |M|
{noformat}

More typically, there is some correlation between the columns so:

{noformat}
|(M.pk1, M.pk2)| = |M| <= |M.pk1| * |M.pk2|
{noformat}

Indeed, Impala uses the above relation to decide we have the FK/PK case. Said 
another way, if Impala follows the FK/PK logic, then we can simply assume that 
{{|M.pk| = |M|}} even if the key is compound.


h4. Compound Foreign Keys

Foreign keys require a bit more thought. We could have a detail table with only 
one row. If we have many detail rows, the assumption of containment says that 
each detail record points to some master record, so: 

{noformat}
|D.fk| <= |M.pk|
{noformat}

This lets us estimate the cardinality of a compound foreign key as:

{noformat}
|(D.fk1, D.fk2)| = min( |D.fk1| * |D.fk2|, |M| )
{noformat}

That is, if the detail table is small, the left term is a reasonable estimate 
(ignoring the urn model issue). But, as the table gets larger, and begins to 
include most primary keys, we know that the number of foreign keys can't be 
larger than the number of primary keys, so the right term is the better 
estimate.


h4. Filtering on Compound Keys

Suppose that the join inputs have filtering applied to the left (detail) table. 
We discussed how to handle this fo a single column. For a compound key, we can 
observe that the cardinality of the key is the product of the cardinality of 
the columns, but (using the containment assumption), no larger than the 
cardinality of the primary key (which is the cardinality of the master table.)

So:

{noformat}
|D.fk| = min( ∏ |D.fki|, |M| )
                           
|D.fk'| = |D.fk| * |D'| / |D|

        = min( ∏ |D.fki|, |M| ) * |D'| / |D|
{noformat}

So the final expression for join cardinality is:

{noformat}
|D.fk'| = min( ∏ |D.fki|, |M| ) * |D'| / |D|


                |D'| * |M'| 
|D' ⋈ M'| = -------------------
            max( |D.fk'|, |M'|)
{noformat}

The first expression says that the compound foreign key cardinality is either 
the product of the columns that make up the key, or the cardinality of the 
primary key, whichever is less. We then adjust that amount (incorrectly) by the 
percentage of the detail table that is scanned.

The second expression is just the Cartesian product divided by the largest key 
cardinality: either foreign key or primary key (which is, by definition, equal 
to the cardinality of the master table.) 


h4. Complexity: Compound Joins

The above expression works well if the left input to a join is a base table row 
which we know the original table cardinality that corresponds to the original 
column NDV. The above can be used as-is in a M:N ("generic") join for a 
right-side table. But, when when used on the left side, we must recall that 
Impala builds left-deep join plans, so the left side may be a join. In this 
case, there is no original left-side base table.

See IMPALA-XXXX for discussion of the complexities in this case. A quick and 
dirty solution is to use the scan output cardinality in place of the left input 
cardinality. That is:

* Determine the table that contains the join key.
* Search the left subtree for the scan for that table.

To do this, start with the left input:

* If the node is a scan node, and the scan is for the target table, return the 
output cardinality of that scan.
* If the node is a join, then apply the search to *both* sides of the join.

This approach is cumbersome, and may run into complexities if the node is 
something other than a scan or join. It also may underestimate if a filter is 
applied at the join level.

A cleaner, tough more involved, solution is to track adjusted NDV for each 
column through each operator as described in IMPALA-8220.


h4. Code Bug

The commit "IMPALA-5547: Rework FK/PK join detection", ID 
[{{9f678a74269250bf5c7ae2c5e8afd93c5b3734de}}|https://github.com/apache/impala/commit/9f678a74269250bf5c7ae2c5e8afd93c5b3734de#diff-b10ccc2cdf68b236be400cde1e858a7c]
 on Jun 6, 2017 reworked the FK/PK logic. It has one flaw: after determining 
that we have an FK/PK (M:1) case, it then attempts to adjust the compound FK 
and PK columns. This has two problems:

* It is unnecessary, as we saw above.
* The math is wrong and produces bogus estimates.

Here is the code in {{JoinNode.java}}:

{code:java}
    long result = -1;
    for (EqJoinConjunctScanSlots slots: eqJoinConjunctSlots) {
      // Adjust the join selectivity based on the NDV ratio to avoid 
underestimating
      // the cardinality if the PK side has a higher NDV than the FK side.
      double ndvRatio = 1.0;
      if (slots.lhsNdv() > 0) ndvRatio = slots.rhsNdv() / slots.lhsNdv();
      double rhsSelectivity = Double.MIN_VALUE;
      if (slots.rhsNumRows() > 0) rhsSelectivity = rhsCard / slots.rhsNumRows();
      long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
      if (result == -1) {
        result = joinCard;
      } else {
        result = Math.min(result, joinCard);
      }
    }
    // FK/PK join cardinality must be <= the lhs cardinality.
    result = Math.min(result, lhsCard);
{code}

The above is hard to follow, which may account for why the bug was not caught. 
Ignoring some corner cases, the logic is essentially:

{noformat}
lhsCard = |D'|

ndvRatioi = |D.fki| / |M.pki|

rhsSelectivityi = |M'| / |M|

joinCardi = lhsCard * ndvRatioi * rhsSelectivityi

          = |D'| * (|D.fki| / |M.pki|) * |M'| / |M|

|join| = min(joinCardi)

       = (|D'| * |M'| / |M|) * min(|D.fki| / |M.pki|)
{noformat}

Though hard to see, the above is not equivalent to the logic worked out in the 
previous section. Using the correct expression from earlier sections:

{noformat}
                            |D'| * |M'| 
|D' ⋈ M'| = ----------------------------------------------
            max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)

{noformat}

We can factor out the common {{|D'| * |M'|}} terms and compare:

{noformat}
min(|D.fki| / |M.pki|)                     1
----------------------  !=  ----------------------------------------------
        |M|                 max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)


             1                                  1
----------------------------  !=  ----------------------------------------------
|M| * max(|M.pki| / |D.fki|)      max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)


|M| * max(|M.pki| / |D.fki|)  !=  max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)
{noformat}

There is no valid operation that can convert one side to the other, so they are 
unequal.

It is likely that the code's version attempts to work around issues elsewhere 
in the calculations (such as ignoring some predicates, using exponential 
back-off for filters, not having a good estimate for {{|D|}}, etc.)


h4. Longer-Term Fix

The above simple fix is the target of this ticket. Longer term, the code should 
evolve to use a single path for both the M:1 and M:N cases since as described 
in IMPALA-8018. (Both cases start with HMS data. Currently we use two paths to 
arrive at the same result. IMPALA-8018 suggests we need only one path.) We 
should also adopt the simple urn model as described in IMPALA-8218.

  was:
The math is wrong when we estimate the cardinality of a join we've labeled as 
"FK/PK" (commonly known as many-to-one or M:1.)

TL;DR: skip to the end for the proposed change. See IMPALA-8018 for the 
detailed explanation of the correct math.

h4. Join Estimation in Impala

Join cardinality is estimated in {{JoinNode.computeStats()}}, and especially in 
{{getJoinCardinality()}}, which implements the infamous "FK/PK" heuristics. 
"The FK/PK detection logic is based on the assumption that most joins are 
FK/PK. ... In the absence of relevant stats, we assume FK/PK with a join 
selectivity of 1." As noted above, M:1 is more common name for "FK/PK".

During execution, the code first ensures that the larger table is on the probe 
(left) side. So, {{getJoinCardinality()}} assumes that the L is the detail 
table with the foreign key (FK), while the smaller build-side table is the 
master table with the primary key (PK). That is L:R is M:1, joined on {{L.<FK> 
= R.<PK>}}.

Next, let's translate the code's naming to the terms defined in IMPALA-8018.

A comment in the code explains the cardinality estimation model:

{noformat}
cardinality = |child(0)| * (|child(1)| / |R|) * (NDV(R.d) / NDV(L.c))
{noformat}

We can check how the code implements the expression:

{code:java}
      ndvRatio = slots.rhsNdv() / slots.lhsNdv();
      double rhsSelectivity = rhsCard / slots.rhsNumRows();
      long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
{code}

So, the comment accurately describes the code.

Let’s translate the code’s expression to the terms defined above:

{noformat}
                 |L'|          |R'|
              ----^-----    ----^-----
cardinality = |child(0)| * (|child(1)| / |R|) * (NDV(R.d) / NDV(L.c))
{noformat}

Using:

 * {{cardinality}} here means the cardinality of the join, hence {{|𝜎D ⋈ 𝜎M|}}
 * {{L}} = Left = Probe table (before scan) = Detail table, hence {{D}}
 * {{R}} = Right = Build table (before scan) = Master table, hence {{M}}
 * {{child(0)}} is the output of the scan the left table or {{D}}, hence {{𝜎D}}
 * {{child(1)}} is the output of the scan the right table or {{M}}, hence {{𝜎M}}
 * {{R.d}} is the join key of the R (master) table, hence {{M.pk}}
 * {{L.c}} is the join key of the L (detail) table, hence {{D.fk}}
 * NDV is, as we said, the cardinality of the key column, hence {{|k|}}.

Gives:

{noformat}
|D' ⋈ M'| = |D'| * (|M'| / |M|) * (|M.pk| / |D.fk|).   [Sic]
{noformat}

The code expression is actually overly complex. Recall that, by definition, 
{{|M| = |M.pk|}}:

{noformat}
|D' ⋈ M'| = |D'| * (|M'| / |M|) * (|M.pk| / |D.fk|)
          = |D'| * (|M'| / |D.fk|) * (|M.pk| / |M|)
          = |D'| * (|M'| / |D.fk|) * (|M| / |M|)
          = |D'| * |M'| / |D.fk|
{noformat}

As we’ll see, this formula is *NOT* correct.

h4. Bug in the Code

Compare IMPALA-8018, Equation 5 to the one in the code:

{noformat}
|D' ⋈ M'| = |D'| * |M'| / |D.fk|                — Code

|D' ⋈ M'| = |D'| * |M'| / max(|D.fk|, |M.pk|)   — Derived
{noformat}

Notice the code contains a severe error: it divides by the foreign key 
cardinality, even when that cardinality is *smaller* than that of the primary 
key, causing the join cardinality to be larger than the probe cardinality, 
which cannot occur if we assume a master/detail (PK/FK) relationship.

To ensure we are reading all this correctly, let' work out an example. Let's 
assume that our detail table has 10K rows, our master table 100 rows, and that 
only 25 of the master records are referenced. Let's assume the scans remove no 
rows. That is:

 * {{|D'| = |D| = 10,000}}
 * {{|M'| = |M| = = |M.pk| = 100}}
 * {{|D.fk| = 25}}

Using Equation 5:

{noformat}
|D' ⋈ M'| = |D'| * |M'| / max( |D.fk|, |M.pk| )
          = |D| * |M| / max( |D.fk|, |M.pk| )
          = 10,000 * 100 / max( 100, 25 )
          = 10,000 * 100 / 100
          = 10,000
{noformat}

Makes sense: all 10K detail rows match some master row, though 3/4 of the 
master rows are unreferenced.

Using the formula from the code:

{noformat}
cardinality = |D| * |M| / |D.fk|
            = 10,000 * 100 / 25
            = 10,000 * 4
            = 40,000
{noformat}

As a result, the fewer of the primary keys are referenced, the more we expect 
the input rows to be replicated and the greater the (mis)estimated output 
cardinality. This, of course, leads to poor join selection and thus a poor plan.

Let's check this empirically by printing out the `ndvRatio` (see code snippet 
above) when running some queries:

{noformat}
ndv ratio = 2.0977582147609786
ndv ratio = 3.3183831043488135
ndv ratio = 2.0977582147609786
ndv ratio = 2.0977582147609786
{noformat}

So, this seems like a real bug.

h4. Compound Keys

IMPALA-8018 discusses the correct way to handle compound keys and correlated 
filters. Impala has an implementation: in {{JoinNode.getFkPkEqJoinConjuncts()}}:

{code:java}
      double jointNdv = 1.0;
      for (EqJoinConjunctScanSlots slots: fkPkCandidate) jointNdv *= 
slots.rhsNdv();
{code}

However, the calculations become muddled in the actual cardinality estimation 
in {{JoinNode.getFkPkJoinCardinality()}}

{code:java}
    long result = -1;
    for (EqJoinConjunctScanSlots slots: eqJoinConjunctSlots) {
      // Adjust the join selectivity based on the NDV ratio to avoid 
underestimating
      // the cardinality if the PK side has a higher NDV than the FK side.
      double ndvRatio = 1.0;
      if (slots.lhsNdv() > 0) ndvRatio = slots.rhsNdv() / slots.lhsNdv();
      double rhsSelectivity = Double.MIN_VALUE;
      if (slots.rhsNumRows() > 0) rhsSelectivity = rhsCard / slots.rhsNumRows();
      long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
      if (result == -1) {
        result = joinCard;
      } else {
        result = Math.min(result, joinCard);
      }
    }
{code}

The above tries to work to the key cardinality after selection ({{|T.k'|}}), 
but the math is off. Rather than using the actual selectivity from the input 
node, the code attempts to use the standard formula to derive the selectivity. 
Unfortunately, for join nodes, we cannot recover the original table 
selectivity, so the above gives the wrong answers.

h4. Proposed Fix

All of this leads to the proposed fix, which is to implement the changes 
proposed in IMPALA-8018.


> Incorrect M:1 (FK/PK) cardinality estimation
> --------------------------------------------
>
>                 Key: IMPALA-8014
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8014
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>
> The math is wrong when we estimate the cardinality of a join we've labeled as 
> "FK/PK" (commonly known as many-to-one or M:1.)
> h4. Join Estimation
> The standard calculation for joins is explained in Swami & Schiefer, [On the 
> Estimation of Join Result 
> Sizes|https://pdfs.semanticscholar.org/2735/672262940a23cfce8d4cc1e25c0191de7da7.pdf]
>  (S&S). Note especially section 3, Background. Also see [How Good Are Query 
> Optimizers, Really?|http://www.vldb.org/pvldb/vol9/p204-leis.pdf] by Leis et 
> al.
> {noformat}
>                 |L'| * |R'| 
> |L' ⋈ R'| = ------------------
>             max(|L.k'|, |R.k'|)
> {noformat}
> Here:
> * {{|x|}} is the row cardinality if x is a relation, the NDV (domain 
> cardinality) if x is a column.
> * {{L}}, {{R}} are left and right input relations.
> * {{L.k}} and {{R.k}} are the (possibly compound) join keys.
> * {{x’}} is the result of applying a filter to relation or column {{x}}.
> Intuitively, the cardinality of the join is the Cartesian product reduced by 
> the largest column domain. Since the relations fed into the join are 
> filtered, we are concerned with the new, filtered relations created by the 
> scan nodes.
> Impala, like most planners, makes three basic assumptions (see the first 
> paper above):
> * Uniformity: keys are evenly distributed.
> * Independence: the filtering on the two tables is independent of keys.
> * Containment: that all detail (FK) rows have a corresponding master (FK) row.
> (These are traditional assumptions, but they turn out to be unrealistic in 
> many cases; see the second paper above.)
> h4. M:1 Case
> The planner (unnecessarily) divides join planning into two cases. 
> (IMPALA-8018 describes how the two steps are unnecessary. But, to minimize 
> code change, we live with this process here.)
> * M:1 (AKA "FK/PK", "detail/master", "fact/dimension"). A "foreign key" (FK) 
> in the detail table matches at most one "primary key" (PK) in the master 
> table. ("At most one" because of filtering which may have removed master 
> keys.) Impala assumes that the detail ("FK") table is on the left (probe) 
> side, and the master ("PK") table is on the right ("build") side.
> * M:N (AKA "generic", "many-to-many"). Every key on the left matches 
> potentially many keys on the right.
> If we focus on just the M:1 case we can rename the relations for convenience 
> and observe a simplification:
> {noformat}
> L = D
> R = M
> L.k = M.pk
> R.k = D.fk
> {noformat}
> By definition of M:1:
> {noformat}
> |M.pk'| = |M'|
> {noformat}
> Which gives a revised join expression:
> {noformat}
>                 |D'| * |M'| 
> |D' ⋈ M'| = ------------------
>             max(|D.fk'|, |M'|)
> {noformat}
> The expression above is handy: it shows that we only need three values to 
> compute the join cardinality:
> * {{|D'|}} the output cardinality from the left plan node, which is already 
> available.
> * {{|M'|}} the output cardinality from the right scan node, which is already 
> available. (In Impala, the right node will always be a table scan.)
> * {{|D.fk'|}} the number of foreign key values in the filtered {{D'}} 
> relation. (In relational theory terms, the cardinality of the domain of the 
> foreign key after applying pushed-down selection operations.)
> See IMPALA-XXXX for a necessary adjustment to {{|M'|}} to handle predicates 
> common to both sides.
> We only need to estimate {{|D.fk'|}}. We will work up to this step by step 
> because the intermediate steps help explain the bug in the current code.
> h4. Filtering of Master Table Only
> Let's start with the simplest case, filtering on only the master table:
> {noformat}
> |D.fk'| = |D.fk| <= |M.pk|
> max(|D.fk'|, |M'|) = max(|M.pk|, |M'|) = |M.pk| = |M|
> {noformat}
> So:
> {noformat}   
> |D ⋈ M'| = |D| * |M'| / |M|
> {noformat}
> Intuitively: the probability of any detail row finding a match is simply the 
> selectivity of the master filter, or {{|M'| / |M|}}.
> h4. Filtering on the Foreign Key Column
> The next case is also fairly easy. Suppose we know that the left (detail) 
> scan applied a filter. Impala presently uses an incorrect, but simplified, 
> model for this case. (See IMPALA-XXXX for the correct model.)
> The cardinality of the foreign key is simply the result of applying that 
> filter:
> {noformat}
> |D.fk'| = |D.fk| * sel(f)
> {noformat}
> Let us assume we do not filter the master table in this case, so:
> {noformat}
> |D.fk'| < |D.fk| <= |M.pk|
> max(|D.fk'|, |M|) = |M|
> {noformat}
> And:
> {noformat}
> |D' ⋈ M| = |D'| * |M| / |M| = |D'|
> {noformat}
> We assume "containment" from the S&S paper above: all the foreign keys have a 
> matching primary key if {{|F.fk| <= |M.pk|}}. So, if we filter only on the 
> detail table, all foreign keys find a match and the cardinality of the join 
> is just the cardinality of the left input relation.
> h4. Combined Left and Right Filtering
> Now, let's combine the master and detail filtering cases. In general, we will 
> have filtering on both sides. The {{max}} in the expression automatically 
> combines the cases:
> {noformat}
> |D.fk'| = |D.fk| * |D'| / |D|
>                 |D'| * |M'| 
> |D' ⋈ M'| = ------------------
>             max(|D.fk'|, |M'|)
> {noformat}
> Intuitively, the number of rows is the Cartesian product divided by the 
> larger of the number of keys in either table after the scan. Said another 
> way, each detail row finds a master, unless filtering has removed so many 
> masters that some detail rows find no match, in which case the probability of 
> a match is {{|M'| / |D.fk'|}}.
> h4. Compound Primary Keys
> The final complexity is to consider a compound key. That is:
> {noformat}
> (D.fk1, Dfk2) --> (M.pk1, M.pk2)
> {noformat}
> The foreign key pair points to a matching primary key pair. Here we consider 
> only pairs, but the logic is the same for a compound key with any number of 
> columns. Obviously, by the definition of a join in SQL, the number of keys on 
> each side must be the same.
> If we know (from HMS metadata) that the above pairs are, in fact, the keys, 
> then we can make a simplifying assumption:
> {noformat}
> |(M.pk1, M.pk2)| = |M|
> |(D.fk1, Dfk2)| <= |(M.pk1, M.pk2)| = |M|
> {noformat}
> If we don't know a-priori that a pair (k1, k1) is a foreign or primary key, 
> then we can estimate its cardinality (assuming independence of values) as:
> {noformat}
> |(k1, k2)| = |k1| * |k2|
> {noformat}
> In the M:1 case, the primary key cardinality must be the same as table 
> cardinality. If the two columns are completely independent, then:
> {noformat}
> |(M.pk1, M.pk2)| = |M.pk1| * |M.pk2| = |M|
> {noformat}
> More typically, there is some correlation between the columns so:
> {noformat}
> |(M.pk1, M.pk2)| = |M| <= |M.pk1| * |M.pk2|
> {noformat}
> Indeed, Impala uses the above relation to decide we have the FK/PK case. Said 
> another way, if Impala follows the FK/PK logic, then we can simply assume 
> that {{|M.pk| = |M|}} even if the key is compound.
> h4. Compound Foreign Keys
> Foreign keys require a bit more thought. We could have a detail table with 
> only one row. If we have many detail rows, the assumption of containment says 
> that each detail record points to some master record, so: 
> {noformat}
> |D.fk| <= |M.pk|
> {noformat}
> This lets us estimate the cardinality of a compound foreign key as:
> {noformat}
> |(D.fk1, D.fk2)| = min( |D.fk1| * |D.fk2|, |M| )
> {noformat}
> That is, if the detail table is small, the left term is a reasonable estimate 
> (ignoring the urn model issue). But, as the table gets larger, and begins to 
> include most primary keys, we know that the number of foreign keys can't be 
> larger than the number of primary keys, so the right term is the better 
> estimate.
> h4. Filtering on Compound Keys
> Suppose that the join inputs have filtering applied to the left (detail) 
> table. We discussed how to handle this fo a single column. For a compound 
> key, we can observe that the cardinality of the key is the product of the 
> cardinality of the columns, but (using the containment assumption), no larger 
> than the cardinality of the primary key (which is the cardinality of the 
> master table.)
> So:
> {noformat}
> |D.fk| = min( ∏ |D.fki|, |M| )
>                            
> |D.fk'| = |D.fk| * |D'| / |D|
>         = min( ∏ |D.fki|, |M| ) * |D'| / |D|
> {noformat}
> So the final expression for join cardinality is:
> {noformat}
> |D.fk'| = min( ∏ |D.fki|, |M| ) * |D'| / |D|
>                 |D'| * |M'| 
> |D' ⋈ M'| = -------------------
>             max( |D.fk'|, |M'|)
> {noformat}
> The first expression says that the compound foreign key cardinality is either 
> the product of the columns that make up the key, or the cardinality of the 
> primary key, whichever is less. We then adjust that amount (incorrectly) by 
> the percentage of the detail table that is scanned.
> The second expression is just the Cartesian product divided by the largest 
> key cardinality: either foreign key or primary key (which is, by definition, 
> equal to the cardinality of the master table.) 
> h4. Complexity: Compound Joins
> The above expression works well if the left input to a join is a base table 
> row which we know the original table cardinality that corresponds to the 
> original column NDV. The above can be used as-is in a M:N ("generic") join 
> for a right-side table. But, when when used on the left side, we must recall 
> that Impala builds left-deep join plans, so the left side may be a join. In 
> this case, there is no original left-side base table.
> See IMPALA-XXXX for discussion of the complexities in this case. A quick and 
> dirty solution is to use the scan output cardinality in place of the left 
> input cardinality. That is:
> * Determine the table that contains the join key.
> * Search the left subtree for the scan for that table.
> To do this, start with the left input:
> * If the node is a scan node, and the scan is for the target table, return 
> the output cardinality of that scan.
> * If the node is a join, then apply the search to *both* sides of the join.
> This approach is cumbersome, and may run into complexities if the node is 
> something other than a scan or join. It also may underestimate if a filter is 
> applied at the join level.
> A cleaner, tough more involved, solution is to track adjusted NDV for each 
> column through each operator as described in IMPALA-8220.
> h4. Code Bug
> The commit "IMPALA-5547: Rework FK/PK join detection", ID 
> [{{9f678a74269250bf5c7ae2c5e8afd93c5b3734de}}|https://github.com/apache/impala/commit/9f678a74269250bf5c7ae2c5e8afd93c5b3734de#diff-b10ccc2cdf68b236be400cde1e858a7c]
>  on Jun 6, 2017 reworked the FK/PK logic. It has one flaw: after determining 
> that we have an FK/PK (M:1) case, it then attempts to adjust the compound FK 
> and PK columns. This has two problems:
> * It is unnecessary, as we saw above.
> * The math is wrong and produces bogus estimates.
> Here is the code in {{JoinNode.java}}:
> {code:java}
>     long result = -1;
>     for (EqJoinConjunctScanSlots slots: eqJoinConjunctSlots) {
>       // Adjust the join selectivity based on the NDV ratio to avoid 
> underestimating
>       // the cardinality if the PK side has a higher NDV than the FK side.
>       double ndvRatio = 1.0;
>       if (slots.lhsNdv() > 0) ndvRatio = slots.rhsNdv() / slots.lhsNdv();
>       double rhsSelectivity = Double.MIN_VALUE;
>       if (slots.rhsNumRows() > 0) rhsSelectivity = rhsCard / 
> slots.rhsNumRows();
>       long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
>       if (result == -1) {
>         result = joinCard;
>       } else {
>         result = Math.min(result, joinCard);
>       }
>     }
>     // FK/PK join cardinality must be <= the lhs cardinality.
>     result = Math.min(result, lhsCard);
> {code}
> The above is hard to follow, which may account for why the bug was not 
> caught. Ignoring some corner cases, the logic is essentially:
> {noformat}
> lhsCard = |D'|
> ndvRatioi = |D.fki| / |M.pki|
> rhsSelectivityi = |M'| / |M|
> joinCardi = lhsCard * ndvRatioi * rhsSelectivityi
>           = |D'| * (|D.fki| / |M.pki|) * |M'| / |M|
> |join| = min(joinCardi)
>        = (|D'| * |M'| / |M|) * min(|D.fki| / |M.pki|)
> {noformat}
> Though hard to see, the above is not equivalent to the logic worked out in 
> the previous section. Using the correct expression from earlier sections:
> {noformat}
>                             |D'| * |M'| 
> |D' ⋈ M'| = ----------------------------------------------
>             max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)
> {noformat}
> We can factor out the common {{|D'| * |M'|}} terms and compare:
> {noformat}
> min(|D.fki| / |M.pki|)                     1
> ----------------------  !=  ----------------------------------------------
>         |M|                 max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|)
>              1                                  1
> ----------------------------  !=  
> ----------------------------------------------
> |M| * max(|M.pki| / |D.fki|)      max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), 
> |M'|)
> |M| * max(|M.pki| / |D.fki|)  !=  max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), 
> |M'|)
> {noformat}
> There is no valid operation that can convert one side to the other, so they 
> are unequal.
> It is likely that the code's version attempts to work around issues elsewhere 
> in the calculations (such as ignoring some predicates, using exponential 
> back-off for filters, not having a good estimate for {{|D|}}, etc.)
> h4. Longer-Term Fix
> The above simple fix is the target of this ticket. Longer term, the code 
> should evolve to use a single path for both the M:1 and M:N cases since as 
> described in IMPALA-8018. (Both cases start with HMS data. Currently we use 
> two paths to arrive at the same result. IMPALA-8018 suggests we need only one 
> path.) We should also adopt the simple urn model as described in IMPALA-8218.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to