[ 
https://issues.apache.org/jira/browse/IMPALA-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated IMPALA-8014:
--------------------------------
    Description: 
The math is wrong when we estimate the cardinality of a join we've labeled as 
"FK/PK" (commonly known as many-to-one or M:1.)

TL;DR: skip to the end for the proposed change.

Join logic is complex. To ensure that the analysis is sound we work it out from 
first principles, then verify against Swami & Schiefer, [On the Estimation 
of Join Result 
Sizes|https://pdfs.semanticscholar.org/2735/672262940a23cfce8d4cc1e25c0191de7da7.pdf]
 (S&S). Note especially section 3, Background.
h4. Definitions

The following terms are used below:
 * _Relation:_ either a _base table_ or the result of a join. (Relational 
theory says that both are simply relations.) We use an upper case letter, 
typically R, for a relation. E.g.: {{R1}}, {{R2}}
 * _Key:_ A column used in a join. We use a lower case letter for a column, and 
typically use “k” for a key, along with the relation number. E.g. {{k1}} and 
{{ON R1.k1 = R2.k2}}.
 * _Compound key:_ A join key made up of multiple columns. We use a letter to 
denote each column. E.g. {{k1a}} and {{ON R1.k1a = R2.k2a AND R1.k1b = R2.k2b}}
 * _Join:_ the join of two relations. Impala supports the join of two base 
tables (a table-to-table join), or the join of a base table and another join (a 
table-to-join join.) We use the symbol {{⋈}} to denote a join. Impala typically 
uses a hash join.
 * _Relation cardinality:_ the number of rows in the relation, denoted {{|R|}}.
 * _Join cardinality:_ the cardinality of the relation produced by the join. 
That is, the join’s output cardinality. Denoted as {{|R1 ⋈ R2|}}.
 * _Key column cardinality:_ the number of unique values in a column, denoted 
{{|k|}}. Also known as the NDV of the column or {{ndv(k)}}.

In Impala, HMS provides the relation and join key cardinality as table and 
column statistics respectively.

Since Impala typically uses hash joins, it is helpful to use terminology 
specific to that case:
 * _Probe_ side of a join: the larger side (ideally) of a hash join. Also 
called the “left” side. Is known as {{child(0)}} in the code. Appears at the 
same level as the join in the query plan (which would be to the left if the 
plan were rotated 90 degrees counter-clockwise.)
 * _Build_ side of a join: the smaller side (ideally) of a hash join. Also 
called the “right” side. Is known as {{child(1)}} in the code. Appears indented 
directly under the join in the query plan.

In Impala:
 * The detail table in a M:1 relationship is always on the probe (left) side of 
a join. Represented as {{child(0)}} in the code.
 * The master table in a M:1 relationship is always on the build (right) side 
of a join. Represented as {{child(1)}} in the code.

Finally, we also need:
 * The _scan_ of a table: the (possibly smaller) relation produced by applying 
one or more predicates while scanning a table. We are concerned with the 
cardinality of the scan, denoted as {{|R1'|}}. We assume that Impala has 
already used rules (not discussed here) to estimate the cardinality of the 
selection.

h4. Deriving the Join Formula

In RDBMS, a primary key (PK) is a column (or, more typically, set of columns) 
that uniquely identifies a row in the master table (M). Primary keys are 
generally indexed via a unique index. Since keys are unique:
{noformat}
|M.pk| = |M|
{noformat}
The Detail table forms a M:1 relationship with the master table. Each foreign 
key (FK) in the detail table (D) references one primary key in the master 
table. Because of the M:1 relationship:
{noformat}
|D.fk| <= |M| << |D|
{noformat}
If we read all rows from both tables, and all primary keys appear as foreign 
keys, then the join cardinality is simply:
{noformat}
|D ⋈ M| = |D|               [Equation 1]
{noformat}
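As a concrete running example (hypothetical numbers; think of {{D}} as an orders 
table and {{M}} as a customers table, with every order referencing some 
customer):
{noformat}
|D| = 10,000    |M| = |M.pk| = 100
|D ⋈ M| = |D| = 10,000
{noformat}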
h5. Filtering Master Rows

Let’s consider what happens when we filter out master (M) table rows producing 
a subset M'. To do so, we make three assumptions:
 * A uniform distribution of foreign keys (the “Uniformity” assumption in the 
S&S paper cited above),
 * Every primary key is referenced by an equal number of foreign keys (implied 
by the uniformity assumption),
 * The filter results in a random sampling of master rows not correlated with 
the join keys.

Then the probability of any particular primary key appearing is:
{noformat}
p(pk) = |M'| / |M|
{noformat}
The result value is a probability (hence the {{p()}} function) given by the 
ratio of selected rows to total table rows (from basic probability theory.) The 
value runs from 0 (no master rows match the scan predicates) to 1 (all rows 
present).

The revised join cardinality is:
{noformat}
|D ⋈ M'| = |D| * p(pk)
         = |D| * |M'| / |M|                [Equation 2]
{noformat}
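Continuing the hypothetical running example: if a predicate on the master table 
keeps only 40 of its 100 rows, then:
{noformat}
|D ⋈ M'| = |D| * |M'| / |M|
         = 10,000 * 40 / 100
         = 4,000
{noformat}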
h5. Filtering Detail Rows

Suppose we instead filter detail rows to produce a new subset D'. Again we make 
some assumptions:
 * Uniform distribution of foreign key values across primary keys.
 * The "Containment" assumption from the S&S paper that the set of foreign keys 
is a (possibly full) subset of the set of primary keys.
 * The filter results in a random sampling of detail rows not correlated with 
the join keys.

With these assumptions, we can see that if we join D' with the master table M, 
every row that remains in D' will still find a match in M, so:
{noformat}
|D' ⋈ M| = |D'|
{noformat}
h5. Filtering Both Master and Detail Rows

Combining the two selection models produces:
{noformat}
|D' ⋈ M'| = |D'| * |M'| / |M|           [Equation 3]
{noformat}
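With the same hypothetical numbers, if the detail scan also keeps only 2,000 of 
the 10,000 detail rows:
{noformat}
|D' ⋈ M'| = |D'| * |M'| / |M|
          = 2,000 * 40 / 100
          = 800
{noformat}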
h5. Non-Containment

The above is based on the S&S Containment assumption: that the set of foreign 
keys is a subset of the set of primary keys. In a normal RDBMS with integrity 
constraints, this is a valid assumption. But, in Big Data, things are messy and 
we can’t actually make this assumption. Fortunately, there is a way to loosen the 
containment assumption.

Suppose the master table has half the number of keys as the detail table. Using 
the uniformity assumption, half the foreign keys will go unmatched. If there 
are four times as many foreign keys as primary keys, only a quarter of the 
detail rows will find matches. The probability of any one foreign key finding a 
match is:
{noformat}
p(match) = |M.pk| / |D.fk|
{noformat}
The above applies only if {{|D.fk| > |M.pk|}}; no adjustment is needed if 
{{|D.fk| < |M.pk|}}. We can express this mathematically as:
{noformat}
p(match) = |M.pk| / max(|D.fk|, |M.pk|)       [Equation 4]
{noformat}
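For example (again hypothetical numbers), if the detail table contains 400 
distinct foreign key values but the master table has only 100 primary keys, 
then:
{noformat}
p(match) = |M.pk| / max(|D.fk|, |M.pk|)
         = 100 / max(400, 100)
         = 0.25
{noformat}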
Combining this with Equation 3, recalling that {{|M.pk| = |M|}}:
{noformat}
|D' ⋈ M'| = |D'| * (|M'| / |M|) * p(match)
          = |D'| * |M'| / |M| * |M.pk| / max(|D.fk|, |M.pk|)
          = |D'| * |M'| * (|M| / |M|) / max(|D.fk|, |M.pk|)
          = |D'| * |M'| / max(|D.fk|, |M|)  
{noformat}
h5. Complete Expression

If we rearrange the above we get the complete expression:
{noformat}
               |D'| * |M'| 
|D' ⋈ M'| = ----------------                  [Equation 5]
            max(|D.fk|, |M|)
{noformat}
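As a sanity check on Equation 5, here is a minimal, self-contained Java sketch 
(illustrative only; the class, method, and parameter names are invented for this 
note and are not planner code):
{code:java}
// Sketch of Equation 5, not planner code; all names here are hypothetical.
public class FkPkCardinalitySketch {
  /**
   * @param detailScanCard |D'|, detail rows surviving the scan
   * @param masterScanCard |M'|, master rows surviving the scan
   * @param detailFkNdv    |D.fk|, NDV of the foreign key column
   * @param masterCard     |M|, master table cardinality (= |M.pk| by definition)
   */
  static long joinCardinality(long detailScanCard, long masterScanCard,
      long detailFkNdv, long masterCard) {
    // Equation 5: |D' join M'| = |D'| * |M'| / max(|D.fk|, |M|)
    double denom = Math.max(detailFkNdv, masterCard);
    if (denom <= 0) return detailScanCard;  // no usable stats; fall back to |D'|
    return (long) Math.ceil(detailScanCard * masterScanCard / denom);
  }

  public static void main(String[] args) {
    // The example worked in the "Bug in the Code" section below:
    // |D'| = 10,000, |M'| = |M| = 100, |D.fk| = 25  ->  10,000
    System.out.println(joinCardinality(10_000, 100, 25, 100));
  }
}
{code}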
h4. The S&S Equation

Let’s check our answer against the S&S paper cited above by assuming the 
master/detail relationship and removing the effect of scans from our equation. 
The paper’s equation:
{noformat}
|R1 ⋈ R2| = min( |k1|, |k2| ) * ( |R1| / |k1| ) * ( |R2| / |k2| )
          = |R1| * |R2| / max( |k1|, |k2| )
{noformat}
If we assume:
 * {{R1 = D}}
 * {{R2 = M}}
 * {{k1 = D.fk}}
 * {{k2 = M.pk}}
 * {{|D.fk| > |M.pk|}}

We get:
{noformat}
|D ⋈ M| = |D| * |M| / max( |D.fk|, |M.pk| )
{noformat}
Which is the same as Equation 5 (assuming no selection). All good.
h4. Join Estimation in Impala

Join cardinality is estimated in {{JoinNode.computeStats()}}, and especially in 
{{getJoinCardinality()}}, which implements the infamous "FK/PK" heuristics. 
"The FK/PK detection logic is based on the assumption that most joins are 
FK/PK. ... In the absence of relevant stats, we assume FK/PK with a join 
selectivity of 1." As noted above, M:1 is more common name for "FK/PK".

During execution, the code first ensures that the larger table is on the probe 
(left) side. So, {{getJoinCardinality()}} assumes that L is the detail 
table with the foreign key (FK), while the smaller build-side table is the 
master table with the primary key (PK). That is L:R is M:1, joined on {{L.<FK> 
= R.<PK>}}.

Next, let's translate the code's naming to the terms defined above.

A comment in the code explains the cardinality estimation model:
{noformat}
cardinality = |child(0)| * (|child(1)| / |R|) * (NDV(R.d) / NDV(L.c))
{noformat}
We can check how the code implements the expression:
{code:java}
      ndvRatio = slots.rhsNdv() / slots.lhsNdv();
      double rhsSelectivity = rhsCard / slots.rhsNumRows();
      long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
{code}
So, the comment accurately describes the code.

Let’s translate the code’s expression to the terms defined above:
{noformat}
                 |L'|          |R'|
              ----^-----    ----^-----
cardinality = |child(0)| * (|child(1)| / |R|) * (NDV(R.d) / NDV(L.c))
{noformat}
Using:
 * {{cardinality}} here means the cardinality of the join, hence {{|D' ⋈ M'|}}
 * {{L}} = Left = Probe table (before scan) = Detail table, hence {{D}}
 * {{R}} = Right = Build table (before scan) = Master table, hence {{M}}
 * {{child(0)}} is the output of the scan of the left table {{D}}, hence {{D'}}
 * {{child(1)}} is the output of the scan of the right table {{M}}, hence {{M'}}
 * {{R.d}} is the join key of the R (master) table, hence {{M.pk}}
 * {{L.c}} is the join key of the L (detail) table, hence {{D.fk}}
 * NDV is, as we said, the cardinality of the key column, hence {{|k|}}.

Gives:
{noformat}
|D' ⋈ M'| = |D'| * (|M'| / |M|) * (|M.pk| / |D.fk|).   [Sic]
{noformat}
The code expression is actually overly complex. Recall that, by definition, 
{{|M| = |M.pk|}}:
{noformat}
|D' ⋈ M'| = |D'| * (|M'| / |M|) * (|M.pk| / |D.fk|)
          = |D'| * (|M'| / |D.fk|) * (|M.pk| / |M|)
          = |D'| * (|M'| / |D.fk|) * (|M| / |M|)
          = |D'| * |M'| / |D.fk|
{noformat}
As we’ll see, this formula is *NOT* correct.

h4. Bug in the Code

Compare Equation 5 to the expression in the code:
{noformat}
|D' ⋈ M'| = |D'| * |M'| / |D.fk|                — Code

|D' ⋈ M'| = |D'| * |M'| / max(|D.fk|, |M.pk|)   — Derived
{noformat}
Notice that the code contains a severe error: it always divides by the foreign 
key cardinality, even when that cardinality is *smaller* than the primary key 
cardinality. This makes the estimated join cardinality larger than the probe 
cardinality, which cannot occur if we assume a master/detail (PK/FK) relationship.

To ensure we are reading all this correctly, let's work out an example. Let's 
assume that our detail table has 10K rows, our master table 100 rows, and that 
only 25 of the master records are referenced. Let's assume the scans remove no 
rows. That is:
 * {{|D'| = |D| = 10,000}}
 * {{|M'| = |M| = |M.pk| = 100}}
 * {{|D.fk| = 25}}

Using Equation 5:
{noformat}
|D' ⋈ M'| = |D'| * |M'| / max( |D.fk|, |M.pk| )
          = |D| * |M| / max( |D.fk|, |M.pk| )
          = 10,000 * 100 / max( 25, 100 )
          = 10,000 * 100 / 100
          = 10,000
{noformat}
Makes sense: all 10K detail rows match some master row, though 3/4 of the 
master rows are unreferenced.

Using the formula from the code:
{noformat}
cardinality = |D| * |M| / |D.fk|
            = 10,000 * 100 / 25
            = 10,000 * 4
            = 40,000
{noformat}
As a result, the fewer primary keys are referenced, the more we expect 
the input rows to be replicated and the greater the (mis)estimated output 
cardinality. This, of course, leads to poor join selection and thus a poor plan.

Let's check this empirically by printing out {{ndvRatio}} (see the code snippet 
above) when running some queries:
{noformat}
ndv ratio = 2.0977582147609786
ndv ratio = 3.3183831043488135
ndv ratio = 2.0977582147609786
ndv ratio = 2.0977582147609786
{noformat}
So, this seems like a real bug.
h4. Compound Keys

The discussion thus far has assumed a single key column per table so that we 
have NDV values for each. The next challenge occurs if the query joins on a 
compound key:
{noformat}
SELECT …
FROM T1, T2
WHERE T1.k1a = T2.k2a AND T1.k1b = T2.k2b
{noformat}
We will use the symbol {{j}} to mean a joint (compound) key. The joint key for 
table {{T1}} is {{T1.j1}} and {{|T1.j1|}} is the cardinality (NDV) of that key.

In an RDBMS, if an index exists on (k1a, k1b) and (k2a, k2b) then it is 
possible to determine the joint NDV (which is just the number of index 
entries.) In HMS, we have the individual NDVs, but not the joint NDV.

The S&S paper supplies the require assumption: “Independence: Within each 
relation, values chosen from distinct columns that are involved in join 
predicates are independent.” This lets us estimate the joint NDV:

One way to think of this is the the columns form a matrix, say (year, month). 
Every year has 12 months, so there is no correlation between the two. If we 
have data for 10 years, we will have 10 * 12 = 120 distinct key values.
{noformat}
|T.j| = |T.k1| * |T.k2| * … * |T.kn|
      = ∏ |T.ki|
{noformat}
Where {{∏}} is the product operator. (S&S, section 3.3, calls this the 
"Multiplicative Rule.")

Of course, keys are not always independent. Suppose the keys are (state, city). 
In this case, there are 50 states. But most city names occur in only one state, 
so NDV(state) * NDV(city) >> NDV(state, city). Still, we know that, regardless 
of the correlation, we can never have more keys than rows. We can adjust the 
expression to:
{noformat}
|T.j| = min( ∏ |T.ki|, |T| )
{noformat}
We can also assume that joins are typically formulated to be as selective as 
possible, so the assumption may not be too far off. Still, if a file contains 
addresses, so the key is (state, city, street), and we join on just the 
(state, city) prefix, the joint NDV will be smaller than the table cardinality. 
So we need both terms above.
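A minimal sketch of this capped joint-NDV estimate (illustrative only; the 
helper name and parameters are invented, and a missing per-column NDV is simply 
treated as 1):
{code:java}
// Illustrative helper, not planner code: |T.j| = min(product of |T.ki|, |T|)
static double jointKeyNdv(double[] columnNdvs, double tableCard) {
  double product = 1.0;
  for (double ndv : columnNdvs) {
    product *= Math.max(ndv, 1.0);  // assumption: a missing/zero NDV contributes a factor of 1
  }
  return Math.min(product, tableCard);  // never more distinct keys than rows
}
{code}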

Adding this to the expression derived above:
{noformat}
                             |D'| * |M'|
|D' ⋈ M'| = ---------------------------------------------      [Equation 6]
            max(min(∏ |D.fki|, |D|), min(∏ |M.pki|, |M|))
{noformat}
Impala does use the multiplicative rule. In 
{{JoinNode.getFkPkEqJoinConjuncts()}}:
{code:java}
      double jointNdv = 1.0;
      for (EqJoinConjunctScanSlots slots: fkPkCandidate) jointNdv *= slots.rhsNdv();
{code}
However, the calculations become muddled in the actual cardinality estimation 
in {{JoinNode.getFkPkJoinCardinality()}}:
{code:java}
    long result = -1;
    for (EqJoinConjunctScanSlots slots: eqJoinConjunctSlots) {
      // Adjust the join selectivity based on the NDV ratio to avoid underestimating
      // the cardinality if the PK side has a higher NDV than the FK side.
      double ndvRatio = 1.0;
      if (slots.lhsNdv() > 0) ndvRatio = slots.rhsNdv() / slots.lhsNdv();
      double rhsSelectivity = Double.MIN_VALUE;
      if (slots.rhsNumRows() > 0) rhsSelectivity = rhsCard / slots.rhsNumRows();
      long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
      if (result == -1) {
        result = joinCard;
      } else {
        result = Math.min(result, joinCard);
      }
    }
{code}
h4. Proposed Fix

All of this leads to the proposed fix, which can be expressed succinctly: use 
the correct expression, Equation 6 above.
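For concreteness, here is a sketch of what the corrected calculation could look 
like (illustrative only, not the actual patch; the method and parameter names 
are invented, and missing statistics are handled only crudely):
{code:java}
// Sketch of Equation 6; all names are hypothetical, not the proposed patch itself.
static long estimateFkPkJoinCardinality(
    double lhsCard, double rhsCard,        // |D'| and |M'|: scan output cardinalities
    double lhsNumRows, double rhsNumRows,  // |D| and |M|: base table cardinalities
    double[] fkNdvs, double[] pkNdvs) {    // per-column NDVs of the compound FK and PK
  double fkJointNdv = 1.0;
  for (double ndv : fkNdvs) fkJointNdv *= Math.max(ndv, 1.0);
  double pkJointNdv = 1.0;
  for (double ndv : pkNdvs) pkJointNdv *= Math.max(ndv, 1.0);
  // Cap each joint NDV at its table cardinality: min(product |D.fki|, |D|), min(product |M.pki|, |M|).
  fkJointNdv = Math.min(fkJointNdv, lhsNumRows);
  pkJointNdv = Math.min(pkJointNdv, rhsNumRows);
  // Equation 6: |D' join M'| = |D'| * |M'| / max(capped FK NDV, capped PK NDV)
  double denom = Math.max(fkJointNdv, pkJointNdv);
  if (denom <= 0) return (long) Math.ceil(lhsCard);  // no usable stats; fall back to |D'|
  return (long) Math.ceil(lhsCard * rhsCard / denom);
}
{code}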

  was:
The math is wrong when we estimate the cardinality of a join we've labeled as 
"FK/PK" (commonly known as many-to-one or M:1.)

This is a version of the notes found 
[here|https://github.com/paul-rogers/impala/wiki/Planner].

h4. Background

Join cardinality is estimated in {{JoinNode.computeStats()}}, and especially in 
{{getJoinCardinality()}}, which implements the infamous "FK/PK" heuristics. 
"The FK/PK detection logic is based on the assumption that most joins are 
FK/PK. ... In the absence of relevant stats, we assume FK/PK with a join 
selectivity of 1." As noted above, M:1 is more common name for "FK/PK".

Earlier logic ensured that the larger table is on the probe (left) side. So, 
the code assumes that L is the detail table with the foreign key (FK), 
while the smaller build-side table is the master table with the primary key 
(PK). That is L:R is M:1, joined on {{L.<FK> = R.<PK>}}.

Next, let's translate the code's naming to the more traditional (and 
descriptive, if somewhat offensive) RDBMS naming:

* *master table:* the 1 side of the M:1 relationship, has a unique primary key 
(PK) typically enforced via a unique index. Referred to as the R (build) table 
in Impala. We will use the symbol {{m}} for the master table, and {{m.pk}} for 
the (presumed) primary key column.
* *detail table:* the M side of the M:1 relationship, has a non-unique foreign 
key (FK) that points to one entry in the master table. Referred to as the L 
(probe) table in Impala. We will use the symbol {{d}} for the detail table, and 
{{d.fk}} for the (presumed) foreign key column.

And, we need some notation:

* {{|x|}} is the cardinality of relation (table) {{x}}. The relation can be a 
base table or the output of a prior join.
* {{|scan\(x)|}} is the cardinality out of an Impala scan node after applying 
filter predicates.
* {{ndv\(x)}} is the NDV of column {{x}}.

To paraphrase a comment in the code, cardinality estimation is:

{noformat}
|d >< m| = |scan(d)| * (|scan(m)| / |m|) * (ndv(m.pk) / ndv(d.fk))
{noformat}

Where:

* {{|scan\(d)|}} is the cardinality of the detail scan after applying any 
predicates to the table cardinality.
* {{|scan\(m)|}} is the same for the master scan.
* {{|m|}} is the cardinality of the master table itself.
* {{ndv(m.pk)}} is the NDV of the presumed primary key column in the master 
(right, build, smaller) table.
* {{ndv(d.fk)}} is the NDV of the presumed foreign key column in the detail 
(left, probe, larger) table.

h4. Deriving the Formula

Let's tear this apart.

In RDBMS, a Primary Key is a column (or, more typically, a set of columns) that 
uniquely identifies a row, generally enforced via a unique index. Since keys 
are unique:

{noformat}
ndv(m.pk) = |m|
{noformat}

Then since there is a M:1 relationship between the detail (d) and master (m) 
tables:

{noformat}
ndv(d.fk) <= |m| << |d|
{noformat}

If we read all rows from both tables, and all primary keys appear as foreign 
keys, then the join cardinality is simply:

{noformat}
|d >< m| = |d|
{noformat}

If we filter out some detail rows, then cardinality is:

{noformat}
|d >< m| = |scan(d)|
{noformat}

If we filter out rows in the master table, and we assume uniform distribution 
of foreign keys, then the probability of any foreign key finding a primary key 
match is:

{noformat}
p(pk) = |scan(m)| / |m|
{noformat}

The result is the probability of a given pk being presented to the join; it 
runs from 0 (no rows match the predicates) to 1 (all rows present).

The revised join cardinality is:

{noformat}
|d >< m| = |scan(d)| * p(pk)
         = |scan(d)| * |scan(m)| / |m|
{noformat}

There is some built-in selectivity: perhaps not all the primary keys are 
actually referenced. In fact, we can figure out how many actually are used. 
Let's assume that not all primary keys appear as foreign keys in the detail 
table. Then, the probability of a match is:

{noformat}
p(match) = ndv(d.fk) / ndv(m.pk)
{noformat}

If all primary keys appear in the detail table, the probability of a match is 
1. If none appear (the detail table is empty), then the probability is 0.

On the other hand, if more foreign keys exist than primary keys, then some of the 
foreign keys won't find a match on an inner join, so:

{noformat}
p(match) = ndv(m.pk) / ndv(d.fk)
{noformat}

If there are exactly as many distinct foreign keys as primary keys, the 
probability of a match is 1. If there are twice as many foreign keys as primary 
keys, the probability of a match is 50%.

Combining the two cases:

{noformat}
p(match) = ndv(d.fk) / ndv(m.pk) if ndv(d.fk) < ndv(m.pk),
           ndv(m.pk) / ndv(d.fk) otherwise.
{noformat}

(As it turns out, the above logic is used in the non-PK/FK case in the Planner 
code. But, as we'll see, half is missing in the FK/PK code.)

Adding this to the join cardinality estimate for the normal case of fewer FK's 
than PK's we get:

{noformat}
|d >< m| = |scan(d)| * p(pk) * p(match)
         = |scan(d)| * |scan(m)| / |m| * ndv(d.fk) / ndv(m.pk)
{noformat}

On the other hand, if we assume more foreign keys than primary keys, we get:

{noformat}
|d >< m| = |scan(d)| * p(pk) * p(match)
         = |scan(d)| * |scan(m)| / |m| * ndv(m.pk) / ndv(d.fk)
{noformat}

h4. Bug in the Code

Compare the above derived formula to the one in the code:

{noformat}
cardinality = |scan(d)| * (|scan(m)| / |m|) * (NDV(m.pk) / NDV(d.fk))
{noformat}

We can check how the code implements the formula:

{code:java}
      ndvRatio = slots.rhsNdv() / slots.lhsNdv();
      double rhsSelectivity = rhsCard / slots.rhsNumRows();
      long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
{code}

So, the comment accurately describes the code.

But, it contains a severe error. Consider a normal 1:M case: the master-detail 
relationship such as customers and orders. Not all customers have an order, but 
all orders must have been placed by customers. As a result, the normal use is 
fewer foreign keys than primary keys. But, the formula above is for the 
{{ndv(d.fk) > ndv(m.pk)}} case. Why is this wrong?

Assume half of customers never placed an order. The ratio in the code is 
{{NDV(m.pk) / NDV(d.fk)}} or, say, 100 customers / 50 customers who placed an 
order, or 2. That is, the fewer customers who have placed an order, the *greater* 
the assumed join cardinality.

The common case in data warehousing is a fact (detail) table and a dimension 
(master) table. Again, if the dimension table has more keys than the detail 
table uses, the join cardinality will be overestimated.

When the ratio is low, the cardinality estimate may become so high that the 
planner rejects a valid join in favor of one that seems better, but is actually 
much worse at execution time.

The actual formula in the comments (with some annotations) is:

{noformat}
              |scan(L)|     |scan(R)|
              ----^-----    ----^-----
cardinality = |child(0)| * (|child(1)| / |R|) * (NDV(R.d) / NDV(L.c))

L = detail, L.c = d.fk
R = master, R.d = m.pk
{noformat}

To ensure we are reading all this correctly, let's work out a detailed example. 
Let's assume that our detail table has 10K rows, our master table 100 rows, and 
that only 25 of the master records are referenced. Let's assume the scans 
remove no rows. Using the formula derived above:

{noformat}
|join| = |scan(d)| * (|scan(m)| / |m|) * (ndv(d.fk) / ndv(m.pk))
       = |d| * (|m| / |m|) * (ndv(d.fk) / ndv(m.pk))
       = 10,000 * 1 * (25 / 100)
       = 10,000 / 4
       = 2,500
{noformat}

Using the formula from the code:

{noformat}
L = d
R = m
cardinality = |child(0)| * (|child(1)| / |R|) * (NDV(R.d) / NDV(L.c))
            = |scan(L)| * (|scan(R)| / |R|) * (NDV(R.d) / NDV(L.c))
            = |scan(d)| * (|scan(m)| / |m|) * (NDV(m.pk) / NDV(d.fk))
            = |d| * (|m| / |m|) * (NDV(m.pk) / NDV(d.fk)) 
            = 10,000 * 1 * (100 / 25)
            = 10,000 * 4
            = 40,000
{noformat}

As a result, the fewer primary keys are referenced, the more we expect 
the input rows to be replicated.

Let's check this empirically by printing out {{ndvRatio}} when running some 
queries:

{noformat}
ndv ratio = 2.0977582147609786
ndv ratio = 3.3183831043488135
ndv ratio = 2.0977582147609786
ndv ratio = 2.0977582147609786
{noformat}

So, this seems like a real bug.

h4. Proposed Fix

The easiest fix is just to invert the ratio. But, running {{PlannerTest}} shows 
that we need to handle both cases:

{noformat}
ndv ratio = 1.5245451773554224
ndv ratio = 0.10545630784098954
ndv ratio = 0.959424038561171
ndv ratio = 335.3333333333333
{noformat}

If we inverted these as proposed, we'd see:

{noformat}
0.67
10
1.04
0.003
{noformat}

This means we have cases where {{ndv(d.fk) > ndv(m.pk)}} *and* {{ndv(d.fk) < 
ndv(m.pk)}}. Thus, we need to compute the probability using the two-part 
calculation shown above.

The proposed fix is to add that calculation to the code so that the ndv ratio 
in the snippet of {{PlannerTest}} above would be:

{noformat}
0.67
0.10
0.96
0.003
{noformat}

The probability is never greater than 1.

As noted in the text above, the code already attempts to use this calculation 
in the non-PK/FK case (though that code also has bugs.)


> Incorrect M:1 (FK/PK) cardinality estimation
> --------------------------------------------
>
>                 Key: IMPALA-8014
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8014
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>