[ 
https://issues.apache.org/jira/browse/IMPALA-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731478#comment-16731478
 ] 

Paul Rogers commented on IMPALA-8014:
-------------------------------------

Example incorrect calculation:

{code:sql}
select m.c_custkey, d.o_custkey
from tpch.customer m,
     tpch.orders d
where m.c_custkey = d.o_custkey
  and m.c_name = 'foo'
{code}

Here, the {{WHERE}} clause predicate is on a unique key as is the join 
predicate. There should be only one row that matches the {{WHERE}} predicate, 
and that one row joins with the 10 (on average) orders. (Using NDV stats to 
work out these assertions.)

Yet, the planner believes that 16 rows will match:

{noformat}
---- PLAN
02:HASH JOIN [INNER JOIN]
|  hash predicates: d.o_custkey = m.c_custkey
|  fk/pk conjuncts: d.o_custkey = m.c_custkey
|  tuple-ids=1,0 row-size=46B cardinality=16
|
|--00:SCAN HDFS [tpch.customer m]
|     partitions=1/1 files=1 size=23.08MB
|     predicates: m.c_name = 'foo'
|     tuple-ids=0 row-size=38B cardinality=1
|
01:SCAN HDFS [tpch.orders d]
   partitions=1/1 files=1 size=162.56MB
   tuple-ids=1 row-size=8B cardinality=1500000
{noformat}


> Incorrect M:1 (FK/PK) cardinality estimation
> --------------------------------------------
>
>                 Key: IMPALA-8014
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8014
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>
> The math is wrong when we estimate the cardinality of a join we've labeled as 
> "FK/PK" (commonly known as many-to-one or M:1.)
> TL;DR: skip to the end for the proposed change.
> Join logic is complex. To ensure that the analysis is sound we work it out 
> from first principles, then verify against the Swami & Schiefer, [On the 
> Estimation of Join Result 
> Sizes|https://pdfs.semanticscholar.org/2735/672262940a23cfce8d4cc1e25c0191de7da7.pdf]
>  (S&S). Note especially section 3, Background.
> h4. Definitions
> The following terms are used below:
>  * _Relation:_ either a _base table_ or the result of a join. (Relational 
> theory says that both are simply relations.) We use an upper case letter, 
> typically R, for a relation. E.g.: {{R1}}, {{R1}}
>  * _Key:_ A column used in a join. We use a lower case letter for a column, 
> and typically use “k” for a key, along with the relation number. E.g. {{k1}} 
> and {{ON R1.k1a = R2.k2a}}.
>  * _Compound key:_ A join key made up of multiple columns. We use a letter to 
> denote each column. E.g. {{k1a}} and {{ON R1.k1a = R2.k2a AND R1.k1b = 
> R2.k2b}}
>  * _Join:_ the join of two relations. Impala supports the join of two base 
> tables (a table-to-table join), or the join of a base table and another join 
> (a table-to-join join.) We use the symbol {{⋈}} to denote a join. Impala 
> typically uses a hash join.
>  * _Relation cardinality:_ the number of rows in the relation, denoted 
> {{|R|}}.
>  * _Join cardinality:_ the cardinality of the relation produced by the join. 
> That is, the join’s output cardinality. Denoted as {{|R1 ⋈ R2|}}.
>  * _Key column cardinality:_ the number of unique values in a column, denoted 
> {{|k|}}. Also known as the NDV of the column or {{ndv(k)}}.
> In Impala, HMS provides the relation and join key cardinality as table and 
> column statistics respectively.
> Since Impala typically uses hash joins, it is helpful to use terminology 
> specific to that case:
>  * _Probe_ side of a join: the larger side (ideally) of a hash join. Also 
> called the “left” side. Is known as {{chiild(0)}} in the code. Appears at the 
> same level as the join in the query plan (which would be to the left if the 
> plan were rotated 90 degrees counter-clockwise.)
>  * _Build_ side of a join: the smaller side (ideally) of a hash join. Also 
> called the “right” side. Is known as {{child(1)}} in the code. Appears 
> indented directly under the join in the query plan.
> In Impala:
>  * The detail table in a M:1 relationship is always on the probe (left) side 
> of a join. Represented as {{child(0)}} in the code.
>  * The master table in a M:1 relationship is always on the build (right) side 
> of a join. Represented as {{child(1)}} in the code.
> Finally, we also need:
>  * The _scan_ of a table: the (possibly smaller) relation produced by 
> applying one or more predicates while scanning a table. We are concerned with 
> the cardinality of the scan, denoted as {{|R1'|}}. We assume that Impala has 
> already used rules (not discussed here) to estimate the cardinality of the 
> selection.
> h4. Deriving the Join Formula
> In RDBMS, a primary key (PK) is a column (or, more typically, set of columns) 
> that uniquely identify a row in the master table (M). Primary keys are 
> generally indexed via a unique index. Since keys are unique:
> {noformat}
> |M.pk| = |M|
> {noformat}
> The Detail table forms a M:1 relationship with the master table. Each foreign 
> key (FK) in the detail table (D) references one primary key in the master 
> table. Because of the M:1 relationship:
> {noformat}
> |D.fk| <= |M| << |D|
> {noformat}
> If we read all rows from both tables, and all primary keys appear as foreign 
> keys, the the join cardinality is simply:
> {noformat}
> |D ⋈ M| = |D|               [Equation 1]
> {noformat}
> h5. Filtering Master Rows
> Let’s consider what happens when we filter out master (M) table rows 
> producing a subset M'. To do so, we make three assumptions:
>  * A uniform distribution of foreign keys, (the “Uniformity” assumption in 
> the S&S paper cited above),
>  * Every primary key is referenced by an equal number of foreign keys 
> (implied by the uniformity assumption),
>  * The filter results in a random sampling of master rows not correlated with 
> the join keys.
> Then the probability of any particular primary key appearing is:
> {noformat}
> p(pk) = |M'| / |M|
> {noformat}
> The result value is a probability (hence the {{p()}} function) given by the 
> ratio of selected rows to total table rows (from basic probability theory.) 
> The value runs from 0 (no master rows match the scan predicates) to 1 (all 
> rows present).
> The revised join cardinality is:
> {noformat}
> |D ⋈ M'| = |D| * p(pk)
>          = |D| * |M'| / |M|                [Equation 2]
> {noformat}
> h5. Filtering Detail Rows
> Suppose we instead filter detail rows to produce a new subset D'. Again we 
> make some assumptions:
>  * Uniform distribution of foreign key values across primary keys.
>  * The "Containment" assumption from the S&S paper that the set of foreign 
> keys is a (possibly full) subset of the set of primary keys.
>  * The filter results in a random sampling of detail rows not correlated with 
> the join keys.
> With these assumptions, we can see that if we join D' with the master table 
> M, every row that remains in D' will still find a match in M, so:
> {noformat}
> |D' ⋈ M| = |D'|
> {noformat}
> h5. Filtering Both Master and Detail Rows
> Combining the two selection models produces:
> {noformat}
> |D' ⋈ M'| = |D'| * |M'| / |M|           [Equation 3]
> {noformat}
> h5. Non-Containment
> The above is based on the S&S Containment assumption: that the set of foreign 
> keys is a subset of the set of primary keys. In a normal RDBMS with integrity 
> constraints, this is a valid assumption. But, in Big Data, things are messy 
> and we can’t actually make this assumption. Fortunately, there is a way 
> loosen the containment assumption.
> Suppose the master table has half the number of keys as the detail table. 
> Using the uniformity assumption, half the foreign keys will go unmatched. If 
> there are four times as many foreign keys as primary keys, only a quarter of 
> the detail rows will find matches. The probability of any one foreign key 
> finding a match is:
> {noformat}
> p(match) = |M.pk| / |D.fk|
> {noformat}
> The above applies only if {{|D.fk| > |M.pk|}}; no adjustment is needed if 
> {{|D.fk| < |M.pk|}}. We can express this mathematically as:
> {noformat}
> p(match) = |M.pk| / max(|D.fk|, |M.pk|)       [Equation 4]
> {noformat}
> Combining this with Equation 3, recalling that {{|M.pk| = |M|}}:
> {noformat}
> |D' ⋈ M'| = |D'| * (|M'| / |M|) * p(match)
>           = |D'| * |M'| / |M| * |M.pk| / max(|D.fk|, |M.pk|)
>           = |D'| * |M'| * (|M| / |M|) / max(|D.fk|, |M.pk|)
>           = |D'| * |M'| / max(|D.fk|, |M|)  
> {noformat}
> h5. Complete Expression
> If we rearrange the above we get the complete expression:
> {noformat}
>                |D'| * |M'| 
> |D' ⋈ M'| = ----------------                  [Equation 5]
>             max(|D.fk|, |M|)
> {noformat}
> h4. The S&S Equation
> Let’s check out answer against the S&S paper cited above by assuming the 
> master/detail relationship and removing from our equation the affect of 
> scans. The paper’s equation:
> {noformat}
> |R1 ⋈ R2| = min( |k1|, |k2| ) * ( |R1| / |k1| ) * ( |R2| * |k2| )
>           = |R1| * |R2| / max( |k1|, |k2| )
> {noformat}
> If we assume:
>  * {{R1 = D}}
>  * {{R2 = M}}
>  * {{k1 = D.fk}}
>  * {{k2 = M.pk}}
>  * {{|D.fk| > |M.pk|}}
> We get:
> {noformat}
> |D ⋈ M| = |D| * |M| / max( |D.fk|, |M.pk| )
> {noformat}
> Which is the same as Equation 5 (assuming no selection). All good.
> h4. Join Estimation in Impala
> Join cardinality is estimated in {{JoinNode.computeStats()}}, and especially 
> in {{getJoinCardinality()}}, which implements the infamous "FK/PK" 
> heuristics. "The FK/PK detection logic is based on the assumption that most 
> joins are FK/PK. ... In the absence of relevant stats, we assume FK/PK with a 
> join selectivity of 1." As noted above, M:1 is more common name for "FK/PK".
> During execution, the code first ensures that the larger table is on the 
> probe (left) side. So, {{getJoinCardinality()}} assumes that the L is the 
> detail table with the foreign key (FK), while the smaller build-side table is 
> the master table with the primary key (PK). That is L:R is M:1, joined on 
> {{L.<FK> = R.<PK>}}.
> Next, let's translate the code's naming to the terms defined above.
> A comment in the code explains the cardinality estimation model:
> {noformat}
> cardinality = |child(0)| * (|child(1)| / |R|) * (NDV(R.d) / NDV(L.c))
> {noformat}
> We can check how the code implements the expression:
> {code:java}
>       ndvRatio = slots.rhsNdv() / slots.lhsNdv();
>       double rhsSelectivity = rhsCard / slots.rhsNumRows();
>       long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
> {code}
> So, the comment accurately describes the code.
> Let’s translate the code’s expression to the terms defined above:
> {noformat}
>                  |L'|          |R'|
>               ----^-----    ----^-----
> cardinality = |child(0)| * (|child(1)| / |R|) * (NDV(R.d) / NDV(L.c))
> {noformat}
> Using:
>  * {{cardinality}} here means the cardinality of the join, hence {{|𝜎D ⋈ 𝜎M|}}
>  * {{L}} = Left = Probe table (before scan) = Detail table, hence {{D}}
>  * {{R}} = Right = Build table (before scan) = Master table, hence {{M}}
>  * {{child(0)}} is the output of the scan the left table or {{D}}, hence 
> {{𝜎D}}
>  * {{child(1)}} is the output of the scan the right table or {{M}}, hence 
> {{𝜎M}}
>  * {{R.d}} is the join key of the R (master) table, hence {{M.pk}}
>  * {{L.c}} is the join key of the L (detail) table, hence {{D.fk}}
>  * NDV is, as we said, the cardinality of the key column, hence {{|k|}}.
> Gives:
> {noformat}
> |D' ⋈ M'| = |D'| * (|M'| / |M|) * (|M.pk| / |D.fk|).   [Sic]
> {noformat}
> The code expression is actually overly complex. Recall that, by definition, 
> {{|M| = |M.pk|}}:
> {noformat}
> |D' ⋈ M'| = |D'| * (|M'| / |M|) * (|M.pk| / |D.fk|)
>           = |D'| * (|M'| / |D.fk|) * (|M.pk| / |M|)
>           = |D'| * (|M'| / |D.fk|) * (|M| / |M|)
>           = |D'| * |M'| / |D.fk|
> {noformat}
> As we’ll see, this formula is *NOT* correct.
> h4. Bug in the Code
> Compare Equation 3 to the one in the code:
> {noformat}
> |D' ⋈ M'| = |D'| * |M'| / |D.fk|                — Code
> |D' ⋈ M'| = |D'| * |M'| / max(|D.fk|, |M.pk|)   — Derived
> {noformat}
> Notice the code contains a severe error: it divides by the foreign key 
> cardinality, even when that cardinality is *smaller* than that of the primary 
> key, causing the join cardinality to be larger than the probe cardinality, 
> which cannot occur if we assume a master/detail (PK/FK) relationship.
> To ensure we are reading all this correctly, let' work out an example. Let's 
> assume that our detail table has 10K rows, our master table 100 rows, and 
> that only 25 of the master records are referenced. Let's assume the scans 
> remove no rows. That is:
>  * {{|D'| = |D| = 10,000}}
>  * {{|M'| = |M| = = |M.pk| = 100}}
>  * {{|D.fk| = 25}}
> Using Equation 3:
> {noformat}
> |D' ⋈ M'| = |D'| * |M'| / max( |D.fk|, |M.pk| )
>           = |D| * |M| / max( |D.fk|, |M.pk| )
>           = 10,000 * 100 / max( 100, 25 )
>           = 10,000 * 100 / 100
>           = 10,000
> {noformat}
> Makes sense: all 10K detail rows match some master row, though 3/4 of the 
> master rows are unreferenced.
> Using the formula from the code:
> {noformat}
> cardinality = |D| * |M| / |D.fk|
>             = 10,000 * 100 / 25
>             = 10,000 * 4
>             = 40,000
> {noformat}
> As a result, the fewer of the primary keys are referenced, the more we expect 
> the input rows to be replicated and the greater the (mis)estimated output 
> cardinality. This, of course, leads to poor join selection and thus a poor 
> plan.
> Let's check this empirically by printing out the `ndvRatio` (see code snippet 
> above) when running some queries:
> {noformat}
> ndv ratio = 2.0977582147609786
> ndv ratio = 3.3183831043488135
> ndv ratio = 2.0977582147609786
> ndv ratio = 2.0977582147609786
> {noformat}
> So, this seems like a real bug.
> h4. Compound Keys
> The discussion thus far has assumed a single key column per table so that we 
> have NDV values for each. The next challenge occurs if the query has compound 
> columns:
> {noformat}
> SELECT …
> FROM T1, T2
> WHERE T1.k1a = T2.k2a AND T1.k1b = T2.k2b
> {noformat}
> We will use the symbol {{j}} to mean a joint (compound) key. The joint key 
> for table {{T1}} is {{T1.j1}} and {{||T1.j1|}} is the cardinality (NDV) of 
> that key.
> In an RDBMS, if an index exists on (k1a, k1b) and (k2a, k2b) then it is 
> possible to determine the joint NDV (which is just the number of index 
> entries.) In HMS, we have the individual NDVs, but not the joint NDV.
> The S&S paper supplies the require assumption: “Independence: Within each 
> relation, values chosen from distinct columns that are involved in join 
> predicates are independent.” This lets us estimate the joint NDV:
> One way to think of this is the the columns form a matrix, say (year, month). 
> Every year has 12 months, so there is no correlation between the two. If we 
> have data for 10 years, we will have 10 * 12 = 120 distinct key values.
> {noformat}
> |T.j| = |T.k1| * |T.k2| * … * |T.kn|
>       = ∏ |T.ki|
> {noformat}
> Where {{∏}} is the product operator. (S&S, section 3.3, calls this the 
> "Multiplicative Rule.")
> Of course, keys are not always independent. Suppose the keys are (state, 
> city). In this case, there are 50 states. But most city names occur in only 
> one state, so NDV(state) * NDV(city) >> NDV(state, city). Still, we know 
> that, regardless of the correlation, we can never have more keys than rows. 
> We can adjust the expression to:
> {noformat}
> |T.j| = min( ∏ |T.ki|, |T| )
> {noformat}
> We can also assume that joins are typically formulated to be as selective as 
> possible. So the assumption may not be too far off. Still, if a file contains 
> addresses, so the key is (state, city, street), and we just the (state, city) 
> prefix, the joint NDV will smaller than the table cardinality. So we need 
> both terms above.
> Adding this to the expression derived above:
> {noformat}
>                              |D'| * |M'|
> |D' ⋈ M'| = ---------------------------------------------      [Equation 6]
>             max(min(∏ |D.fki|, |D|), min(∏ |M.pki|, |M|))
> {noformat}
> Impala does use the multiplicative rule. In 
> {{JoinNode.getFkPkEqJoinConjuncts()}}:
> {code:java}
>       double jointNdv = 1.0;
>       for (EqJoinConjunctScanSlots slots: fkPkCandidate) jointNdv *= 
> slots.rhsNdv();
> {code}
> However, the calculations become muddled in the actual cardinality estimation 
> in {{JoinNode.getFkPkJoinCardinality()}}
> {code:java}
>     long result = -1;
>     for (EqJoinConjunctScanSlots slots: eqJoinConjunctSlots) {
>       // Adjust the join selectivity based on the NDV ratio to avoid 
> underestimating
>       // the cardinality if the PK side has a higher NDV than the FK side.
>       double ndvRatio = 1.0;
>       if (slots.lhsNdv() > 0) ndvRatio = slots.rhsNdv() / slots.lhsNdv();
>       double rhsSelectivity = Double.MIN_VALUE;
>       if (slots.rhsNumRows() > 0) rhsSelectivity = rhsCard / 
> slots.rhsNumRows();
>       long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
>       if (result == -1) {
>         result = joinCard;
>       } else {
>         result = Math.min(result, joinCard);
>       }
>     }
> {code}
> h4. Proposed Fix
> All of this leads to the proposed fix, which can be expressed succinctly use 
> the correct expression: Equation 6 above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to