[
https://issues.apache.org/jira/browse/IMPALA-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731478#comment-16731478
]
Paul Rogers commented on IMPALA-8014:
-------------------------------------
Example incorrect calculation:
{code:sql}
select m.c_custkey, d.o_custkey
from tpch.customer m,
tpch.orders d
where m.c_custkey = d.o_custkey
and m.c_name = 'foo'
{code}
Here, the {{WHERE}} clause predicate is on a unique key as is the join
predicate. There should be only one row that matches the {{WHERE}} predicate,
and that one row joins with the 10 (on average) orders. (Using NDV stats to
work out these assertions.)
Yet, the planner believes that 16 rows will match:
{noformat}
---- PLAN
02:HASH JOIN [INNER JOIN]
| hash predicates: d.o_custkey = m.c_custkey
| fk/pk conjuncts: d.o_custkey = m.c_custkey
| tuple-ids=1,0 row-size=46B cardinality=16
|
|--00:SCAN HDFS [tpch.customer m]
| partitions=1/1 files=1 size=23.08MB
| predicates: m.c_name = 'foo'
| tuple-ids=0 row-size=38B cardinality=1
|
01:SCAN HDFS [tpch.orders d]
partitions=1/1 files=1 size=162.56MB
tuple-ids=1 row-size=8B cardinality=1500000
{noformat}
> Incorrect M:1 (FK/PK) cardinality estimation
> --------------------------------------------
>
> Key: IMPALA-8014
> URL: https://issues.apache.org/jira/browse/IMPALA-8014
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
>
> The math is wrong when we estimate the cardinality of a join we've labeled as
> "FK/PK" (commonly known as many-to-one or M:1.)
> TL;DR: skip to the end for the proposed change.
> Join logic is complex. To ensure that the analysis is sound we work it out
> from first principles, then verify against the Swami & Schiefer, [On the
> Estimation of Join Result
> Sizes|https://pdfs.semanticscholar.org/2735/672262940a23cfce8d4cc1e25c0191de7da7.pdf]
> (S&S). Note especially section 3, Background.
> h4. Definitions
> The following terms are used below:
> * _Relation:_ either a _base table_ or the result of a join. (Relational
> theory says that both are simply relations.) We use an upper case letter,
> typically R, for a relation. E.g.: {{R1}}, {{R2}}
> * _Key:_ A column used in a join. We use a lower case letter for a column,
> and typically use “k” for a key, along with the relation number. E.g. {{k1}}
> and {{ON R1.k1a = R2.k2a}}.
> * _Compound key:_ A join key made up of multiple columns. We use a letter to
> denote each column. E.g. {{k1a}} and {{ON R1.k1a = R2.k2a AND R1.k1b =
> R2.k2b}}
> * _Join:_ the join of two relations. Impala supports the join of two base
> tables (a table-to-table join), or the join of a base table and another join
> (a table-to-join join.) We use the symbol {{⋈}} to denote a join. Impala
> typically uses a hash join.
> * _Relation cardinality:_ the number of rows in the relation, denoted
> {{|R|}}.
> * _Join cardinality:_ the cardinality of the relation produced by the join.
> That is, the join’s output cardinality. Denoted as {{|R1 ⋈ R2|}}.
> * _Key column cardinality:_ the number of unique values in a column, denoted
> {{|k|}}. Also known as the NDV of the column or {{ndv(k)}}.
> In Impala, HMS provides the relation and join key cardinality as table and
> column statistics respectively.
> Since Impala typically uses hash joins, it is helpful to use terminology
> specific to that case:
> * _Probe_ side of a join: the larger side (ideally) of a hash join. Also
> called the “left” side. Is known as {{child(0)}} in the code. Appears at the
> same level as the join in the query plan (which would be to the left if the
> plan were rotated 90 degrees counter-clockwise.)
> * _Build_ side of a join: the smaller side (ideally) of a hash join. Also
> called the “right” side. Is known as {{child(1)}} in the code. Appears
> indented directly under the join in the query plan.
> In Impala:
> * The detail table in a M:1 relationship is always on the probe (left) side
> of a join. Represented as {{child(0)}} in the code.
> * The master table in a M:1 relationship is always on the build (right) side
> of a join. Represented as {{child(1)}} in the code.
> Finally, we also need:
> * The _scan_ of a table: the (possibly smaller) relation produced by
> applying one or more predicates while scanning a table. We are concerned with
> the cardinality of the scan, denoted as {{|R1'|}}. We assume that Impala has
> already used rules (not discussed here) to estimate the cardinality of the
> selection.
> h4. Deriving the Join Formula
> In an RDBMS, a primary key (PK) is a column (or, more typically, a set of
> columns) that uniquely identifies a row in the master table (M). Primary keys
> are generally indexed via a unique index. Since keys are unique:
> {noformat}
> |M.pk| = |M|
> {noformat}
> The detail table (D) forms an M:1 relationship with the master table. Each
> foreign key (FK) in the detail table references one primary key in the
> master table. Because of the M:1 relationship:
> {noformat}
> |D.fk| <= |M| << |D|
> {noformat}
> If we read all rows from both tables, and all primary keys appear as foreign
> keys, then the join cardinality is simply:
> {noformat}
> |D ⋈ M| = |D| [Equation 1]
> {noformat}
> h5. Filtering Master Rows
> Let’s consider what happens when we filter out master (M) table rows
> producing a subset M'. To do so, we make three assumptions:
> * A uniform distribution of foreign keys, (the “Uniformity” assumption in
> the S&S paper cited above),
> * Every primary key is referenced by an equal number of foreign keys
> (implied by the uniformity assumption),
> * The filter results in a random sampling of master rows not correlated with
> the join keys.
> Then the probability of any particular primary key appearing is:
> {noformat}
> p(pk) = |M'| / |M|
> {noformat}
> The result value is a probability (hence the {{p()}} function) given by the
> ratio of selected rows to total table rows (from basic probability theory.)
> The value runs from 0 (no master rows match the scan predicates) to 1 (all
> rows present).
> The revised join cardinality is:
> {noformat}
> |D ⋈ M'| = |D| * p(pk)
>          = |D| * |M'| / |M|     [Equation 2]
> {noformat}
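> As a quick numeric check of Equation 2 (a standalone sketch with made-up
> numbers, not Impala code), suppose {{|D| = 10,000}}, {{|M| = 100}}, and a
> predicate keeps {{|M'| = 10}} master rows:
> {code:java}
> public class Equation2Check {
>   // Equation 2: |D ⋈ M'| = |D| * |M'| / |M|
>   static long joinCard(long detailCard, long masterFiltered, long masterCard) {
>     return detailCard * masterFiltered / masterCard;
>   }
>   public static void main(String[] args) {
>     // 10,000 detail rows, but only 1/10 of master rows survive the filter:
>     System.out.println(joinCard(10_000, 10, 100)); // 1000
>   }
> }
> {code}
> The estimate drops in proportion to the master filter: each surviving master
> row keeps its average share of detail rows.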
> h5. Filtering Detail Rows
> Suppose we instead filter detail rows to produce a new subset D'. Again we
> make some assumptions:
> * Uniform distribution of foreign key values across primary keys.
> * The "Containment" assumption from the S&S paper that the set of foreign
> keys is a (possibly full) subset of the set of primary keys.
> * The filter results in a random sampling of detail rows not correlated with
> the join keys.
> With these assumptions, we can see that if we join D' with the master table
> M, every row that remains in D' will still find a match in M, so:
> {noformat}
> |D' ⋈ M| = |D'|
> {noformat}
> h5. Filtering Both Master and Detail Rows
> Combining the two selection models produces:
> {noformat}
> |D' ⋈ M'| = |D'| * |M'| / |M| [Equation 3]
> {noformat}
> h5. Non-Containment
> The above is based on the S&S Containment assumption: that the set of foreign
> keys is a subset of the set of primary keys. In a normal RDBMS with integrity
> constraints, this is a valid assumption. But, in Big Data, things are messy
> and we can’t actually make this assumption. Fortunately, there is a way to
> loosen the containment assumption.
> Suppose the master table has half the number of keys as the detail table.
> Using the uniformity assumption, half the foreign keys will go unmatched. If
> there are four times as many foreign keys as primary keys, only a quarter of
> the detail rows will find matches. The probability of any one foreign key
> finding a match is:
> {noformat}
> p(match) = |M.pk| / |D.fk|
> {noformat}
> The above applies only if {{|D.fk| > |M.pk|}}; no adjustment is needed if
> {{|D.fk| < |M.pk|}}. We can express this mathematically as:
> {noformat}
> p(match) = |M.pk| / max(|D.fk|, |M.pk|) [Equation 4]
> {noformat}
> Combining this with Equation 3, recalling that {{|M.pk| = |M|}}:
> {noformat}
> |D' ⋈ M'| = |D'| * (|M'| / |M|) * p(match)
>           = |D'| * |M'| / |M| * |M.pk| / max(|D.fk|, |M.pk|)
>           = |D'| * |M'| * (|M| / |M|) / max(|D.fk|, |M.pk|)
>           = |D'| * |M'| / max(|D.fk|, |M|)
> {noformat}
> h5. Complete Expression
> If we rearrange the above we get the complete expression:
> {noformat}
>               |D'| * |M'|
> |D' ⋈ M'| = ----------------     [Equation 5]
>             max(|D.fk|, |M|)
> {noformat}
> h4. The S&S Equation
> Let’s check our answer against the S&S paper cited above by assuming the
> master/detail relationship and removing from our equation the effect of
> scans. The paper’s equation:
> {noformat}
> |R1 ⋈ R2| = min( |k1|, |k2| ) * ( |R1| / |k1| ) * ( |R2| / |k2| )
>           = |R1| * |R2| / max( |k1|, |k2| )
> {noformat}
> If we assume:
> * {{R1 = D}}
> * {{R2 = M}}
> * {{k1 = D.fk}}
> * {{k2 = M.pk}}
> * {{|D.fk| > |M.pk|}}
> We get:
> {noformat}
> |D ⋈ M| = |D| * |M| / max( |D.fk|, |M.pk| )
> {noformat}
> Which is the same as Equation 5 (assuming no selection). All good.
> h4. Join Estimation in Impala
> Join cardinality is estimated in {{JoinNode.computeStats()}}, and especially
> in {{getJoinCardinality()}}, which implements the infamous "FK/PK"
> heuristics. "The FK/PK detection logic is based on the assumption that most
> joins are FK/PK. ... In the absence of relevant stats, we assume FK/PK with a
> join selectivity of 1." As noted above, M:1 is the more common name for "FK/PK".
> During execution, the code first ensures that the larger table is on the
> probe (left) side. So, {{getJoinCardinality()}} assumes that L is the
> detail table with the foreign key (FK), while the smaller build-side table is
> the master table with the primary key (PK). That is L:R is M:1, joined on
> {{L.<FK> = R.<PK>}}.
> Next, let's translate the code's naming to the terms defined above.
> A comment in the code explains the cardinality estimation model:
> {noformat}
> cardinality = |child(0)| * (|child(1)| / |R|) * (NDV(R.d) / NDV(L.c))
> {noformat}
> We can check how the code implements the expression:
> {code:java}
> ndvRatio = slots.rhsNdv() / slots.lhsNdv();
> double rhsSelectivity = rhsCard / slots.rhsNumRows();
> long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
> {code}
> So, the comment accurately describes the code.
> Let’s translate the code’s expression to the terms defined above:
> {noformat}
>                  |L'|           |R'|
>               ----^-----    ----^-----
> cardinality = |child(0)| * (|child(1)| / |R|) * (NDV(R.d) / NDV(L.c))
> {noformat}
> Using:
> * {{cardinality}} here means the cardinality of the join, hence {{|D' ⋈ M'|}}
> * {{L}} = Left = Probe table (before scan) = Detail table, hence {{D}}
> * {{R}} = Right = Build table (before scan) = Master table, hence {{M}}
> * {{child(0)}} is the output of the scan of the left table {{D}}, hence
> {{D'}}
> * {{child(1)}} is the output of the scan of the right table {{M}}, hence
> {{M'}}
> * {{R.d}} is the join key of the R (master) table, hence {{M.pk}}
> * {{L.c}} is the join key of the L (detail) table, hence {{D.fk}}
> * NDV is, as we said, the cardinality of the key column, hence {{|k|}}.
> Gives:
> {noformat}
> |D' ⋈ M'| = |D'| * (|M'| / |M|) * (|M.pk| / |D.fk|). [Sic]
> {noformat}
> The code expression is actually overly complex. Recall that, by definition,
> {{|M| = |M.pk|}}:
> {noformat}
> |D' ⋈ M'| = |D'| * (|M'| / |M|) * (|M.pk| / |D.fk|)
>           = |D'| * (|M'| / |D.fk|) * (|M.pk| / |M|)
>           = |D'| * (|M'| / |D.fk|) * (|M| / |M|)
>           = |D'| * |M'| / |D.fk|
> {noformat}
> As we’ll see, this formula is *NOT* correct.
> h4. Bug in the Code
> Compare Equation 5 to the one in the code:
> {noformat}
> |D' ⋈ M'| = |D'| * |M'| / |D.fk|                — Code
> |D' ⋈ M'| = |D'| * |M'| / max(|D.fk|, |M.pk|)   — Derived
> {noformat}
> Notice the code contains a severe error: it divides by the foreign key
> cardinality, even when that cardinality is *smaller* than that of the primary
> key, causing the join cardinality to be larger than the probe cardinality,
> which cannot occur if we assume a master/detail (PK/FK) relationship.
> To ensure we are reading all this correctly, let's work out an example. Let's
> assume that our detail table has 10K rows, our master table 100 rows, and
> that only 25 of the master records are referenced. Let's assume the scans
> remove no rows. That is:
> * {{|D'| = |D| = 10,000}}
> * {{|M'| = |M| = |M.pk| = 100}}
> * {{|D.fk| = 25}}
> Using Equation 5:
> {noformat}
> |D' ⋈ M'| = |D'| * |M'| / max( |D.fk|, |M.pk| )
>           = |D| * |M| / max( |D.fk|, |M.pk| )
>           = 10,000 * 100 / max( 100, 25 )
>           = 10,000 * 100 / 100
>           = 10,000
> {noformat}
> Makes sense: all 10K detail rows match some master row, though 3/4 of the
> master rows are unreferenced.
> Using the formula from the code:
> {noformat}
> cardinality = |D| * |M| / |D.fk|
>             = 10,000 * 100 / 25
>             = 10,000 * 4
>             = 40,000
> {noformat}
> As a result, the fewer primary keys are referenced, the more we expect the
> input rows to be replicated and the greater the (mis)estimated output
> cardinality. This, of course, leads to poor join selection and thus a poor
> plan.
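> The two estimates above can be reproduced side by side (a standalone sketch of
> both formulas with hypothetical names, not the Impala code):
> {code:java}
> public class FkPkEstimateCompare {
>   // Formula currently in the code: divides by |D.fk| unconditionally.
>   static long codeEstimate(long detailCard, long masterCard, long fkNdv) {
>     return (long) Math.ceil((double) detailCard * masterCard / fkNdv);
>   }
>   // Derived Equation 5: divides by max(|D.fk|, |M|).
>   static long derivedEstimate(long detailCard, long masterCard,
>       long fkNdv, long masterRows) {
>     return (long) Math.ceil(
>         (double) detailCard * masterCard / Math.max(fkNdv, masterRows));
>   }
>   public static void main(String[] args) {
>     // |D| = 10,000, |M| = 100, |D.fk| = 25:
>     System.out.println(codeEstimate(10_000, 100, 25));         // 40000
>     System.out.println(derivedEstimate(10_000, 100, 25, 100)); // 10000
>   }
> }
> {code}
> With only a quarter of the primary keys referenced, the code's estimate is 4x
> the derived (and correct) one.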
> Let's check this empirically by printing out the {{ndvRatio}} (see code
> snippet above) when running some queries:
> {noformat}
> ndv ratio = 2.0977582147609786
> ndv ratio = 3.3183831043488135
> ndv ratio = 2.0977582147609786
> ndv ratio = 2.0977582147609786
> {noformat}
> So, this seems like a real bug.
> h4. Compound Keys
> The discussion thus far has assumed a single key column per table so that we
> have an NDV value for each. The next challenge occurs if the query joins on a
> compound key:
> {noformat}
> SELECT …
> FROM T1, T2
> WHERE T1.k1a = T2.k2a AND T1.k1b = T2.k2b
> {noformat}
> We will use the symbol {{j}} to mean a joint (compound) key. The joint key
> for table {{T1}} is {{T1.j1}} and {{|T1.j1|}} is the cardinality (NDV) of
> that key.
> In an RDBMS, if an index exists on (k1a, k1b) and (k2a, k2b) then it is
> possible to determine the joint NDV (which is just the number of index
> entries.) In HMS, we have the individual NDVs, but not the joint NDV.
> The S&S paper supplies the required assumption: “Independence: Within each
> relation, values chosen from distinct columns that are involved in join
> predicates are independent.” This lets us estimate the joint NDV:
> One way to think of this is that the columns form a matrix, say (year, month).
> Every year has 12 months, so there is no correlation between the two. If we
> have data for 10 years, we will have 10 * 12 = 120 distinct key values.
> {noformat}
> |T.j| = |T.k1| * |T.k2| * … * |T.kn|
> = ∏ |T.ki|
> {noformat}
> Where {{∏}} is the product operator. (S&S, section 3.3, calls this the
> "Multiplicative Rule.")
> Of course, keys are not always independent. Suppose the keys are (state,
> city). In this case, there are 50 states. But most city names occur in only
> one state, so NDV(state) * NDV(city) >> NDV(state, city). Still, we know
> that, regardless of the correlation, we can never have more keys than rows.
> We can adjust the expression to:
> {noformat}
> |T.j| = min( ∏ |T.ki|, |T| )
> {noformat}
> We can also assume that joins are typically formulated to be as selective as
> possible, so the assumption may not be too far off. Still, if a file contains
> addresses, so that the key is (state, city, street), and we join on just the
> (state, city) prefix, the joint NDV will be smaller than the table
> cardinality. So we need both terms above.
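> The clamped Multiplicative Rule can be sketched in a few lines (hypothetical
> names, not Impala's code):
> {code:java}
> public class JointNdv {
>   // |T.j| = min(∏ |T.ki|, |T|): independence assumption, clamped by row count.
>   static double jointNdv(double[] keyNdvs, long rowCount) {
>     double product = 1.0;
>     for (double ndv : keyNdvs) product *= ndv;  // Multiplicative Rule
>     return Math.min(product, rowCount);         // never more keys than rows
>   }
>   public static void main(String[] args) {
>     // (year, month): independent keys, 10 * 12 = 120 distinct pairs.
>     System.out.println(jointNdv(new double[] {10, 12}, 100_000));    // 120.0
>     // (state, city): correlated keys; the clamp to row count kicks in.
>     System.out.println(jointNdv(new double[] {50, 18_000}, 40_000)); // 40000.0
>   }
> }
> {code}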
> Adding this to the expression derived above:
> {noformat}
>                              |D'| * |M'|
> |D' ⋈ M'| = ---------------------------------------------   [Equation 6]
>             max(min(∏ |D.fki|, |D|), min(∏ |M.pki|, |M|))
> {noformat}
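> Putting the pieces together, Equation 6 could be implemented along these lines
> (a sketch under the assumptions above, with hypothetical names; not a patch
> against the Impala code):
> {code:java}
> public class Equation6Check {
>   // min(∏ |T.ki|, |T|): joint NDV under independence, clamped by row count.
>   static double clampedJointNdv(double[] keyNdvs, long rowCount) {
>     double product = 1.0;
>     for (double ndv : keyNdvs) product *= ndv;
>     return Math.min(product, rowCount);
>   }
>   // Equation 6: |D' ⋈ M'| = |D'| * |M'| / max(joint FK NDV, joint PK NDV)
>   static long joinCard(long detailCard, long masterCard,
>       double[] fkNdvs, long detailRows, double[] pkNdvs, long masterRows) {
>     double divisor = Math.max(clampedJointNdv(fkNdvs, detailRows),
>                               clampedJointNdv(pkNdvs, masterRows));
>     return (long) Math.ceil(detailCard * masterCard / divisor);
>   }
>   public static void main(String[] args) {
>     // Single-column key, |D| = 10,000, |M| = 100, |D.fk| = 25, |M.pk| = 100:
>     System.out.println(joinCard(10_000, 100,
>         new double[] {25}, 10_000, new double[] {100}, 100)); // 10000
>   }
> }
> {code}
> Unlike the current code, a low foreign-key NDV can no longer inflate the
> estimate above the probe-side cardinality.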
> Impala does use the multiplicative rule. In
> {{JoinNode.getFkPkEqJoinConjuncts()}}:
> {code:java}
> double jointNdv = 1.0;
> for (EqJoinConjunctScanSlots slots: fkPkCandidate) jointNdv *= slots.rhsNdv();
> {code}
> However, the calculations become muddled in the actual cardinality estimation
> in {{JoinNode.getFkPkJoinCardinality()}}:
> {code:java}
> long result = -1;
> for (EqJoinConjunctScanSlots slots: eqJoinConjunctSlots) {
>   // Adjust the join selectivity based on the NDV ratio to avoid
>   // underestimating the cardinality if the PK side has a higher NDV
>   // than the FK side.
>   double ndvRatio = 1.0;
>   if (slots.lhsNdv() > 0) ndvRatio = slots.rhsNdv() / slots.lhsNdv();
>   double rhsSelectivity = Double.MIN_VALUE;
>   if (slots.rhsNumRows() > 0) rhsSelectivity = rhsCard / slots.rhsNumRows();
>   long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio);
>   if (result == -1) {
>     result = joinCard;
>   } else {
>     result = Math.min(result, joinCard);
>   }
> }
> {code}
> h4. Proposed Fix
> All of this leads to the proposed fix, which can be expressed succinctly:
> use the correct expression, Equation 6 above.