[
https://issues.apache.org/jira/browse/IMPALA-8048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zoltán Borók-Nagy updated IMPALA-8048:
--------------------------------------
Target Version: Impala 5.0.0
> Improve join cardinality estimation: urn model, NDV tracking, etc.
> ------------------------------------------------------------------
>
> Key: IMPALA-8048
> URL: https://issues.apache.org/jira/browse/IMPALA-8048
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Priority: Major
>
> Work is underway on a number of JIRA tickets to improve cardinality
> estimates. That work is constrained by the possible need to back-port to
> prior releases. As a result, the changes are made within the existing context
> to minimize the impact.
> The current model makes a number of naive assumptions, however, that should
> be addressed in a second batch of changes which will entail a wider code
> impact.
> h4. Adopt the Urn Model for NDV Estimation.
> Suppose we have a table alumni(name, sex, class) with values such as:
> {noformat}
> John Smith, M, 2008
> Jane Doe, F, 1993
> ...
> {noformat}
> We have 50 years of data, 1000 rows per year, or 50K rows. We have these
> stats:
> {noformat}
> |alumni| = 50K
> |name| = 49K
> |sex| = 2
> |class| = 50
> {noformat}
> We have the following query which fills in the graduation date for each class:
> {code:sql}
> select * from alumni, grad_dates where sex='F' where alumni.class =
> grad_dates.class
> {code}
> Focusing just on the alumni table, how many classes will be available to
> match? That is, what is {{|class'|}}, the NDV of the name field after
> accounting for the affect of the predicate {{sex='F'}}.
> Today we work it out with a linear model as follows:
> {code}
> sel(sex = 'F') = 1/|F| = 1/2 = 0.5
> |sex'| = |sex| * sel(sex = 'F') = 2 * 0.5 = 1
> |class'| = |class| * sel(sex = 'F') = 50 * 0.5 = 25
> {code}
> The math works for the {{sex}} field: the correct adjusted NDV is 1.
> What about for {{class}}? Since the predicate eliminated half the rows, it
> eliminated half the class values. But, this can't be right. Surely women
> graduated in all classes. What went wrong?
> The problem is the linear assumption. As shown in the [SwamiI and
> Schiefer|https://pdfs.semanticscholar.org/2735/672262940a23cfce8d4cc1e25c0191de7da7.pdf]
> paper, Section 5, the correct estimation technique is the urn model. See the
> paper for details. Using that model:
> {noformat}
> |x'| = (1 - (1 - 1/|x|)^|T')
> |alumni'| = |alumni| * sel(sex = 'F') = 50K * .5 = 25K
> |class'| = |class| * (1 - (1 - 1/50) ^ 25K) = 50
> {noformat}
> That is, as the cardinality of the selected table grows larger, the
> probability reaches 1 that other, non-correlated values will still appear.
> This, though we remove half the rows, all the classes are still represented.
> h4. Per-Tuple Column NDV Tracking
> At present, after the current round of changes, we use a linear model to
> estimate column NDV after filtering, and use the same model for all columns.
> If we adopt the urn model, then we must treat columns separately. In the
> above, we do *not* want to apply the urn model to the {{sex}} column. Why? We
> already know its cardinality from the filter predicate. Don't want to replace
> it with an estimated urn-model value. This problem is more acute if you
> consider a range predicate, such as those used on partitions: {{class >
> 2009}}.
> To make the above work, we have to track NDV per column. That is, the scan
> node must provide a list of columns and their NDVs after scanning. Columns
> mentioned in a predicate have their NDVs estimated from selectivity. All
> other columns have their NDVs estimated from the urn model. (There are
> several ways to implement this; the point is that some columns must be
> singled out for special treatment.)
> h4. Proper Join-to-Table Join Column NDVs
> The NDV adjustment model says that, to compute the join cardinality, we need
> the adjusted column cardinality (NDV). When joining one table to another, it
> is clear how to adjust the column NDVs for each table: each is done according
> to the rules spelled out above.
> A complexity arises, however, when we want to join three tables: we have ((A
> ⋈ B) ⋈ C). How do we adjust the NDVs for the columns created by the (A ⋈ B)
> join? If we simply adjust the NDV of table columns using a common selectivity
> (as done in the simple linear model), then we are correct for the columns
> from one table, but wrong for columns from the other. Why? The two table had
> different selectivities applied, we can't reduce them to a common number.
> The solution is the per-column adjusted NDV tracking: we'd know to apply one
> set of adjustments for columns from the left table, another for the right.
> This requires additional data structures in each plan node.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]