[
https://issues.apache.org/jira/browse/IMPALA-7601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zoltán Borók-Nagy updated IMPALA-7601:
--------------------------------------
Target Version: Impala 5.0.0
> Improve cardinality and selectivity estimates
> ---------------------------------------------
>
> Key: IMPALA-7601
> URL: https://issues.apache.org/jira/browse/IMPALA-7601
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 3.0
> Reporter: Paul Rogers
> Priority: Major
>
> Impala makes extensive use of table stats during query planning. For example,
> the NDV (number of distinct values) is used to compute _selectivity_, the
> degree of reduction (also called the _reduction factor_) provided by a
> predicate. Consider:
> {noformat}
> SELECT * FROM t WHERE t.a = 10
> {noformat}
> If we know that {{t.a}} has an NDV of 100, then we can predict (given a
> uniform distribution of values) that the above query will pick out one of
> these 100 values, and that the reduction factor is 1/100 = 0.01. Thus the
> selectivity of the predicate {{t.a = 10}} is 0.01.
> Selectivity is then used to compute _cardinality_, the number of rows in some
> relation (set of tuples). Cardinality is used to pick among plan options,
> such as which relation to put on the probe vs. build side of a join.
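> For example, if {{t}} has, say, 1,000,000 rows (an illustrative number, not
> from this ticket), the predicate above yields:
> {noformat}
> cardinality = 1,000,000 rows * selectivity 0.01 = 10,000 rows
> {noformat}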
> This ticket explains several cases in which Impala fails to produce a
> cardinality estimate, and proposes solutions.
> h4. Problem Summary
> The goal of a planner is to estimate the cardinality of various relations
> (sets of tuples) in order to choose a good plan (which relation goes on the
> build side of a join?) and to make memory estimates (how much memory is
> needed for a sort?). When planning, even crude estimates are fine, as the
> planner only chooses among a discrete set of options: the smaller table goes
> on the build side of a join; it does not matter _how much_ smaller one table
> is than the other.
> It turns out, however, that Impala is quite finicky: it refuses to render a
> cardinality estimate if certain information is missing. This means that,
> rather than use a poor estimate, Impala uses no estimate at all. Rather than
> produce a crude attempt at a plan, Impala flies blind in these cases.
> You can see this in
> [{{PlanNode.computeCombinedSelectivity()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/PlanNode.java].
> The result is that Impala is a bit more strict than classic DB optimizers. If
> stats are present, they are used. If stats are not present, Impala flies
> blind.
> This ticket proposes a number of interrelated changes to add a-priori (before
> observation) defaults for selectivity and NDV, based on classic DB practice.
> Also, some of our existing estimates are overly simplified, as discussed
> below.
> h4. A Bit of Math
> Selectivity is a probability: the probability that a Boolean expression
> evaluates to true. That is:
> {noformat}
> selectivity(c = x) = p((c = x) = true) = p(c = x)
> {noformat}
> Where {{c}} is a column and {{x}} is a constant.
> Knowing this can help pick good selectivity values, and makes the following
> discussion a bit simpler.
> h4. Use NDV to Compute {{p(c != x)}}
> Impala uses NDV to compute {{p(c = x)}}:
> {noformat}
> p(c = x) = 1 / NDV(c)
> {noformat}
> As described in IMPALA-7560, Impala uses a selectivity of 0.1 for {{p(c !=
> x)}}. There is, however, a mathematical relationship between these operators
> we can exploit:
> {noformat}
> p(c != x) = 1 - p(c = x)
> {noformat}
> That is, if we have {{NDV(c)}} and can compute {{p(c = x)}}, we can also
> compute {{p(c != x)}}.
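> As a minimal sketch (the method names here are illustrative, not Impala's
> actual API), both estimates follow from a single NDV value:
> {noformat}
> // Sketch: derive both estimates from one NDV value. Illustrative only.
> // Assumes ndv > 0 (i.e., stats exist and the column has non-null values).
> static double eqSelectivity(long ndv) {
>   return 1.0 / ndv;                  // p(c = x) = 1 / NDV(c)
> }
>
> static double neSelectivity(long ndv) {
>   return 1.0 - eqSelectivity(ndv);   // p(c != x) = 1 - p(c = x)
> }
> {noformat}
> With {{NDV(c)}} = 100, this gives {{p(c = x)}} = 0.01 and {{p(c != x)}} =
> 0.99, rather than the current fixed 0.1.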
> h4. Refine Default Selectivity Values
> Further, Impala assumes a selectivity of 0.1 for all other relational
> operators except equality. See
> [Impala|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/Expr.java]:
> {noformat}
> // To be used where we cannot come up with a
> // better estimate (selectivity_ is -1).
> public static double DEFAULT_SELECTIVITY = 0.1;
> {noformat}
> There is another mathematical relationship between these operators we can
> exploit to refine some of these estimates:
> {noformat}
> p(c != x) = p(c < x) + p(c > x)
> {noformat}
> As it turns out, we don't need the NDV to estimate the selectivity of an
> inequality. We can simply observe that the inequalities roughly divide the
> data into two sets, of which the user picks one. The current default of 0.1
> is probably too low, and 0.5 is probably too high. The [Ramakrishnan and
> Gehrke
> book|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> suggests rule-of-thumb estimates (sketched in code after the list below):
> * {{p(c < x)}}, {{p(c <= x)}}, {{p(c > x)}}, {{p(c >= x)}} - 0.3
> * {{p(c BETWEEN x AND y)}} - 0.25
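> Taken together with the existing default, these rules of thumb amount to a
> simple lookup by operator. The code below is an illustrative sketch, not
> Impala's actual code:
> {noformat}
> // Sketch: rule-of-thumb default selectivities when stats are missing.
> // Illustrative only; the values come from the R&G cost formulas above.
> static double defaultSelectivity(String op) {
>   switch (op) {
>     case "<": case "<=": case ">": case ">=":
>       return 0.3;   // range predicates
>     case "BETWEEN":
>       return 0.25;
>     default:
>       return 0.1;   // the existing Expr.DEFAULT_SELECTIVITY
>   }
> }
> {noformat}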
> h4. Default Selectivity for {{p(c = x)}} and {{p(c != x)}} Without Stats
> All this is good. But, what happens if statistics are not available for table
> {{t}}? How are we to know the selectivity of the predicate?
> Today, Impala simply refuses to pick a selectivity for {{p(c = x)}} if there
> are no stats. See
> [{{BinaryPredicate.analyzeImpl()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java].
> However, even without stats, Impala continues to use its defaults for all
> other relational operators. While it is true that, without NDV, we can't be
> certain of the selectivity of {{p(c = x)}}, it is also true that we are happy
> to guess for all other operators.
> The [Ramakrishnan and Gehrke
> book|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> shows that the standard choice is to assume a selectivity of 0.1.
> Over in the Drill project, DRILL-5254 attempted to work out better estimates
> for other operators based on math and probability. However, the conclusion
> there was that, without NDV and histograms, there is more information in the
> user's intent than in the math. That is, if the user writes {{WHERE t.a !=
> 10}}, it is quite likely that the user believes this to be a highly
> restrictive predicate, especially on big data. So, the reduction
> factor (which is a probability) is the same for {{=}} and {{!=}} in the
> absence of information. The same reasoning probably led to the rule-of-thumb
> values in the
> [R&G|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> book.
> So, if no stats are available, use a default number for {{p(c != x)}} such as
> 0.1 or 0.7. (The larger number is probably better.)
> As a result of this change, we will always have a selectivity, even if just a
> rule-of-thumb one. That is, some information is better than none.
> h4. Special Case Boolean Columns
> There is one special case it may be worth exploiting. If a column is Boolean,
> then we know that it can have at most three values (true, false, null). So,
> even without stats, we can estimate {{NDV(c)}} = 3. This allows us to get
> better estimates for {{p(c = x)}} and {{p(c != x)}} than we'd get just using
> the defaults.
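> A sketch of the idea (illustrative only):
> {noformat}
> // Sketch: a-priori NDV for a Boolean column, even with no stats.
> // At most true, false, and null, per the reasoning above.
> static final long BOOLEAN_NDV = 3;
>
> // Then, from the earlier formulas:
> //   p(c = x)  = 1 / 3   = 0.33
> //   p(c != x) = 1 - 1/3 = 0.67
> {noformat}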
> h4. Separate the NDV=0 and Unknown NDV Cases
> As described in IMPALA-7310, Impala does not currently handle the case of a
> table with stats, but a column in the table contains only NULL values. The
> stats mechanism defines NDV as "number of distinct non-null values". So, a
> table full of nulls has NDV = 0. Unfortunately, if no stats are available at
> all, we also have NDV = 0.
> IMPALA-7310 will fix this issue, removing one more case in which Impala does
> not produce a cardinality estimate.
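> A sketch of keeping the two cases distinct (the -1 sentinel mirrors the
> existing "undefined cardinality" convention; names are illustrative):
> {noformat}
> // Sketch: distinguish "no stats" from "all values are NULL".
> // Illustrative only; see IMPALA-7310 for the actual fix.
> static final long NDV_UNKNOWN = -1;  // no stats available
>
> static boolean hasNdvStats(long ndv) { return ndv != NDV_UNKNOWN; }
> static boolean isAllNulls(long ndv)  { return ndv == 0; }
> {noformat}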
> h4. Rationalize the Planner's and Stats' NDV Definitions
> Related to the above issue is the fact that the planner and the stats
> mechanism use different definitions for NDV.
> * Planner: NDV includes nulls. This is seen when computing the NDV for
> constant expressions.
> * NDV function, and thus stats: NDV does not include nulls.
> Currently, the planner takes the stats mechanism's "without nulls" NDV and
> mixes it with the planner's "with nulls" definition. IMPALA-7310 proposes a
> way to convert the stats definition to the planner's definition.
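> Assuming stats also record a null count, the conversion could be as simple
> as the sketch below (illustrative; IMPALA-7310 describes the actual
> proposal):
> {noformat}
> // Sketch: convert the stats NDV (non-null values only) to the
> // planner's NDV (which counts NULL as one more distinct value).
> static long plannerNdv(long statsNdv, long numNulls) {
>   return statsNdv + (numNulls > 0 ? 1 : 0);
> }
> {noformat}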
> h4. Estimate Row Counts
> The above will go a long way to ensure that the planner always renders a
> cardinality estimate. But, one hole remains. IMPALA-7608 says that
> cardinality estimates are undefined if stats are unavailable because we don't
> know row counts. IMPALA-7608 explains a useful, common rule-of-thumb. Add
> that and we'd have covered all situations in which Impala currently refuses
> to produce a cardinality estimate.
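> A common rule of thumb of this kind is to divide the scanned bytes by an
> estimated row width; a sketch, assuming that is the shape of the IMPALA-7608
> proposal:
> {noformat}
> // Sketch: estimate row count from data size when stats are missing.
> // Assumes the rule of thumb is of this form; see IMPALA-7608 for details.
> static long estimateRowCount(long totalBytes, long estRowWidthBytes) {
>   return Math.max(1, totalBytes / Math.max(1, estRowWidthBytes));
> }
> {noformat}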
> h4. Handle Large Cardinality
> IMPALA-7604 notes that we use a {{long}} to hold a cardinality estimate. This
> can overflow, causing incorrect estimates. As we make the above improvements,
> we will put more confidence in the cardinality estimates, so we need to fix
> the overflow: wrong numbers in extreme cases undermine the value of the
> estimates.
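> One standard fix is saturating arithmetic, so an overflowing estimate pins at
> {{Long.MAX_VALUE}} instead of wrapping negative. A sketch (not Impala's
> actual code):
> {noformat}
> // Sketch: saturating multiply for cardinality estimates. On overflow,
> // pin at Long.MAX_VALUE rather than wrapping around.
> static long saturatingMultiply(long a, long b) {
>   try {
>     return Math.multiplyExact(a, b);  // throws ArithmeticException on overflow
>   } catch (ArithmeticException e) {
>     return Long.MAX_VALUE;            // assumes non-negative inputs
>   }
> }
> {noformat}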
> h4. Remove Special Handling for Undefined Selectivity and Cardinality
> Once all of the above are available, every plan node will compute a
> cardinality. The existing code to handle cardinality = -1 (undefined) can be
> removed. Tests can be updated to enforce that all plans will have an
> estimated cardinality.
> h4. References
> * [Cost Estimation
> Overview|https://courses.cs.washington.edu/courses/cse544/09wi/lecture-notes/lecture8/lecture8.pdf]
> * [Reduction
> factors|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> from the classic Ramakrishnan and Gehrke book.