[
https://issues.apache.org/jira/browse/IMPALA-7601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers updated IMPALA-7601:
--------------------------------
Summary: Define better default selectivity values (was: Define a-priori
selectivity and NDV values)
> Define better default selectivity values
> ----------------------------------------
>
> Key: IMPALA-7601
> URL: https://issues.apache.org/jira/browse/IMPALA-7601
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 2.12.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
>
> Impala makes extensive use of table stats during query planning. For example,
> the NDV (number of distinct values) is used to compute _selectivity_, the
> degree of reduction (also called the _reduction factor_) provided by a
> predicate. For example:
> {noformat}
> SELECT * FROM t WHERE t.a = 10
> {noformat}
> If we know that {{t.a}} has an NDV=100, then we can predict (given a uniform
> distribution of values), that the above query will pick out one of these 100
> values, and that the reduction factor is 1/100 = 0.01. Thus the selectivity
> of the predicate {{t.a = 10}} is 0.01.
> For predicate operators other than equality ({{=}}), Impala uses 0.1 for the
> estimated selectivity. There are two observations. First, Impala does use
> a-priority selectivity estimates for these operators (NDV does not help with
> inequality, for example.) Second, the 0.01 is a bit too much of a
> one-size-fits-all estimate.
> h4. Selectivity Without Stats
> All this is good. But, what happens if statistics are not available for table
> {{t}}? How are we to know the selectivity of the predicate?
> It could be that {{t.a}} contains nothing but the value 10, so there is no
> reduction at all. I could be that {{t.a}} contains no values of 10, so the
> reduction is total, no rows are returned. The classic solution is to assume
> that the user put the predicate in the query for the purpose of subsetting
> the data. The classic value, shown in the [Ramakrishnan and Gehrke
> book|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> is to assume a 90% reduction, or a selectivity of 0.1. Indeed this value is
> seen in
> [Impala|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/Expr.java]:
> {noformat}
> // To be used where we cannot come up with a
> // better estimate (selectivity_ is -1).
> public static double DEFAULT_SELECTIVITY = 0.1;
> {noformat}
> As it turns out, however, the actual implementation is a bit more complex, as
> hinted at by the above comment. Impala relies on stats. Given stats,
> specifically the NDV, we compute selectivity as:
> {noformat}
> selectivity = 1 / ndv
> {noformat}
> What happens if there is no available NDV? In the case, we skip the
> selectivity calculation and leave it at a special default value of -1.0,
> which seems to indicate "unknown". See
> [{{BinaryPredicate.analyzeImpl()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java].
> Later, when we use the selectivity to calculate reduction factors, we simply
> skip any node with a selectivity of -1. You can see that in
> [{{PlanNode.computeCombinedSelectivity()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/PlanNode.java].
> The result is that Impala is a bit more strict than classic DB optimizers. If
> stats are present, they are used. If stats are not present, Impala assumes
> that predicates have no effect.
> This ticket proposal a number of interrelated changes to add a-priori (before
> observation) defaults for selectivity and NDV based on classic DB practice.
> h4. Proposal: Add A-priori Selectivity Values
> But, we said earlier that users include a predicate because they expect it to
> do something. So, we are actually discarding the (albeit vague) information
> that the user provided.
> This is why many optimizers go ahead and assume a default 0.1 reduction
> factor for equality predicates even if no stats are available. The first
> proposal of this ticket is to use that default reduction factor even if no
> stats are present. This says that some reduction will occur, but, to be
> conservative, we assume not a huge reduction.
> h4. Proposal: Add Selectivity for All Predicate Operators
> As present, Impala computes reduction factors only for equality nodes. (See
> IMPALA-7560.) The book suggests rule-of-thumb estimates for other operators:
> * {{!=}} - 0.1
> * {{<}}, {{<=}}, {{>}}, {{>=}} - 0.3
> * {{BETWEEN}} - 0.25
> Over in the Drill project, DRILL-5254 attempted to work out better estimates
> based on math and probability. However, the conclusion there was that,
> without NDV and histograms, there is more information in the user's intent
> than in the math. That is, if the user writes {{WHERE t.a != 10}}, there is a
> conditional probability that the user believes that this is a highly
> restrictive predicate, especially on big data. So, the reduction factor
> (which is a probability) is the same for {{=}} and {{!=}} in the absence of
> information. The same reasoning probably led to the rule-of-thumb values in
> the
> [R&G|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> book.
> So, the second proposal is that Impala use the classic numbers for other
> operators when no stats are available.
> h4. Proposal: Use Stats-Based Selectivity Estimates When Available
> If stats are available, then we can "run the numbers" and get better
> estimates:
> * {{p(a = x)}} = 1 / NDV
> * {{p(a != x)}} = {{1 - p(a = x)}} = 1 - 1 / NDV
> So, the third proposal is to use the above math for the {{!=}} case.
> h4. Proposal: Use Defaults when Stats are Unavailable or Not Useful
> As it turns out, without histograms, the NDV by itself gives us no
> information by which to compute reduction for an inequality. So the, classic
> a-priori defaults are the still best guess.
> This leads to the fourth proposal: that the classic defaults are used unless
> stats provides additional information to use instead. Said another way,
> retire the current selectivity = -1 ("undefined") concept: we will always
> have a selectivity, even if just a rule-of-thumb one. That is, some
> information is better than none.
> h4. Proposal: Define an A-priori NDV Estimate
> Impala uses NDV for a number of purposes. If no stats are available, Impala
> assumes no NDV. Given the assumed reduction factor, we can guess an NDV =
> 1/0.1 = 10. Depending on where NDV is used (need to research this), we might
> want to choose some other value, perhaps 100. So, the fifth proposal is to
> assume an a-priori (before observation) NDV value, with the actual default
> value TBD.
> h4. Proposal: Separate the NDV=0 and Unknown NDV cases
> As described in IMPALA-7310, Impala does not currently handle the case of a
> table with stats, but a column in the table contains only NULL values. The
> stats mechanism defines HDV as "number of distinct non-null values". So, a
> table full of nulls has NDV = 0. Unfortunately, if no stats are available at
> all, we also have NDV = 0.
> So, the sixth proposal is to separate these cases. If we can tell that a
> column is a nullable type, assume that the actual NDV is (stats NDV + 1).
> (This is, in fact the NDV calc used by Impala itself when computing NDV for
> expressions. See
> [{{ExprNdvTest}}|https://github.com/apache/impala/blob/master/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java].
> To avoid impact to existing tests, apply this rule only when it makes an
> impact, when the NDV value is, say, less than 10. (No reason to count the
> null value if NDV is large.) So, a column with all nulls will have an
> (adjusted) NDV = 1.
> Then, as noted above, if no stats exist, use the a-priori NDV.
> h4. References
> * [Cost Estimation
> Overview|https://courses.cs.washington.edu/courses/cse544/09wi/lecture-notes/lecture8/lecture8.pdf]
> * [Reduction
> factors|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> from the classic Ramakrishnan and Gehrke book.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]