[ 
https://issues.apache.org/jira/browse/IMPALA-7601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated IMPALA-7601:
--------------------------------
    Summary: Define better default selectivity values  (was: Define a-priori 
selectivity and NDV values)

> Define better default selectivity values
> ----------------------------------------
>
>                 Key: IMPALA-7601
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7601
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.12.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>
> Impala makes extensive use of table stats during query planning. For example, 
> the NDV (number of distinct values) is used to compute _selectivity_, the 
> degree of reduction (also called the _reduction factor_) provided by a 
> predicate. For example:
> {noformat}
> SELECT * FROM t WHERE t.a = 10
> {noformat}
> If we know that {{t.a}} has an NDV=100, then we can predict (given a uniform 
> distribution of values), that the above query will pick out one of these 100 
> values, and that the reduction factor is 1/100 = 0.01. Thus the selectivity 
> of the predicate {{t.a = 10}} is 0.01.
> For predicate operators other than equality ({{=}}), Impala uses 0.1 for the 
> estimated selectivity. There are two observations. First, Impala does use 
> a-priority selectivity estimates for these operators (NDV does not help with 
> inequality, for example.) Second, the 0.01 is a bit too much of a 
> one-size-fits-all estimate.
> h4. Selectivity Without Stats
> All this is good. But, what happens if statistics are not available for table 
> {{t}}? How are we to know the selectivity of the predicate?
> It could be that {{t.a}} contains nothing but the value 10, so there is no 
> reduction at all. I could be that {{t.a}} contains no values of 10, so the 
> reduction is total, no rows are returned. The classic solution is to assume 
> that the user put the predicate in the query for the purpose of subsetting 
> the data. The classic value, shown in the [Ramakrishnan and Gehrke 
> book|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
>  is to assume a 90% reduction, or a selectivity of 0.1. Indeed this value is 
> seen in 
> [Impala|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/Expr.java]:
> {noformat}
>   // To be used where we cannot come up with a
>   // better estimate (selectivity_ is -1).
>   public static double DEFAULT_SELECTIVITY = 0.1;
> {noformat}
> As it turns out, however, the actual implementation is a bit more complex, as 
> hinted at by the above comment. Impala relies on stats. Given stats, 
> specifically the NDV, we compute selectivity as:
> {noformat}
> selectivity = 1 / ndv
> {noformat}
> What happens if there is no available NDV? In the case, we skip the 
> selectivity calculation and leave it at a special default value of -1.0, 
> which seems to indicate "unknown". See 
> [{{BinaryPredicate.analyzeImpl()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java].
> Later, when we use the selectivity to calculate reduction factors, we simply 
> skip any node with a selectivity of -1. You can see that in 
> [{{PlanNode.computeCombinedSelectivity()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/PlanNode.java].
> The result is that Impala is a bit more strict than classic DB optimizers. If 
> stats are present, they are used. If stats are not present, Impala assumes 
> that predicates have no effect.
> This ticket proposal a number of interrelated changes to add a-priori (before 
> observation) defaults for selectivity and NDV based on classic DB practice.
> h4. Proposal: Add A-priori Selectivity Values
> But, we said earlier that users include a predicate because they expect it to 
> do something. So, we are actually discarding the (albeit vague) information 
> that the user provided.
> This is why many optimizers go ahead and assume a default 0.1 reduction 
> factor for equality predicates even if no stats are available. The first 
> proposal of this ticket is to use that default reduction factor even if no 
> stats are present. This says that some reduction will occur, but, to be 
> conservative, we assume not a huge reduction.
> h4. Proposal: Add Selectivity for All Predicate Operators
> As present, Impala computes reduction factors only for equality nodes. (See 
> IMPALA-7560.) The book suggests rule-of-thumb estimates for other operators:
>  * {{!=}} - 0.1
>  * {{<}}, {{<=}}, {{>}}, {{>=}} - 0.3
>  * {{BETWEEN}} - 0.25
> Over in the Drill project, DRILL-5254 attempted to work out better estimates 
> based on math and probability. However, the conclusion there was that, 
> without NDV and histograms, there is more information in the user's intent 
> than in the math. That is, if the user writes {{WHERE t.a != 10}}, there is a 
> conditional probability that the user believes that this is a highly 
> restrictive predicate, especially on big data. So, the reduction factor 
> (which is a probability) is the same for {{=}} and {{!=}} in the absence of 
> information. The same reasoning probably led to the rule-of-thumb values in 
> the 
> [R&G|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
>  book.
> So, the second proposal is that Impala use the classic numbers for other 
> operators when no stats are available.
> h4. Proposal: Use Stats-Based Selectivity Estimates When Available
> If stats are available, then we can "run the numbers" and get better 
> estimates:
>  * {{p(a = x)}} = 1 / NDV
>  * {{p(a != x)}} = {{1 - p(a = x)}} = 1 - 1 / NDV
> So, the third proposal is to use the above math for the {{!=}} case.
> h4. Proposal: Use Defaults when Stats are Unavailable or Not Useful
> As it turns out, without histograms, the NDV by itself gives us no 
> information by which to compute reduction for an inequality. So the, classic 
> a-priori defaults are the still best guess.
> This leads to the fourth proposal: that the classic defaults are used unless 
> stats provides additional information to use instead. Said another way, 
> retire the current selectivity = -1 ("undefined") concept: we will always 
> have a selectivity, even if just a rule-of-thumb one. That is, some 
> information is better than none.
> h4. Proposal: Define an A-priori NDV Estimate
> Impala uses NDV for a number of purposes. If no stats are available, Impala 
> assumes no NDV. Given the assumed reduction factor, we can guess an NDV = 
> 1/0.1 = 10. Depending on where NDV is used (need to research this), we might 
> want to choose some other value, perhaps 100. So, the fifth proposal is to 
> assume an a-priori (before observation) NDV value, with the actual default 
> value TBD.
> h4. Proposal: Separate the NDV=0 and Unknown NDV cases
> As described in IMPALA-7310, Impala does not currently handle the case of a 
> table with stats, but a column in the table contains only NULL values. The 
> stats mechanism defines HDV as "number of distinct non-null values". So, a 
> table full of nulls has NDV = 0. Unfortunately, if no stats are available at 
> all, we also have NDV = 0.
> So, the sixth proposal is to separate these cases. If we can tell that a 
> column is a nullable type, assume that the actual NDV is (stats NDV + 1). 
> (This is, in fact the NDV calc used by Impala itself when computing NDV for 
> expressions. See 
> [{{ExprNdvTest}}|https://github.com/apache/impala/blob/master/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java].
> To avoid impact to existing tests, apply this rule only when it makes an 
> impact, when the NDV value is, say, less than 10. (No reason to count the 
> null value if NDV is large.) So, a column with all nulls will have an 
> (adjusted) NDV = 1.
> Then, as noted above, if no stats exist, use the a-priori NDV.
> h4. References
>  * [Cost Estimation 
> Overview|https://courses.cs.washington.edu/courses/cse544/09wi/lecture-notes/lecture8/lecture8.pdf]
>  * [Reduction 
> factors|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
>  from the classic Ramakrishnan and Gehrke book.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to