[
https://issues.apache.org/jira/browse/IMPALA-7601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zoltán Borók-Nagy updated IMPALA-7601:
--------------------------------------
Target Version: Impala 5.0.0
> Improve cardinality and selectivity estimates
> ---------------------------------------------
>
> Key: IMPALA-7601
> URL: https://issues.apache.org/jira/browse/IMPALA-7601
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 3.0
> Reporter: Paul Rogers
> Priority: Major
>
> Impala makes extensive use of table stats during query planning. For example,
> the NDV (number of distinct values) is used to compute _selectivity_, the
> degree of reduction (also called the _reduction factor_) provided by a
> predicate. Consider:
> {noformat}
> SELECT * FROM t WHERE t.a = 10
> {noformat}
> If we know that {{t.a}} has an NDV of 100, then we can predict (given a
> uniform distribution of values) that the above query will pick out one of
> these 100 values, and that the reduction factor is 1/100 = 0.01. Thus the
> selectivity of the predicate {{t.a = 10}} is 0.01.
> Selectivity is then used to compute _cardinality_, the number of rows in some
> relation (set of tuples). Cardinality is used to pick among plan options,
> such as which relation to put on the probe vs. build side of a join.
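> For example, if {{t}} has, say, 1,000,000 rows (an illustrative number, not
> from this ticket), the predicate above yields:
> {noformat}
> cardinality = 1,000,000 rows * selectivity 0.01 = 10,000 rows
> {noformat}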
> This ticket explains several cases in which Impala fails to produce a
> cardinality estimate, and proposes solutions.
> h4. Problem Summary
> The goal of a planner is to estimate the cardinality of various relations
> (sets of tuples) in order to choose a good plan (which relation goes on the
> build side of a join?) and to make memory estimates (how much memory is
> needed for a sort?). When planning, even crude estimates are fine, as the
> planner only chooses among a discrete set of options: the smaller table goes
> on the build side of a join; it does not matter _how much_ smaller one table
> is than the other.
> It turns out, however, that Impala is quite finicky: it refuses to render a
> cardinality estimate if certain information is missing. This means that,
> rather than use a poor estimate, Impala uses no estimate at all. Rather than
> produce a crude attempt at a plan, Impala flies blind in these cases.
> You can see this in
> [{{PlanNode.computeCombinedSelectivity()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/PlanNode.java].
> The result is that Impala is a bit more strict than classic DB optimizers. If
> stats are present, they are used. If stats are not present, Impala flies
> blind.
> This ticket proposes a number of interrelated changes to add a-priori (before
> observation) defaults for selectivity and NDV, based on classic DB practice.
> Also, some of our existing estimates are overly simplified, as discussed
> below.
> h4. A Bit of Math
> Selectivity is a probability: the probability that a Boolean expression
> evaluates to true. That is:
> {noformat}
> selectivity(c = x) = p((c = x) = true) = p(c = x)
> {noformat}
> Where {{c}} is a column and {{x}} is a constant.
> Knowing this can help pick good selectivity values, and makes the following
> discussion a bit simpler.
> h4. Use NDV to Compute {{p(c != x)}}
> Impala uses NDV to compute {{p(c = x)}}:
> {noformat}
> p(c = x) = 1 / NDV(c)
> {noformat}
> As described in IMPALA-7560, Impala uses a selectivity of 0.1 for {{p(c !=
> x)}}. There is, however, a mathematical relationship between these operators
> we can exploit:
> {noformat}
> p(c != x) = 1 - p(c = x)
> {noformat}
> That is, if we have {{NDV(c)}} and can compute {{p(c = x)}}, we can also
> compute {{p(c != x)}}.
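> As a minimal sketch (the method names here are illustrative, not Impala's
> actual API), both estimates follow from a single NDV value:
> {noformat}
> // Sketch: derive both estimates from one NDV value. Illustrative only.
> // Assumes ndv > 0 (i.e., stats exist and the column has non-null values).
> static double eqSelectivity(long ndv) {
>   return 1.0 / ndv;                  // p(c = x) = 1 / NDV(c)
> }
>
> static double neSelectivity(long ndv) {
>   return 1.0 - eqSelectivity(ndv);   // p(c != x) = 1 - p(c = x)
> }
> {noformat}
> With {{NDV(c)}} = 100, this gives {{p(c = x)}} = 0.01 and {{p(c != x)}} =
> 0.99, rather than the current fixed 0.1.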
> h4. Refine Default Selectivity Values
> Further, Impala assumes a selectivity of 0.1 for all other relational
> operators except equality. See
> [Impala|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/Expr.java]:
> {noformat}
> // To be used where we cannot come up with a
> // better estimate (selectivity_ is -1).
> public static double DEFAULT_SELECTIVITY = 0.1;
> {noformat}
> There is another mathematical relationship between these operators we can
> exploit to refine some of these estimates:
> {noformat}
> p(c != x) = p(c < x) + p(c > x)
> {noformat}
> As it turns out, we don't need the NDV to estimate the selectivity of an
> inequality. We can simply observe that the inequalities roughly divide the
> data into two sets, of which the user picks one. The current default of 0.1
> is probably too low, and 0.5 is probably too high. The [Ramakrishnan and
> Gehrke
> book|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> suggests rule-of-thumb estimates (sketched in code after the list below):
> * {{p(c < x)}}, {{p(c <= x)}}, {{p(c > x)}}, {{p(c >= x)}} - 0.3
> * {{p(c BETWEEN x AND y)}} - 0.25
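> Taken together with the existing default, these rules of thumb amount to a
> simple lookup by operator. The code below is an illustrative sketch, not
> Impala's actual code:
> {noformat}
> // Sketch: rule-of-thumb default selectivities when stats are missing.
> // Illustrative only; the values come from the R&G cost formulas above.
> static double defaultSelectivity(String op) {
>   switch (op) {
>     case "<": case "<=": case ">": case ">=":
>       return 0.3;   // range predicates
>     case "BETWEEN":
>       return 0.25;
>     default:
>       return 0.1;   // the existing Expr.DEFAULT_SELECTIVITY
>   }
> }
> {noformat}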
> h4. Default Selectivity for {{p(c = x)}} and {{p(c != x)}} Without Stats
> All this is good. But, what happens if statistics are not available for table
> {{t}}? How are we to know the selectivity of the predicate?
> Today, Impala simply refuses to pick a selectivity for {{p(c = x)}} if there
> are no stats. See
> [{{BinaryPredicate.analyzeImpl()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java].
> However, even without stats, Impala continues to use its defaults for all
> other relational operators. While it is true that, without NDV, we can't be
> certain of the selectivity of {{p(c = x)}}, it is also true that we are happy
> to guess for all other operators.
> The [Ramakrishnan and Gehrke
> book|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> shows that the standard choice is to assume a selectivity of 0.1.
> Over in the Drill project, DRILL-5254 attempted to work out better estimates
> for other operators based on math and probability. However, the conclusion
> there was that, without NDV and histograms, there is more information in the
> user's intent than in the math. That is, if the user writes {{WHERE t.a !=
> 10}}, it is quite likely that the user believes this to be a highly
> restrictive predicate, especially on big data. So, the reduction
> factor (which is a probability) is the same for {{=}} and {{!=}} in the
> absence of information. The same reasoning probably led to the rule-of-thumb
> values in the
> [R&G|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> book.
> So, if no stats are available, use a default number for {{p(c != x)}} such as
> 0.1 or 0.7. (The larger number is probably better.)
> As a result of this change, we will always have a selectivity, even if just a
> rule-of-thumb one. That is, some information is better than none.
> h4. Special Case Boolean Columns
> There is one special case it may be worth exploiting. If a column is Boolean,
> then we know that it can have at most three values (true, false, null). So,
> even without stats, we can estimate {{NDV(c)}} = 3. This allows us to get
> better estimates for {{p(c = x)}} and {{p(c != x)}} than we'd get just using
> the defaults.
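> A sketch of the idea (illustrative only):
> {noformat}
> // Sketch: a-priori NDV for a Boolean column, even with no stats.
> // At most true, false, and null, per the reasoning above.
> static final long BOOLEAN_NDV = 3;
>
> // Then, from the earlier formulas:
> //   p(c = x)  = 1 / 3   = 0.33
> //   p(c != x) = 1 - 1/3 = 0.67
> {noformat}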
> h4. Separate the NDV=0 and Unknown NDV Cases
> As described in IMPALA-7310, Impala does not currently handle the case of a
> table with stats, but a column in the table contains only NULL values. The
> stats mechanism defines NDV as "number of distinct non-null values". So, a
> table full of nulls has NDV = 0. Unfortunately, if no stats are available at
> all, we also have NDV = 0.
> IMPALA-7310 will fix this issue, removing one more case in which Impala does
> not produce a cardinality estimate.
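> A sketch of keeping the two cases distinct (the -1 sentinel mirrors the
> existing "undefined cardinality" convention; names are illustrative):
> {noformat}
> // Sketch: distinguish "no stats" from "all values are NULL".
> // Illustrative only; see IMPALA-7310 for the actual fix.
> static final long NDV_UNKNOWN = -1;  // no stats available
>
> static boolean hasNdvStats(long ndv) { return ndv != NDV_UNKNOWN; }
> static boolean isAllNulls(long ndv)  { return ndv == 0; }
> {noformat}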
> h4. Rationalize the Planner's and Stats' NDV Definitions
> Related to the above issue is the fact that the planner and the stats
> mechanism use different definitions for NDV.
> * Planner: NDV includes nulls. This is seen when computing the NDV for
> constant expressions.
> * NDV function, and thus stats: NDV does not include nulls.
> Currently, the planner takes the stats mechanism's "without nulls" NDV and
> mixes it with the planner's "with nulls" definition. IMPALA-7310 proposes a
> way to convert the stats definition to the planner's definition.
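> Assuming stats also record a null count, the conversion could be as simple
> as the sketch below (illustrative; IMPALA-7310 describes the actual
> proposal):
> {noformat}
> // Sketch: convert the stats NDV (non-null values only) to the
> // planner's NDV (which counts NULL as one more distinct value).
> static long plannerNdv(long statsNdv, long numNulls) {
>   return statsNdv + (numNulls > 0 ? 1 : 0);
> }
> {noformat}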
> h4. Estimate Row Counts
> The above will go a long way to ensure that the planner always renders a
> cardinality estimate. But, one hole remains. IMPALA-7608 says that
> cardinality estimates are undefined if stats are unavailable because we don't
> know row counts. IMPALA-7608 explains a useful, common rule-of-thumb. Add
> that and we'd have covered all situations in which Impala currently refuses
> to produce a cardinality estimate.
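> A common rule of thumb of this kind is to divide the scanned bytes by an
> estimated row width; a sketch, assuming that is the shape of the IMPALA-7608
> proposal:
> {noformat}
> // Sketch: estimate row count from data size when stats are missing.
> // Assumes the rule of thumb is of this form; see IMPALA-7608 for details.
> static long estimateRowCount(long totalBytes, long estRowWidthBytes) {
>   return Math.max(1, totalBytes / Math.max(1, estRowWidthBytes));
> }
> {noformat}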
> h4. Handle Large Cardinality
> IMPALA-7604 notes that we use a {{long}} to hold a cardinality estimate. This
> can overflow, causing incorrect estimates. As we make the above improvements,
> we will put more confidence in the cardinality estimates, so we need to fix
> the overflow: wrong numbers in extreme cases undermine the value of the
> estimates.
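> One standard fix is saturating arithmetic, so an overflowing estimate pins at
> {{Long.MAX_VALUE}} instead of wrapping negative. A sketch (not Impala's
> actual code):
> {noformat}
> // Sketch: saturating multiply for cardinality estimates. On overflow,
> // pin at Long.MAX_VALUE rather than wrapping around.
> static long saturatingMultiply(long a, long b) {
>   try {
>     return Math.multiplyExact(a, b);  // throws ArithmeticException on overflow
>   } catch (ArithmeticException e) {
>     return Long.MAX_VALUE;            // assumes non-negative inputs
>   }
> }
> {noformat}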
> h4. Remove Special Handling for Undefined Selectivity and Cardinality
> Once all of the above are available, every plan node will compute a
> cardinality. The existing code to handle cardinality = -1 (undefined) can be
> removed. Tests can be updated to enforce that all plans will have an
> estimated cardinality.
> h4. References
> * [Cost Estimation
> Overview|https://courses.cs.washington.edu/courses/cse544/09wi/lecture-notes/lecture8/lecture8.pdf]
> * [Reduction
> factors|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> from the classic Ramakrishnan and Gehrke book.