[
https://issues.apache.org/jira/browse/IMPALA-8042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837895#comment-17837895
]
David Rorke commented on IMPALA-8042:
-------------------------------------
There are specific cases for BETWEEN expressions where we should be able to
make a much more accurate selectivity estimate, in particular date columns (and
maybe other column types with discrete values) if we know or strongly suspect
the values are all unique (or at least very high NDV) and there are few
"missing" values. In cases like this we might simply assume that the number of
rows selected is the number of possible distinct values in the range. This can
be wrong in a couple cases:
* Duplicate values. So we should only apply this when we suspect uniqueness or
something close to uniqueness (very high NDV relative to the total row count).
* Missing values (again we can probably use some NDV-based heuristics to make
a good guess about whether most of the possible values are populated).
Even with some possible inaccuracy from these factors it's likely we can be
much more accurate using this approach under high NDV situations vs just using
the current 10 percent selectivity guess.
> Better selectivity estimate for BETWEEN
> ---------------------------------------
>
> Key: IMPALA-8042
> URL: https://issues.apache.org/jira/browse/IMPALA-8042
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Priority: Minor
>
> The analyzer rewrites a BETWEEN expression into a pair of inequalities.
> IMPALA-8037 explains that the planner then groups all such non-quality
> conditions together and assigns a selectivity of 0.1. IMPALA-8031 explains
> that the analyzer should handle inequalities better.
> BETWEEN is a special case and informs the final result. If we assume a
> selectivity of s for inequality, then BETWEEN should be something like s/2.
> The intuition is that if c >= x includes, say, ⅓ of values, and c <= y
> includes a third of values, then c BETWEEN x AND y should be a narrower set
> of values, say ⅙.
> [Ramakrishnan an
> Gherke|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html\
> recommend 0.4 for between, 0.3 for inequality, and 0.3^2 = 0.09 for the
> general expression x <= c AND c <= Y. Note the discrepancy between the
> compound inequality case and the BETWEEN case, likely reflecting the
> additional information we obtain when the user chooses to use BETWEEN.
> To implement a special BETWEEN selectivity in Impala, we must remember the
> selectivity of BETWEEN during the rewrite to a compound inequality.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]