Stamatis Zampetakis created HIVE-29479:
------------------------------------------
Summary: Improve histogram-based selectivity estimation for
two-sided range predicates
Key: HIVE-29479
URL: https://issues.apache.org/jira/browse/HIVE-29479
Project: Hive
Issue Type: Improvement
Components: CBO
Reporter: Stamatis Zampetakis
HIVE-26221 introduced histogram statistics to improve selectivity estimation
for range predicates.
The current estimation logic works reasonably well for:
* Single-sided range predicates (e.g., {{age > 30}})
* Closed bounded ranges (e.g., {{age BETWEEN 30 AND 40}})
However, selectivity estimation for general two-sided range predicates (i.e.,
predicates with both lower and upper bounds) can still be significantly
improved.
Consider the following examples:
* {{(30, 40]}} → {{age > 30 AND age <= 40}}
* {{(30, 40)}} → {{age > 30 AND age < 40}}
* {{[30, 40)}} → {{age >= 30 AND age < 40}}
Currently, the selectivity of such predicates is computed by multiplying the
selectivity of each conjunct independently.
For example, assume a dataset containing integer ages from 1 to 100 (inclusive).
For the predicate {{(30, 40]}}:
{noformat}
sel(age > 30) = 70 / 100 = 0.7
sel(age <= 40) = 40 / 100 = 0.4
Estimated selectivity = 0.7 * 0.4 = 0.28
{noformat}
However, the actual selectivity is: 10 / 100 = 0.1
The current approach implicitly assumes independence between the lower and
upper bound predicates, which leads to significant overestimation for narrow
ranges.
Since histogram statistics are already available, the correct selectivity can
be derived directly from the cumulative distribution defined by the histogram
bucket boundaries, instead of multiplying the two sides independently.
*Proposed Improvement:*
Enhance
[FilterSelectivityEstimator|https://github.com/apache/hive/blob/fb5e8703d915ec045a3c0148a765c9554b22bfe9/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L111]
to:
* Detect two-sided range predicates.
* Compute selectivity using histogram cumulative distribution rather than
independent multiplication.
* Correctly handle all combinations of inclusive and exclusive bounds.
Most of the required logic is already present. The improvement should primarily
involve refactoring and generalizing the existing range-handling code.
The {{SEARCH}} operator internally represents ranges and can be leveraged to
make the implementation more generic and robust.
As part of this change, the current {{SEARCH}} expansion performed by
{{SearchTransformer}} inside
[FilterSelectivityEstimator|https://github.com/apache/hive/blob/fb5e8703d915ec045a3c0148a765c9554b22bfe9/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L111]
will likely need to be removed or reworked to preserve range semantics during
estimation.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)