[jira] [Commented] (HIVE-29365) Range predicate within the same histogram bucket leads to an estimate rowcount=1

Thomas Rebele (Jira) Fri, 12 Dec 2025 01:08:51 -0800


    [ 
https://issues.apache.org/jira/browse/HIVE-29365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18044627#comment-18044627
 ]


Thomas Rebele commented on HIVE-29365:
--------------------------------------

Do you mean the true values or the ones from the histogram? Here are the true 
values:
||predicate||expected result from Trino DB||
|total count|80000000|
|c_customer_sk < 79000|78999|
|c_customer_sk <= 79000|79000|
|c_customer_sk < 160500|160499|
|c_customer_sk <= 160500|160500|

I think the sk stands for "surrogate key", so an artificially generated key. 
The c_customer_sk are filled with consecutive integers.

> Range predicate within the same histogram bucket leads to an estimate 
> rowcount=1
> --------------------------------------------------------------------------------
>
>                 Key: HIVE-29365
>                 URL: https://issues.apache.org/jira/browse/HIVE-29365
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.2.0
>            Reporter: Thomas Rebele
>            Priority: Major
>
> The rowcount estimate is wrong for range predicates (comparison operators as 
> well as BETWEEN) if both borders of the range are within the same bucket.
> Reason: KllFloatsSketch#getCDF calls FloatsSketchSortedView#getRank. Neither 
> interpolate the rank for a quantile between the two borders of a bucket. 
> Therefore both estimated ranks are the same if the borders of the range are 
> within the same bucket. The Wikipedia page Percentile has a [section on 
> interpolation|https://en.wikipedia.org/w/index.php?title=Percentile&oldid=1325431679#The_linear_interpolation_between_closest_ranks_method].
> ----
> Example on a metastore dump of a TPC-DS 30TB cluster with histograms (tried 
> on a commit of 2025-12-04, ca105f8124072d19d88a83b2ced613d326c9a26b): 
> {*}Locating the border{*}:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk < 
> 79000;
> explain cbo joincost select count(*) from customer where c_customer_sk < 
> 160500;{code}
> Both give the same estimate for the rowcount of the filter {{{}rowcount = 
> 29696.000000000004{}}}. 
> {*}Check the estimates{*}:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk 
> between 79000 and 160500; {code}
> returns a plan
> {code:java}
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 
> cpu, 0.0 io}, id = 17183 |
> |   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative 
> cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17181 |
> |     HiveFilter(condition=[BETWEEN(false, $0, 79000:BIGINT, 
> 160500:BIGINT)]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 
> io}, id = 17180 |
> |       HiveTableScan(table=[[default, customer]], table:alias=[customer]): 
> rowcount = 8.0E7, cumulative cost = {0}, id = 17134 | {code}
> Please note the estimation {{rowcount = 1.0}} of the HiveFilter. The same 
> happens for range predicates with comparison operators:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk >= 
> 79000 and c_customer_sk <= 160500; {code}
> returns a plan
> {code:java}
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 
> cpu, 0.0 io}, id = 17088 |
> |   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative 
> cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17086 |
> |     HiveFilter(condition=[BETWEEN(false, $0, 79000:BIGINT, 
> 160500:BIGINT)]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 
> io}, id = 17085 |
> |       HiveTableScan(table=[[default, customer]], table:alias=[customer]): 
> rowcount = 8.0E7, cumulative cost = {0}, id = 17039 | {code}
> *Compare with the expected result*
> Executing the SELECT query using Trino DB gives the following result:
> {code:java}
> use tpcds.sf30000;
> trino:sf30000> select count(*) from customer where c_customer_sk between 
> 79000 and 160500;
>  _col0 
> -------
>  81501 {code}
> So the estimates should be around {{rowcount = 81501.0}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (HIVE-29365) Range predicate within the same histogram bucket leads to an estimate rowcount=1

Reply via email to