Thomas Rebele created HIVE-29365:
------------------------------------

             Summary: Range predicate within the same histogram bucket leads to 
an estimate rowcount=1
                 Key: HIVE-29365
                 URL: https://issues.apache.org/jira/browse/HIVE-29365
             Project: Hive
          Issue Type: Bug
            Reporter: Thomas Rebele


The rowcount estimate is wrong for range predicates (comparison operators as 
well as BETWEEN) if both borders of the range are within the same bucket.

Reason: KllFloatsSketch#getCDF calls FloatsSketchSortedView#getRank. Neither 
interpolate the rank for a quantile between the two borders of a bucket. 
Therefore both estimated ranks are the same if the borders of the range are 
within the same bucket.
----
Example on a metastore dump of a TPC-DS 30TB cluster with histograms (tried on 
a commit of 2025-12-04, ca105f8124072d19d88a83b2ced613d326c9a26b): 

{*}Locating the border{*}:
{code:java}
explain cbo joincost select count(*) from customer where c_customer_sk < 79000;
explain cbo joincost select count(*) from customer where c_customer_sk < 
160500;{code}
Both give the same estimate for the rowcount of the filter {{{}rowcount = 
29696.000000000004{}}}. 

{*}Check the estimates{*}:
{code:java}
explain cbo joincost select count(*) from customer where c_customer_sk between 
79000 and 160500; {code}
returns a plan
{code:java}
| HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 
0.0 io}, id = 17183 |
|   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative cost 
= {0.0 rows, 0.0 cpu, 0.0 io}, id = 17181 |
|     HiveFilter(condition=[BETWEEN(false, $0, 79000:BIGINT, 160500:BIGINT)]): 
rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17180 |
|       HiveTableScan(table=[[default, customer]], table:alias=[customer]): 
rowcount = 8.0E7, cumulative cost = {0}, id = 17134 | {code}
Please note the estimation {{rowcount = 1.0}} of the HiveFilter. The same 
happens for range predicates with comparison operators:
{code:java}
explain cbo joincost select count(*) from customer where c_customer_sk >= 79000 
and c_customer_sk <= 160500; {code}
returns a plan
{code:java}
| HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 
0.0 io}, id = 17088 |
|   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative cost 
= {0.0 rows, 0.0 cpu, 0.0 io}, id = 17086 |
|     HiveFilter(condition=[BETWEEN(false, $0, 79000:BIGINT, 160500:BIGINT)]): 
rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17085 |
|       HiveTableScan(table=[[default, customer]], table:alias=[customer]): 
rowcount = 8.0E7, cumulative cost = {0}, id = 17039 | {code}
*Compare with the expected result*

Executing the SELECT query using Trino DB gives the following result:
{code:java}
use tpcds.sf30000;
trino:sf30000> select count(*) from customer where c_customer_sk between 79000 
and 160500;
 _col0 
-------
 81501 {code}
So the estimates should be around {{rowcount = 81501.0}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to