[ 
https://issues.apache.org/jira/browse/HIVE-29365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18044748#comment-18044748
 ] 

Alexander Saydakov commented on HIVE-29365:
-------------------------------------------

Here is the documentation regarding accuracy:
[https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html]

For the default k=200 we should expect rank error within 1.33% and double-sided 
error (mass in a bucket) within 1.65%

True rank of 79000 equals the fraction of the mass of the distribution <= this 
value: 79000 / 80000000 = 0.0009875

True rank of 160500 is 0.00200625
The true mass between these values is 0.00200625 - 0.0009875 = 0.00101875
With k=200 we should expect rank estimates within +-0.0133 and mass estimates 
+-0.0165
So the true mass of 0.00101875 is way below possible sketch error, and 
therefore not distinguishable from 0.

> Range predicate within the same histogram bucket leads to an estimate 
> rowcount=1
> --------------------------------------------------------------------------------
>
>                 Key: HIVE-29365
>                 URL: https://issues.apache.org/jira/browse/HIVE-29365
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.2.0
>            Reporter: Thomas Rebele
>            Priority: Major
>
> The rowcount estimate is wrong for range predicates (comparison operators as 
> well as BETWEEN) if both borders of the range are within the same bucket.
> Reason: KllFloatsSketch#getCDF calls FloatsSketchSortedView#getRank. Neither 
> interpolate the rank for a quantile between the two borders of a bucket. 
> Therefore both estimated ranks are the same if the borders of the range are 
> within the same bucket. The Wikipedia page Percentile has a [section on 
> interpolation|https://en.wikipedia.org/w/index.php?title=Percentile&oldid=1325431679#The_linear_interpolation_between_closest_ranks_method].
> ----
> Example on a metastore dump of a TPC-DS 30TB cluster with histograms (tried 
> on a commit of 2025-12-04, ca105f8124072d19d88a83b2ced613d326c9a26b): 
> {*}Locating the border{*}:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk < 
> 79000;
> explain cbo joincost select count(*) from customer where c_customer_sk < 
> 160500;{code}
> Both give the same estimate for the rowcount of the filter {{{}rowcount = 
> 29696.000000000004{}}}. 
> {*}Check the estimates{*}:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk 
> between 79000 and 160500; {code}
> returns a plan
> {code:java}
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 
> cpu, 0.0 io}, id = 17183 |
> |   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative 
> cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17181 |
> |     HiveFilter(condition=[BETWEEN(false, $0, 79000:BIGINT, 
> 160500:BIGINT)]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 
> io}, id = 17180 |
> |       HiveTableScan(table=[[default, customer]], table:alias=[customer]): 
> rowcount = 8.0E7, cumulative cost = {0}, id = 17134 | {code}
> Please note the estimation {{rowcount = 1.0}} of the HiveFilter. The same 
> happens for range predicates with comparison operators:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk >= 
> 79000 and c_customer_sk <= 160500; {code}
> returns a plan
> {code:java}
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 
> cpu, 0.0 io}, id = 17088 |
> |   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative 
> cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17086 |
> |     HiveFilter(condition=[BETWEEN(false, $0, 79000:BIGINT, 
> 160500:BIGINT)]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 
> io}, id = 17085 |
> |       HiveTableScan(table=[[default, customer]], table:alias=[customer]): 
> rowcount = 8.0E7, cumulative cost = {0}, id = 17039 | {code}
> *Compare with the expected result*
> Executing the SELECT query using Trino DB gives the following result:
> {code:java}
> use tpcds.sf30000;
> trino:sf30000> select count(*) from customer where c_customer_sk between 
> 79000 and 160500;
>  _col0 
> -------
>  81501 {code}
> So the estimates should be around {{rowcount = 81501.0}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to