[ https://issues.apache.org/jira/browse/HIVE-29365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050172#comment-18050172 ]

Thomas Rebele commented on HIVE-29365:
--------------------------------------

Thank you for your explanations, Alexander!

I've revisited the interpolation approach. It is more important to get the 
order of magnitude right than the exact value, so a low multiplicative error 
would be preferable to the additive error guarantee of the KLL sketch. I did 
some experiments, and at first interpolation seemed to reduce the error quite 
substantially.
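
For reference, the interpolation I tried is roughly the following (a simplified 
sketch of the "linear interpolation between closest ranks" method, applied to 
the arrays that the sketch's sorted view exposes; the names are illustrative, 
not the actual patch):
{code:java}
// Simplified sketch of "linear interpolation between closest ranks".
// quantiles[] and cumWeights[] stand for the arrays of the sketch's sorted
// view; by convention cumWeights[last] equals the total weight n.
static double interpolatedRank(float value, float[] quantiles, long[] cumWeights) {
    long n = cumWeights[cumWeights.length - 1];
    if (value <= quantiles[0]) return 0.0;
    if (value >= quantiles[quantiles.length - 1]) return 1.0;
    int i = 0;
    while (value >= quantiles[i + 1]) i++; // quantiles[i] <= value < quantiles[i+1]
    double r0 = (double) cumWeights[i] / n;
    double r1 = (double) cumWeights[i + 1] / n;
    double t = (value - quantiles[i]) / (double) (quantiles[i + 1] - quantiles[i]);
    return r0 + t * (r1 - r0); // rank interpolated linearly within the bucket
}
{code}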

However, I stumbled upon a case where the interpolated selectivity of a range 
is far too low. The cause is that the quantiles and cumulative weights of the 
KLL sketch are themselves only approximations (not samples of the true values, 
as I had wrongly assumed). The cumulative weights form several steps (like 
terraces in geology). If both endpoints of the range lie within one of these 
steps, interpolation does not help.

The following diagrams might make this clearer. I've generated random values 
with a Gaussian distribution; here is the cumulative distribution function:

!expected.png|width=480,height=298!

I've generated a KLL sketch from them (with k=8 to better illustrate the issue):
!kll-k8.png|width=468,height=299!
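
For reference, the experiment was roughly the following (the exact DataSketches 
API may differ between versions; the mean and standard deviation are 
illustrative):
{code:java}
import java.util.Random;
import org.apache.datasketches.kll.KllFloatsSketch;

public class KllCdfSteps {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        KllFloatsSketch sketch = KllFloatsSketch.newHeapInstance(8); // k=8 as above
        for (int i = 0; i < 1_000_000; i++) {
            sketch.update((float) (1000 + 100 * rnd.nextGaussian()));
        }
        // Printing the estimated CDF on a grid shows the plateaus ("terraces")
        // followed by jumps, as in the diagram above.
        for (float x = 700; x <= 1300; x += 10) {
            System.out.printf("%.0f %.4f%n", x, sketch.getRank(x));
        }
    }
}
{code}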

For example, there's a plateau from (933, 0.2671) to (947, 0.2676), followed by 
a jump to (950, 0.3085). Unfortunately, I don't see an easy workaround for this 
problem.
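
To make this concrete: a range like [935, 945] lies entirely inside that 
plateau, so linear interpolation estimates its selectivity as about 
(945 - 935) / (947 - 933) * (0.2676 - 0.2671) ≈ 0.0004. Assuming the true CDF 
is roughly linear between 933 and 950, the true selectivity is about 
(0.3085 - 0.2671) / (950 - 933) * 10 ≈ 0.024, so the interpolated estimate is 
off by about a factor of 60.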

For comparison, I've evaluated a t-digest sketch (with k=10) at the quantiles 
of the diagrams above; the resulting curve follows the shape of the original 
CDF. The sizes of the sketches are comparable (KLL: 472 bytes, t-digest: 432 
bytes):

!tdigest-interpolated-k10.png|width=478,height=300!
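
For reference, the t-digest evaluation was along these lines (assuming Ted 
Dunning's t-digest Java library, with its compression parameter standing in 
for k):
{code:java}
import java.util.Random;
import com.tdunning.math.stats.MergingDigest;

public class TDigestCdf {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        MergingDigest digest = new MergingDigest(10); // compression ~ "k=10"
        for (int i = 0; i < 1_000_000; i++) {
            digest.add(1000 + 100 * rnd.nextGaussian());
        }
        // t-digest interpolates between centroids, so cdf() follows the smooth
        // Gaussian CDF instead of the step pattern of the KLL sketch above.
        for (double x = 700; x <= 1300; x += 10) {
            System.out.printf("%.0f %.4f%n", x, digest.cdf(x));
        }
    }
}
{code}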

> Range predicate within the same histogram bucket leads to an estimate 
> rowcount=1
> --------------------------------------------------------------------------------
>
>                 Key: HIVE-29365
>                 URL: https://issues.apache.org/jira/browse/HIVE-29365
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.2.0
>            Reporter: Thomas Rebele
>            Priority: Major
>         Attachments: expected.png, kll-k8.png, tdigest-interpolated-k10.png
>
>
> The rowcount estimate is wrong for range predicates (comparison operators as 
> well as BETWEEN) if both borders of the range are within the same bucket.
> Reason: KllFloatsSketch#getCDF calls FloatsSketchSortedView#getRank. Neither 
> interpolates the rank for a quantile that lies between the two borders of a 
> bucket. Therefore both estimated ranks are the same if the borders of the 
> range are within the same bucket. The Wikipedia page Percentile has a [section 
> on interpolation|https://en.wikipedia.org/w/index.php?title=Percentile&oldid=1325431679#The_linear_interpolation_between_closest_ranks_method].
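> The missing interpolation can be demonstrated directly on the sketch 
> (illustrative snippet; whether two values share a bucket depends on the data 
> and on k):
> {code:java}
> KllFloatsSketch sketch = KllFloatsSketch.newHeapInstance(200);
> for (int i = 1; i <= 80_000_000; i++) sketch.update(i); // stand-in for c_customer_sk
> // If both borders fall between the same two retained quantiles, their ranks
> // are identical, so the estimated mass of the range collapses to 0:
> double[] cdf = sketch.getCDF(new float[] {79000f, 160500f});
> System.out.println(cdf[1] - cdf[0]); // 0.0 when both borders share a bucket {code}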
> ----
> Example on a metastore dump of a TPC-DS 30TB cluster with histograms (tried 
> on commit ca105f8124072d19d88a83b2ced613d326c9a26b of 2025-12-04):
> {*}Locating the border{*}:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk < 79000;
> explain cbo joincost select count(*) from customer where c_customer_sk < 160500;{code}
> Both give the same estimate for the rowcount of the filter: {{rowcount = 29696.000000000004}}.
> {*}Check the estimates{*}:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk between 79000 and 160500; {code}
> returns a plan
> {code:java}
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17183 |
> |   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17181 |
> |     HiveFilter(condition=[BETWEEN(false, $0, 79000:BIGINT, 160500:BIGINT)]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17180 |
> |       HiveTableScan(table=[[default, customer]], table:alias=[customer]): rowcount = 8.0E7, cumulative cost = {0}, id = 17134 | {code}
> Please note the estimate {{rowcount = 1.0}} of the HiveFilter. The same 
> happens for range predicates with comparison operators:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk >= 79000 and c_customer_sk <= 160500; {code}
> returns a plan
> {code:java}
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17088 |
> |   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17086 |
> |     HiveFilter(condition=[BETWEEN(false, $0, 79000:BIGINT, 160500:BIGINT)]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17085 |
> |       HiveTableScan(table=[[default, customer]], table:alias=[customer]): rowcount = 8.0E7, cumulative cost = {0}, id = 17039 | {code}
> *Compare with the expected result*
> Executing the SELECT query in Trino gives the following result:
> {code:java}
> use tpcds.sf30000;
> trino:sf30000> select count(*) from customer where c_customer_sk between 79000 and 160500;
>  _col0 
> -------
>  81501 {code}
> So the estimate should be around {{rowcount = 81501.0}}.


