[ https://issues.apache.org/jira/browse/HIVE-29365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050206#comment-18050206 ]

Alexander Saydakov commented on HIVE-29365:
-------------------------------------------

Regarding the KLL sketch: the set of retained values in the sketch is always a 
subset of the original input. The algorithm assumes nothing about the input 
values; they don't even have to be numeric. The only requirement is a "less 
than" comparator, even if the sketch is used for a numeric type. When a sketch 
is saturated and has to discard some values, it assigns the weight of the 
discarded values to adjacent retained values; hence the plateaus. The resulting 
set of retained values is provably an optimal representation of the input, 
given the approach of not assuming any distribution and not even having a 
notion of distance in the input domain. The rank error is the same across the 
whole range from 0 to 1.
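To make the plateau behavior concrete, here is a small self-contained Java 
sketch. It is illustrative only: the retained values and weights are invented, 
and this is not the DataSketches implementation, just the rank rule described 
above applied to a tiny weighted retained set.

```java
// Illustrative stand-in for a saturated sketch: a sorted set of retained
// values, each carrying the weight of the discarded inputs folded into it.
public class PlateauDemo {
    static final float[] VALUES  = {10f, 20f, 30f, 40f};
    static final long[]  WEIGHTS = { 5L, 25L, 40L, 30L}; // total weight = 100

    // Rank of x = fraction of total weight held by retained values <= x.
    // No interpolation: the rank is a step function of x.
    static double rank(float x) {
        long total = 0, atOrBelow = 0;
        for (int i = 0; i < VALUES.length; i++) {
            total += WEIGHTS[i];
            if (VALUES[i] <= x) atOrBelow += WEIGHTS[i];
        }
        return (double) atOrBelow / total;
    }

    public static void main(String[] args) {
        // 21 and 29 both fall strictly between the retained values 20 and
        // 30, so they sit on the same plateau and get the identical rank.
        System.out.println(rank(21f)); // 0.3
        System.out.println(rank(29f)); // 0.3
        System.out.println(rank(30f)); // 0.7
    }
}
```

Two query points inside the same gap get the same rank, which is why a range 
predicate whose borders both fall inside one gap estimates zero selectivity.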
Regarding t-digest: it is a very different algorithm. It is defined for a 
numeric domain only, and it does modify the input values. The resulting set of 
so-called centroids is not a subset of the input values (though some centroids 
may occasionally coincide with inputs). There is no proof of its error bounds. 
In the current implementation the rank error is smaller at the extremes of the 
range (close to 0 and 1) and larger in the middle (around the median, 0.5).
Regarding the CDF: yes, it does look smooth, but that is not enough to 
conclude that it is closer to the true CDF than KLL's (and it is tricky to 
define a measure of closeness). It might be, but there is no proof.
It is up to you to decide which one works better for your use case.
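As context for the interpolation idea the report points to, here is a hedged 
Java sketch of linear interpolation between closest ranks over a weighted 
retained set. It is illustrative only: the values and weights are invented, 
and this is neither the Hive nor the DataSketches code.

```java
// Illustrative only: rank with linear interpolation between the cumulative
// ranks of the two retained values that bracket the query point.
public class InterpolatedRankDemo {
    static final float[] VALUES  = {10f, 20f, 30f, 40f};
    static final long[]  WEIGHTS = { 5L, 25L, 40L, 30L}; // total weight = 100

    static double interpolatedRank(float x) {
        int n = VALUES.length;
        long total = 0;
        for (long w : WEIGHTS) total += w;
        double[] cum = new double[n];          // cumulative rank at each value
        long run = 0;
        for (int i = 0; i < n; i++) {
            run += WEIGHTS[i];
            cum[i] = (double) run / total;
        }
        if (x < VALUES[0]) return 0.0;
        if (x >= VALUES[n - 1]) return 1.0;
        int i = 0;
        while (VALUES[i + 1] <= x) i++;        // VALUES[i] <= x < VALUES[i+1]
        double t = (x - VALUES[i]) / (double) (VALUES[i + 1] - VALUES[i]);
        return cum[i] + t * (cum[i + 1] - cum[i]);
    }

    public static void main(String[] args) {
        // The two borders now get different ranks, so the estimated
        // selectivity of "between 21 and 29" is about 0.66 - 0.34 = 0.32
        // instead of 0.
        System.out.println(interpolatedRank(21f)); // ≈ 0.34
        System.out.println(interpolatedRank(29f)); // ≈ 0.66
    }
}
```

Note the trade-off the comment above implies: this interpolation silently 
assumes a locally uniform distribution inside each gap, which KLL itself 
deliberately does not assume.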

> Range predicate within the same histogram bucket leads to an estimate 
> rowcount=1
> --------------------------------------------------------------------------------
>
>                 Key: HIVE-29365
>                 URL: https://issues.apache.org/jira/browse/HIVE-29365
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.2.0
>            Reporter: Thomas Rebele
>            Priority: Major
>         Attachments: expected.png, kll-k8.png, tdigest-interpolated-k10.png
>
>
> The rowcount estimate is wrong for range predicates (comparison operators as 
> well as BETWEEN) if both borders of the range are within the same bucket.
> Reason: KllFloatsSketch#getCDF calls FloatsSketchSortedView#getRank. Neither 
> interpolates the rank for a quantile between the two borders of a bucket. 
> Therefore both estimated ranks are the same if the borders of the range fall 
> within the same bucket. The Wikipedia page Percentile has a [section on 
> interpolation|https://en.wikipedia.org/w/index.php?title=Percentile&oldid=1325431679#The_linear_interpolation_between_closest_ranks_method].
> ----
> Example on a metastore dump of a TPC-DS 30TB cluster with histograms (tried 
> on a commit of 2025-12-04, ca105f8124072d19d88a83b2ced613d326c9a26b): 
> {*}Locating the border{*}:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk < 
> 79000;
> explain cbo joincost select count(*) from customer where c_customer_sk < 
> 160500;{code}
> Both give the same estimate for the rowcount of the filter {{{}rowcount = 
> 29696.000000000004{}}}. 
> {*}Check the estimates{*}:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk 
> between 79000 and 160500; {code}
> returns a plan
> {code:java}
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 
> cpu, 0.0 io}, id = 17183 |
> |   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative 
> cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17181 |
> |     HiveFilter(condition=[BETWEEN(false, $0, 79000:BIGINT, 
> 160500:BIGINT)]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 
> io}, id = 17180 |
> |       HiveTableScan(table=[[default, customer]], table:alias=[customer]): 
> rowcount = 8.0E7, cumulative cost = {0}, id = 17134 | {code}
> Please note the estimation {{rowcount = 1.0}} of the HiveFilter. The same 
> happens for range predicates with comparison operators:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk >= 
> 79000 and c_customer_sk <= 160500; {code}
> returns a plan
> {code:java}
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 
> cpu, 0.0 io}, id = 17088 |
> |   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative 
> cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17086 |
> |     HiveFilter(condition=[BETWEEN(false, $0, 79000:BIGINT, 
> 160500:BIGINT)]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 
> io}, id = 17085 |
> |       HiveTableScan(table=[[default, customer]], table:alias=[customer]): 
> rowcount = 8.0E7, cumulative cost = {0}, id = 17039 | {code}
> *Compare with the expected result*
> Executing the SELECT query using Trino DB gives the following result:
> {code:java}
> use tpcds.sf30000;
> trino:sf30000> select count(*) from customer where c_customer_sk between 
> 79000 and 160500;
>  _col0 
> -------
>  81501 {code}
> So the estimate should be around {{rowcount = 81501.0}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
