[
https://issues.apache.org/jira/browse/HIVE-29365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18044706#comment-18044706
]
Alexander Saydakov commented on HIVE-29365:
-------------------------------------------
what is your parameter k for KLL sketch?
> Range predicate within the same histogram bucket leads to an estimate
> rowcount=1
> --------------------------------------------------------------------------------
>
> Key: HIVE-29365
> URL: https://issues.apache.org/jira/browse/HIVE-29365
> Project: Hive
> Issue Type: Bug
> Affects Versions: 4.2.0
> Reporter: Thomas Rebele
> Priority: Major
>
> The rowcount estimate is wrong for range predicates (comparison operators as
> well as BETWEEN) if both borders of the range are within the same bucket.
> Reason: KllFloatsSketch#getCDF calls FloatsSketchSortedView#getRank. Neither
> interpolate the rank for a quantile between the two borders of a bucket.
> Therefore both estimated ranks are the same if the borders of the range are
> within the same bucket. The Wikipedia page Percentile has a [section on
> interpolation|https://en.wikipedia.org/w/index.php?title=Percentile&oldid=1325431679#The_linear_interpolation_between_closest_ranks_method].
> ----
> Example on a metastore dump of a TPC-DS 30TB cluster with histograms (tried
> on a commit of 2025-12-04, ca105f8124072d19d88a83b2ced613d326c9a26b):
> {*}Locating the border{*}:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk <
> 79000;
> explain cbo joincost select count(*) from customer where c_customer_sk <
> 160500;{code}
> Both give the same estimate for the rowcount of the filter {{{}rowcount =
> 29696.000000000004{}}}.
> {*}Check the estimates{*}:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk
> between 79000 and 160500; {code}
> returns a plan
> {code:java}
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0
> cpu, 0.0 io}, id = 17183 |
> | HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative
> cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17181 |
> | HiveFilter(condition=[BETWEEN(false, $0, 79000:BIGINT,
> 160500:BIGINT)]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0
> io}, id = 17180 |
> | HiveTableScan(table=[[default, customer]], table:alias=[customer]):
> rowcount = 8.0E7, cumulative cost = {0}, id = 17134 | {code}
> Please note the estimation {{rowcount = 1.0}} of the HiveFilter. The same
> happens for range predicates with comparison operators:
> {code:java}
> explain cbo joincost select count(*) from customer where c_customer_sk >=
> 79000 and c_customer_sk <= 160500; {code}
> returns a plan
> {code:java}
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0
> cpu, 0.0 io}, id = 17088 |
> | HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative
> cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 17086 |
> | HiveFilter(condition=[BETWEEN(false, $0, 79000:BIGINT,
> 160500:BIGINT)]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0
> io}, id = 17085 |
> | HiveTableScan(table=[[default, customer]], table:alias=[customer]):
> rowcount = 8.0E7, cumulative cost = {0}, id = 17039 | {code}
> *Compare with the expected result*
> Executing the SELECT query using Trino DB gives the following result:
> {code:java}
> use tpcds.sf30000;
> trino:sf30000> select count(*) from customer where c_customer_sk between
> 79000 and 160500;
> _col0
> -------
> 81501 {code}
> So the estimates should be around {{rowcount = 81501.0}}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)