Thomas Rebele created HIVE-29366:
------------------------------------
Summary: Use histogram statistics to improve the estimates of the
equality predicates
Key: HIVE-29366
URL: https://issues.apache.org/jira/browse/HIVE-29366
Project: Hive
Issue Type: Improvement
Reporter: Thomas Rebele
The histogram statistics could be used to improve the estimates for equality
predicates.
Example on a metastore dump of a TPC-DS 30TB cluster with histograms (tested on
commit ca105f8124072d19d88a83b2ced613d326c9a26b from 2025-12-04):
{code:sql}
explain cbo joincost select count(*) from item where i_current_price = 0.09;
explain cbo joincost select count(*) from item where i_current_price = 99.99;
{code}
Both queries estimate the filter at {{rowcount = 49.196038760515385}}. The
estimate is the same without histograms.
With histogram statistics, {{i_current_price >= 99.99}} is estimated at
{{rowcount = 1.0000000000136329}} and {{i_current_price <= 0.09}} at
{{rowcount = 2592.0}}. The latter estimate could itself be improved (see
HIVE-29365), as {{i_current_price <= 0.12}} yields the same
{{rowcount = 2592.0}}. Dividing 2592 by 4 (for the four values 0.09, 0.10,
0.11 and 0.12 in that range) gives 648, which is quite close to the real
value of 587.
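The arithmetic above can be written out as a small sketch. It is only an illustration of the idea, not Hive's implementation: the helper names are hypothetical, and it assumes that prices are spaced 0.01 apart and that the rows of a range are spread evenly over its distinct values.

```python
from decimal import Decimal


def distinct_values_in_range(low, high, step):
    """Count the values low, low + step, ..., high (inclusive).

    Decimal avoids float rounding issues with values such as 0.09 and 0.12.
    """
    low, high, step = Decimal(low), Decimal(high), Decimal(step)
    return int((high - low) / step) + 1


def equality_estimate(range_rowcount, distinct_values):
    """Spread a range rowcount evenly over the distinct values it covers."""
    return range_rowcount / distinct_values


# Assumption: a price step of 0.01, so the range holds 0.09, 0.10, 0.11, 0.12.
n = distinct_values_in_range("0.09", "0.12", "0.01")

# 2592.0 is the rowcount reported for i_current_price <= 0.12.
estimate = equality_estimate(2592.0, n)
print(n, estimate)
```

A real implementation would need the number of distinct values inside the range from the column statistics rather than from an assumed step size.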
The ground truth:
{code:java}
trino:sf30000> select count(*) from item where i_current_price = 0.09;
_col0
-------
587
trino:sf30000> select count(*) from item where i_current_price = 99.99;
_col0
-------
3
{code}
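To put the numbers side by side, the following sketch compares the current equality estimate and the histogram-based figures quoted above against the Trino counts. The q-error metric (the factor by which an estimate is off) is my choice for the comparison and is not something Hive reports. Using the {{>= 99.99}} range estimate as a stand-in for the equality estimate is likewise an assumption.

```python
def q_error(estimate, actual):
    """Factor by which the estimate is off; 1.0 means a perfect estimate."""
    return max(estimate / actual, actual / estimate)


# Estimate Hive currently produces for both equality predicates.
current = 49.196038760515385

# predicate -> (histogram-based figure from this issue, actual count from Trino)
cases = {
    "i_current_price = 0.09": (2592.0 / 4, 587),
    "i_current_price = 99.99": (1.0000000000136329, 3),
}

for predicate, (hist_estimate, actual) in cases.items():
    print(
        predicate,
        "current:", round(q_error(current, actual), 2),
        "histogram-based:", round(q_error(hist_estimate, actual), 2),
    )
```

For 0.09 the current estimate is off by roughly a factor of 12, while the histogram-based figure is within about 10% of the real count.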