[jira] [Updated] (HIVE-29334) Selectivity estimates from histograms are unstable for columns in large tables

Thomas Rebele (Jira) Wed, 26 Nov 2025 02:01:51 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-29334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Thomas Rebele updated HIVE-29334:
---------------------------------
    Affects Version/s: 4.2.0

> Selectivity estimates from histograms are unstable for columns in large tables
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-29334
>                 URL: https://issues.apache.org/jira/browse/HIVE-29334
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.2.0
>            Reporter: Thomas Rebele
>            Assignee: Thomas Rebele
>            Priority: Major
>
> Executing a query as simple as 
> {code:java}
> explain cbo joincost select count(*) from catalog_returns where 
> cr_return_amount > 100;
> {code}
> on TPC-DS 30TB with histograms showed an unstable rowcount estimation for the 
> HiveFilter:
> {code:java}
> 0: jdbc:hive2://localhost:10002> explain cbo joincost select count(*) from 
> catalog_returns where cr_return_amount > 100;
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | CBO PLAN:                                          |
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = \{0.0 rows, 0.0 
> cpu, 0.0 io}, id = 90 |
> |   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative 
> cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 88 |
> |     HiveFilter(condition=[>($17, 100:DECIMAL(3, 0))]): rowcount = 
> 3.363572998E9, cumulative cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 87 |
> |       HiveTableScan(table=[[default, catalog_returns]], 
> table:alias=[catalog_returns]): rowcount = 4.320980099E9, cumulative cost = 
> \{0}, id = 43 |
> +----------------------------------------------------+
> 0: jdbc:hive2://localhost:10002> explain cbo joincost select count(*) from 
> catalog_returns where cr_return_amount > 100;
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | CBO PLAN:                                          |
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = \{0.0 rows, 0.0 
> cpu, 0.0 io}, id = 181 |
> |   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative 
> cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 179 |
> |     HiveFilter(condition=[>($17, 100:DECIMAL(3, 0))]): rowcount = 
> 3.365670118E9, cumulative cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 178 |
> |       HiveTableScan(table=[[default, catalog_returns]], 
> table:alias=[catalog_returns]): rowcount = 4.320980099E9, cumulative cost = 
> \{0}, id = 134 |
> +----------------------------------------------------+
> 0: jdbc:hive2://localhost:10002> explain cbo joincost select count(*) from 
> catalog_returns where cr_return_amount > 100;
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | CBO PLAN:                                          |
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = \{0.0 rows, 0.0 
> cpu, 0.0 io}, id = 272 |
> |   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative 
> cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 270 |
> |     HiveFilter(condition=[>($17, 100:DECIMAL(3, 0))]): rowcount = 
> 3.353988358E9, cumulative cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 269 |
> |       HiveTableScan(table=[[default, catalog_returns]], 
> table:alias=[catalog_returns]): rowcount = 4.320980099E9, cumulative cost = 
> \{0}, id = 225 |
> |                                                    |
> +----------------------------------------------------+
> {code}
> I've debugged a bit and followed the route of the rowcount estimate:
>  * RelMdUtil#estimateFilteredRows(RelNode, RexNode, RelMetadataQuery)
>  * HiveRelMdSelectivity#getSelectivity(HiveTableScan, RelMetadataQuery, 
> RexNode)
>  * FilterSelectivityEstimator#computeRangePredicateSelectivity
>  * ColumnStatsAggregator#mergeHistograms
> The problem is reproducible with a simpler unit test. I could reduce it using 
> just classes of Apache DataSketches.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HIVE-29334) Selectivity estimates from histograms are unstable for columns in large tables

Reply via email to