[jira] [Commented] (HIVE-29334) ColumnStatsAggregator#mergeHistograms

Thomas Rebele (Jira) Fri, 21 Nov 2025 08:48:38 -0800


    [ 
https://issues.apache.org/jira/browse/HIVE-29334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18039918#comment-18039918
 ]


Thomas Rebele commented on HIVE-29334:
--------------------------------------

I've created a ticket for the Java Core repository of Apache DataSketches.

> ColumnStatsAggregator#mergeHistograms
> -------------------------------------
>
>                 Key: HIVE-29334
>                 URL: https://issues.apache.org/jira/browse/HIVE-29334
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Thomas Rebele
>            Assignee: Thomas Rebele
>            Priority: Major
>
> Executing a query as simple as 
> {code}
> explain cbo joincost select count(*) from catalog_returns where 
> cr_return_amount > 100;
> {code}
> on TPC-DS 30TB with histograms showed an unstable rowcount estimation for the 
> HiveFilter:
> {code}
> 0: jdbc:hive2://localhost:10002> explain cbo joincost select count(*) from 
> catalog_returns where cr_return_amount > 100;
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | CBO PLAN:                                          |
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = \{0.0 rows, 0.0 
> cpu, 0.0 io}, id = 90 |
> |   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative 
> cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 88 |
> |     HiveFilter(condition=[>($17, 100:DECIMAL(3, 0))]): rowcount = 
> 3.363572998E9, cumulative cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 87 |
> |       HiveTableScan(table=[[default, catalog_returns]], 
> table:alias=[catalog_returns]): rowcount = 4.320980099E9, cumulative cost = 
> \{0}, id = 43 |
> +----------------------------------------------------+
> 0: jdbc:hive2://localhost:10002> explain cbo joincost select count(*) from 
> catalog_returns where cr_return_amount > 100;
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | CBO PLAN:                                          |
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = \{0.0 rows, 0.0 
> cpu, 0.0 io}, id = 181 |
> |   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative 
> cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 179 |
> |     HiveFilter(condition=[>($17, 100:DECIMAL(3, 0))]): rowcount = 
> 3.365670118E9, cumulative cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 178 |
> |       HiveTableScan(table=[[default, catalog_returns]], 
> table:alias=[catalog_returns]): rowcount = 4.320980099E9, cumulative cost = 
> \{0}, id = 134 |
> +----------------------------------------------------+
> 0: jdbc:hive2://localhost:10002> explain cbo joincost select count(*) from 
> catalog_returns where cr_return_amount > 100;
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | CBO PLAN:                                          |
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = \{0.0 rows, 0.0 
> cpu, 0.0 io}, id = 272 |
> |   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative 
> cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 270 |
> |     HiveFilter(condition=[>($17, 100:DECIMAL(3, 0))]): rowcount = 
> 3.353988358E9, cumulative cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 269 |
> |       HiveTableScan(table=[[default, catalog_returns]], 
> table:alias=[catalog_returns]): rowcount = 4.320980099E9, cumulative cost = 
> \{0}, id = 225 |
> |                                                    |
> +----------------------------------------------------+
> {code}
> I've debugged a bit and followed the route of the rowcount estimate:
> * RelMdUtil#estimateFilteredRows(RelNode, RexNode, RelMetadataQuery)
> * HiveRelMdSelectivity#getSelectivity(HiveTableScan, RelMetadataQuery, 
> RexNode)
> * FilterSelectivityEstimator#computeRangePredicateSelectivity
> * ColumnStatsAggregator#mergeHistograms
> The problem is reproducible with a simpler unit test. I could reduce it to 
> using just classes of Apache DataSketches.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (HIVE-29334) ColumnStatsAggregator#mergeHistograms

Reply via email to