[
https://issues.apache.org/jira/browse/HIVE-29334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18039918#comment-18039918
]
Thomas Rebele commented on HIVE-29334:
--------------------------------------
I've created a ticket for the Java Core repository of Apache DataSketches.
> ColumnStatsAggregator#mergeHistograms
> -------------------------------------
>
> Key: HIVE-29334
> URL: https://issues.apache.org/jira/browse/HIVE-29334
> Project: Hive
> Issue Type: Bug
> Reporter: Thomas Rebele
> Assignee: Thomas Rebele
> Priority: Major
>
> Executing a query as simple as
> {code}
> explain cbo joincost select count(*) from catalog_returns where
> cr_return_amount > 100;
> {code}
> on TPC-DS 30TB with histograms showed an unstable rowcount estimation for the
> HiveFilter:
> {code}
> 0: jdbc:hive2://localhost:10002> explain cbo joincost select count(*) from
> catalog_returns where cr_return_amount > 100;
> +----------------------------------------------------+
> | Explain |
> +----------------------------------------------------+
> | CBO PLAN: |
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = \{0.0 rows, 0.0
> cpu, 0.0 io}, id = 90 |
> | HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative
> cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 88 |
> | HiveFilter(condition=[>($17, 100:DECIMAL(3, 0))]): rowcount =
> 3.363572998E9, cumulative cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 87 |
> | HiveTableScan(table=[[default, catalog_returns]],
> table:alias=[catalog_returns]): rowcount = 4.320980099E9, cumulative cost =
> \{0}, id = 43 |
> +----------------------------------------------------+
> 0: jdbc:hive2://localhost:10002> explain cbo joincost select count(*) from
> catalog_returns where cr_return_amount > 100;
> +----------------------------------------------------+
> | Explain |
> +----------------------------------------------------+
> | CBO PLAN: |
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = \{0.0 rows, 0.0
> cpu, 0.0 io}, id = 181 |
> | HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative
> cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 179 |
> | HiveFilter(condition=[>($17, 100:DECIMAL(3, 0))]): rowcount =
> 3.365670118E9, cumulative cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 178 |
> | HiveTableScan(table=[[default, catalog_returns]],
> table:alias=[catalog_returns]): rowcount = 4.320980099E9, cumulative cost =
> \{0}, id = 134 |
> +----------------------------------------------------+
> 0: jdbc:hive2://localhost:10002> explain cbo joincost select count(*) from
> catalog_returns where cr_return_amount > 100;
> +----------------------------------------------------+
> | Explain |
> +----------------------------------------------------+
> | CBO PLAN: |
> | HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = \{0.0 rows, 0.0
> cpu, 0.0 io}, id = 272 |
> | HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative
> cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 270 |
> | HiveFilter(condition=[>($17, 100:DECIMAL(3, 0))]): rowcount =
> 3.353988358E9, cumulative cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 269 |
> | HiveTableScan(table=[[default, catalog_returns]],
> table:alias=[catalog_returns]): rowcount = 4.320980099E9, cumulative cost =
> \{0}, id = 225 |
> | |
> +----------------------------------------------------+
> {code}
> I've debugged a bit and followed the route of the rowcount estimate:
> * RelMdUtil#estimateFilteredRows(RelNode, RexNode, RelMetadataQuery)
> * HiveRelMdSelectivity#getSelectivity(HiveTableScan, RelMetadataQuery,
> RexNode)
> * FilterSelectivityEstimator#computeRangePredicateSelectivity
> * ColumnStatsAggregator#mergeHistograms
> The problem is reproducible with a simpler unit test. I could reduce it to
> using just classes of Apache DataSketches.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)