Thomas Rebele created HIVE-29334:
------------------------------------
Summary: ColumnStatsAggregator#mergeHistograms
Key: HIVE-29334
URL: https://issues.apache.org/jira/browse/HIVE-29334
Project: Hive
Issue Type: Bug
Reporter: Thomas Rebele
Assignee: Thomas Rebele
Executing a query as simple as
explain cbo joincost select count(*) from catalog_returns where
cr_return_amount > 100;
on TPC-DS 30TB with histograms showed an unstable rowcount estimation for the
HiveFilter:
{code}
0: jdbc:hive2://localhost:10002> explain cbo joincost select count(*) from
catalog_returns where cr_return_amount > 100;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| CBO PLAN: |
| HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = \{0.0 rows, 0.0 cpu,
0.0 io}, id = 90 |
| HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative cost
= \{0.0 rows, 0.0 cpu, 0.0 io}, id = 88 |
| HiveFilter(condition=[>($17, 100:DECIMAL(3, 0))]): rowcount =
3.363572998E9, cumulative cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 87 |
| HiveTableScan(table=[[default, catalog_returns]],
table:alias=[catalog_returns]): rowcount = 4.320980099E9, cumulative cost =
\{0}, id = 43 |
+----------------------------------------------------+
0: jdbc:hive2://localhost:10002> explain cbo joincost select count(*) from
catalog_returns where cr_return_amount > 100;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| CBO PLAN: |
| HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = \{0.0 rows, 0.0 cpu,
0.0 io}, id = 181 |
| HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative cost
= \{0.0 rows, 0.0 cpu, 0.0 io}, id = 179 |
| HiveFilter(condition=[>($17, 100:DECIMAL(3, 0))]): rowcount =
3.365670118E9, cumulative cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 178 |
| HiveTableScan(table=[[default, catalog_returns]],
table:alias=[catalog_returns]): rowcount = 4.320980099E9, cumulative cost =
\{0}, id = 134 |
+----------------------------------------------------+
0: jdbc:hive2://localhost:10002> explain cbo joincost select count(*) from
catalog_returns where cr_return_amount > 100;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| CBO PLAN: |
| HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = \{0.0 rows, 0.0 cpu,
0.0 io}, id = 272 |
| HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative cost
= \{0.0 rows, 0.0 cpu, 0.0 io}, id = 270 |
| HiveFilter(condition=[>($17, 100:DECIMAL(3, 0))]): rowcount =
3.353988358E9, cumulative cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 269 |
| HiveTableScan(table=[[default, catalog_returns]],
table:alias=[catalog_returns]): rowcount = 4.320980099E9, cumulative cost =
\{0}, id = 225 |
| |
+----------------------------------------------------+
{code}
I've debugged a bit and followed the route of the rowcount estimate:
* RelMdUtil#estimateFilteredRows(RelNode, RexNode, RelMetadataQuery)
* HiveRelMdSelectivity#getSelectivity(HiveTableScan, RelMetadataQuery, RexNode)
* FilterSelectivityEstimator#computeRangePredicateSelectivity
* ColumnStatsAggregator#mergeHistograms
The problem is reproducible with a simpler unit test. I could reduce it to
using just classes of Apache DataSketches.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)