Thomas Rebele created HIVE-29334:
------------------------------------

             Summary: ColumnStatsAggregator#mergeHistograms
                 Key: HIVE-29334
                 URL: https://issues.apache.org/jira/browse/HIVE-29334
             Project: Hive
          Issue Type: Bug
            Reporter: Thomas Rebele
            Assignee: Thomas Rebele


Executing a query as simple as 

explain cbo joincost select count(*) from catalog_returns where 
cr_return_amount > 100;

on TPC-DS 30TB with histograms showed an unstable rowcount estimation for the 
HiveFilter:

{code}
0: jdbc:hive2://localhost:10002> explain cbo joincost select count(*) from 
catalog_returns where cr_return_amount > 100;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| CBO PLAN:                                          |
| HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = \{0.0 rows, 0.0 cpu, 
0.0 io}, id = 90 |
|   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative cost 
= \{0.0 rows, 0.0 cpu, 0.0 io}, id = 88 |
|     HiveFilter(condition=[>($17, 100:DECIMAL(3, 0))]): rowcount = 
3.363572998E9, cumulative cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 87 |
|       HiveTableScan(table=[[default, catalog_returns]], 
table:alias=[catalog_returns]): rowcount = 4.320980099E9, cumulative cost = 
\{0}, id = 43 |
+----------------------------------------------------+

0: jdbc:hive2://localhost:10002> explain cbo joincost select count(*) from 
catalog_returns where cr_return_amount > 100;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| CBO PLAN:                                          |
| HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = \{0.0 rows, 0.0 cpu, 
0.0 io}, id = 181 |
|   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative cost 
= \{0.0 rows, 0.0 cpu, 0.0 io}, id = 179 |
|     HiveFilter(condition=[>($17, 100:DECIMAL(3, 0))]): rowcount = 
3.365670118E9, cumulative cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 178 |
|       HiveTableScan(table=[[default, catalog_returns]], 
table:alias=[catalog_returns]): rowcount = 4.320980099E9, cumulative cost = 
\{0}, id = 134 |
+----------------------------------------------------+

0: jdbc:hive2://localhost:10002> explain cbo joincost select count(*) from 
catalog_returns where cr_return_amount > 100;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| CBO PLAN:                                          |
| HiveProject(_c0=[$0]): rowcount = 1.0, cumulative cost = \{0.0 rows, 0.0 cpu, 
0.0 io}, id = 272 |
|   HiveAggregate(group=[{}], agg#0=[count()]): rowcount = 1.0, cumulative cost 
= \{0.0 rows, 0.0 cpu, 0.0 io}, id = 270 |
|     HiveFilter(condition=[>($17, 100:DECIMAL(3, 0))]): rowcount = 
3.353988358E9, cumulative cost = \{0.0 rows, 0.0 cpu, 0.0 io}, id = 269 |
|       HiveTableScan(table=[[default, catalog_returns]], 
table:alias=[catalog_returns]): rowcount = 4.320980099E9, cumulative cost = 
\{0}, id = 225 |
|                                                    |
+----------------------------------------------------+
{code}

I've debugged a bit and followed the route of the rowcount estimate:

* RelMdUtil#estimateFilteredRows(RelNode, RexNode, RelMetadataQuery)
* HiveRelMdSelectivity#getSelectivity(HiveTableScan, RelMetadataQuery, RexNode)
* FilterSelectivityEstimator#computeRangePredicateSelectivity
* ColumnStatsAggregator#mergeHistograms

The problem is reproducible with a simpler unit test. I could reduce it to 
using just classes of Apache DataSketches.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to