[
https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=830981&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830981
]
ASF GitHub Bot logged work on HIVE-26221:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 05/Dec/22 10:14
Start Date: 05/Dec/22 10:14
Worklog Time Spent: 10m
Work Description: asolimando commented on code in PR #3137:
URL: https://github.com/apache/hive/pull/3137#discussion_r1039397581
##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/LongColumnStatsAggregator.java:
##########
@@ -51,73 +52,94 @@ public ColumnStatisticsObj aggregate(List<ColStatsObjWithSourceInfo> colStatsWit
     checkStatisticsList(colStatsWithSourceInfo);
     ColumnStatisticsObj statsObj = null;
-    String colType = null;
+    String colType;
     String colName = null;
     // check if all the ColumnStatisticsObjs contain stats and all the ndv are bitvectors
     boolean doAllPartitionContainStats = partNames.size() == colStatsWithSourceInfo.size();
     NumDistinctValueEstimator ndvEstimator = null;
+    KllHistogramEstimator histogramEstimator = null;
+    boolean areAllNDVEstimatorsMergeable = true;
+    boolean areAllHistogramEstimatorsMergeable = true;
     for (ColStatsObjWithSourceInfo csp : colStatsWithSourceInfo) {
       ColumnStatisticsObj cso = csp.getColStatsObj();
       if (statsObj == null) {
         colName = cso.getColName();
         colType = cso.getColType();
         statsObj = ColumnStatsAggregatorFactory.newColumnStaticsObj(colName, colType,
             cso.getStatsData().getSetField());
-        LOG.trace("doAllPartitionContainStats for column: {} is: {}", colName,
-            doAllPartitionContainStats);
+        LOG.trace("doAllPartitionContainStats for column: {} is: {}", colName, doAllPartitionContainStats);
       }
-      LongColumnStatsDataInspector longColumnStatsData = longInspectorFromStats(cso);
-      if (longColumnStatsData.getNdvEstimator() == null) {
-        ndvEstimator = null;
-        break;
-      } else {
-        // check if all of the bit vectors can merge
-        NumDistinctValueEstimator estimator = longColumnStatsData.getNdvEstimator();
+      LongColumnStatsDataInspector columnStatsData = longInspectorFromStats(cso);
+
+      // check if we can merge NDV estimators
+      if (columnStatsData.getNdvEstimator() == null) {
+        areAllNDVEstimatorsMergeable = false;
+      } else if (areAllNDVEstimatorsMergeable) {
+        NumDistinctValueEstimator estimator = columnStatsData.getNdvEstimator();
         if (ndvEstimator == null) {
           ndvEstimator = estimator;
         } else {
-          if (ndvEstimator.canMerge(estimator)) {
-            continue;
-          } else {
-            ndvEstimator = null;
-            break;
+          if (!ndvEstimator.canMerge(estimator)) {
+            areAllNDVEstimatorsMergeable = false;
+          }
+        }
+      }
+      // check if we can merge histogram estimators
+      if (columnStatsData.getHistogramEstimator() == null) {
Review Comment:
You are right, I have double-checked and indeed it can be simplified as you
suggest: I have added the four lines you cited right before the `return`
statement of the `aggregate` method, and that is enough.
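For readers skimming the diff above, here is a minimal, self-contained sketch of the mergeability check it introduces. The Estimator interface and the sample values are hypothetical stand-ins for Hive's NumDistinctValueEstimator and KllHistogramEstimator; only the control-flow pattern (a flag that flips to false on the first missing or incompatible estimator) is illustrated, not the actual metastore code.

import java.util.Arrays;
import java.util.List;

public class MergeabilityCheckSketch {

  // Hypothetical estimator abstraction; Hive's real estimators expose a similar canMerge().
  interface Estimator {
    boolean canMerge(Estimator other);
  }

  // True only if every partition has an estimator and all of them can be merged with the
  // first one encountered, mirroring the areAll*EstimatorsMergeable flags in the diff above.
  static boolean areAllEstimatorsMergeable(List<Estimator> perPartitionEstimators) {
    Estimator reference = null;
    for (Estimator e : perPartitionEstimators) {
      if (e == null) {
        return false;               // a partition without an estimator breaks mergeability
      }
      if (reference == null) {
        reference = e;              // first non-null estimator becomes the reference
      } else if (!reference.canMerge(e)) {
        return false;               // incompatible estimators (e.g. different sketch parameters)
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Estimator compatible = other -> true;   // toy estimator that merges with anything
    System.out.println(areAllEstimatorsMergeable(Arrays.asList(compatible, compatible))); // true
    System.out.println(areAllEstimatorsMergeable(Arrays.asList(compatible, null)));       // false
  }
}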
Issue Time Tracking
-------------------
Worklog Id: (was: 830981)
Time Spent: 3.5h (was: 3h 20m)
> Add histogram-based column statistics
> -------------------------------------
>
> Key: HIVE-26221
> URL: https://issues.apache.org/jira/browse/HIVE-26221
> Project: Hive
> Issue Type: Improvement
> Components: CBO, Metastore, Statistics
> Affects Versions: 4.0.0-alpha-2
> Reporter: Alessandro Solimando
> Assignee: Alessandro Solimando
> Priority: Major
> Labels: pull-request-available
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> Hive does not support histogram statistics, which are particularly useful for
> skewed data (very common in practice) and for range predicates.
> Hive's current selectivity estimation for range predicates is based on a
> hard-coded value of 1/3 (see
> [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims at integrating histograms as an additional column
> statistic, stored in the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
> * efficiency: the approach must scale and support billions of rows
> * merge-ability: partition-level histograms have to be merged to form
> table-level histograms
> * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates the [KLL data
> sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF.
> Datasketches are small, stateful programs that process massive data-streams
> and can provide approximate answers, with mathematical guarantees, to
> computationally difficult queries orders-of-magnitude faster than
> traditional, exact methods.
> We propose to use KLL, and more specifically the cumulative distribution
> function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric
> families) and temporal data types (date and timestamp).
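To make the CDF proposal above concrete, here is a minimal, hypothetical sketch using the Apache DataSketches Java library (org.apache.datasketches:datasketches-java, 3.x or later). The partition values and split points are made up, and this is not the metastore code from the PR; it only illustrates building per-partition KLL sketches, merging them into a table-level sketch (the merge-ability requirement), and reading range selectivities from the CDF instead of a hard-coded 1/3.

import org.apache.datasketches.kll.KllFloatsSketch;

public class KllHistogramSketchExample {
  public static void main(String[] args) {
    // One KLL sketch per partition (two toy partitions with made-up values).
    KllFloatsSketch part1 = KllFloatsSketch.newHeapInstance();
    KllFloatsSketch part2 = KllFloatsSketch.newHeapInstance();
    for (int i = 0; i < 1_000; i++) {
      part1.update(i);            // roughly uniform values in partition 1
      part2.update(i % 100);      // skewed values in partition 2
    }

    // Merge-ability: partition-level sketches combine into a table-level sketch.
    KllFloatsSketch table = KllFloatsSketch.newHeapInstance();
    table.merge(part1);
    table.merge(part2);

    // The CDF gives the approximate fraction of values up to each split point,
    // which is what a range-predicate selectivity estimator needs.
    float[] splitPoints = {100f, 500f, 900f};
    double[] cdf = table.getCDF(splitPoints);
    System.out.printf("approx. fraction of values up to 100, 500, 900: %.3f %.3f %.3f%n",
        cdf[0], cdf[1], cdf[2]);
  }
}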
--
This message was sent by Atlassian Jira
(v8.20.10#820010)