[
https://issues.apache.org/jira/browse/HIVE-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alessandro Solimando updated HIVE-26221:
----------------------------------------
Description:
Hive does not support histogram statistics, which are particularly useful for
skewed data (which is very common in practice) and range predicates.
Hive's current selectivity estimation for range predicates is based on a
hard-coded value of 1/3 (see
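[FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/4622860b8c7dbddaf4c556e65c5039c60da15e82/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).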
The current proposal aims at integrating histograms as an additional column
statistic, stored in the Hive metastore at the table (or partition) level.
The main requirements for histogram integration are the following:
* efficiency: the approach must scale and support billions of rows
* merge-ability: partition-level histograms have to be merged to form
table-level histograms
* explicit and configurable trade-off between memory footprint and accuracy
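To make the merge-ability requirement concrete, here is a minimal sketch in which a plain equi-width histogram stands in for the actual statistic. The function names and bucket boundaries are hypothetical and are not part of Hive; the only point is that per-partition summaries built over the same boundaries can be combined without rescanning the data.

```python
from bisect import bisect_right


def build_histogram(values, boundaries):
    """Count values per bucket; bucket i covers [boundaries[i], boundaries[i + 1])."""
    counts = [0] * (len(boundaries) - 1)
    for v in values:
        # Locate the bucket, clamping out-of-range values into the edge buckets.
        i = min(max(bisect_right(boundaries, v) - 1, 0), len(counts) - 1)
        counts[i] += 1
    return counts


def merge_histograms(a, b):
    """Merge two histograms built over the same boundaries by adding counts."""
    return [x + y for x, y in zip(a, b)]


# Hypothetical boundaries and partition contents, for illustration only.
boundaries = [0, 25, 50, 75, 100]
partition_1 = [1, 2, 3, 30, 60]
partition_2 = [5, 10, 90, 99]

merged = merge_histograms(
    build_histogram(partition_1, boundaries),
    build_histogram(partition_2, boundaries),
)
table_level = build_histogram(partition_1 + partition_2, boundaries)
```

In this sketch the merged partition-level counts equal the histogram computed directly over all rows. The number of buckets plays the role of the memory/accuracy knob: more buckets cost more space and give finer estimates.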
Hive already integrates the [KLL data
sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF.
Datasketches are small, stateful programs that process massive data-streams and
can provide approximate answers, with mathematical guarantees, to
computationally difficult queries orders-of-magnitude faster than traditional,
exact methods.
We propose to use KLL, and more specifically its cumulative distribution
function (CDF), as the underlying data structure for our histogram statistics.
The current proposal only targets numeric data types (float, integer and
numeric families), excluding string and temporal data types for the moment.
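As a rough illustration of why a CDF helps with range predicates, the sketch below compares the fixed 1/3 guess with an estimate derived from a CDF. It uses an exact empirical CDF over a sorted list as a stand-in for the approximate ranks a KLL sketch would return; the function names and the sample column are assumptions made for this example, not Hive or DataSketches APIs.

```python
from bisect import bisect_left, bisect_right

# The hard-coded range-predicate selectivity mentioned above.
DEFAULT_RANGE_SELECTIVITY = 1 / 3


def cdf(sorted_values, x):
    """Fraction of values <= x (exact here; a sketch would approximate it)."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def range_selectivity(sorted_values, low, high):
    """Estimated fraction of rows satisfying low <= col <= high."""
    in_range = bisect_right(sorted_values, high) - bisect_left(sorted_values, low)
    return in_range / len(sorted_values)


# Hypothetical skewed column: 90 rows equal to 1, 10 rows spread from 100 to 1000.
column = sorted([1] * 90 + list(range(100, 1100, 100)))

# Predicate: col BETWEEN 500 AND 1000
estimated = range_selectivity(column, 500, 1000)
```

On this skewed column only 6 of the 100 rows fall in the range, so the CDF-based estimate is 0.06, while the fixed guess would assume about 33 rows.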
was:
Hive does not support histogram statistics, which are particularly useful for
skewed data (which is very common in practice) and range predicates.
Hive's current selectivity estimation for range predicates is based on a
hard-coded value of 1/3 (see
[FilterSelectivityEstimator.java#L138-L144|[https://github.com/apache/hive/blob/4622860b8c7dbddaf4c556e65c5039c60da15e82/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).]
The current proposal aims at integrating histogram as an additional column
statistics, stored into the Hive metastore at the table (or partition) level.
The main requirements for histogram integration are the following:
* efficiency: the approach must scale and support billions of rows
* merge-ability: partition-level histograms have to be merged to form
table-level histograms
* explicit and configurable trade-off between memory footprint and accuracy
Hive already integrates [KLL data
sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF.
Datasketches are small, stateful programs that process massive data-streams and
can provide approximate answers, with mathematical guarantees, to
computationally difficult queries orders-of-magnitude faster than traditional,
exact methods.
We propose to use KLL, and more specifically the cumulative distribution
function (CDF) as underlying data structure for our histogram statistics.
The current proposal only targets numeric data types (float, integer and
numeric families), excluding string and temporal data types for the moment.
> Add histogram-based column statistics
> -------------------------------------
>
> Key: HIVE-26221
> URL: https://issues.apache.org/jira/browse/HIVE-26221
> Project: Hive
> Issue Type: Improvement
> Components: CBO, Metastore, Statistics
> Affects Versions: 4.0.0-alpha-2
> Reporter: Alessandro Solimando
> Assignee: Alessandro Solimando
> Priority: Major
>
> Hive does not support histogram statistics, which are particularly useful for
> skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a
> hard-coded value of 1/3 (see
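> [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/4622860b8c7dbddaf4c556e65c5039c60da15e82/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).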
> The current proposal aims at integrating histograms as an additional column
> statistic, stored in the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
> * efficiency: the approach must scale and support billions of rows
> * merge-ability: partition-level histograms have to be merged to form
> table-level histograms
> * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates the [KLL data
> sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF.
> Datasketches are small, stateful programs that process massive data-streams
> and can provide approximate answers, with mathematical guarantees, to
> computationally difficult queries orders-of-magnitude faster than
> traditional, exact methods.
> We propose to use KLL, and more specifically its cumulative distribution
> function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal only targets numeric data types (float, integer and
> numeric families), excluding string and temporal data types for the moment.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)