[
https://issues.apache.org/jira/browse/HIVE-29332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Denys Kuzmenko updated HIVE-29332:
----------------------------------
Affects Version/s: 4.2.0
(was: 4.3.0)
> Statistics calculations incorrectly use 0 as min/max value for primitive
> numeric column types when not set
> ----------------------------------------------------------------------------------------------------------
>
> Key: HIVE-29332
> URL: https://issues.apache.org/jira/browse/HIVE-29332
> Project: Hive
> Issue Type: Bug
> Components: CBO
> Affects Versions: 4.2.0
> Environment: Using the master branch (commit 47a1973), the file
> [^hive_unset_numeric_range_bug.q] test query file generates the following
> output: [^hive_unset_numeric_range_bug.q.out]
> {{As you can see, the DESCRIBE FORMATTED statement shows
> min 0
> max 0
> for multiple columns}}
> The results of EXPLAIN EXTENDED confirm the problem: the estimated number
> of rows is 1 in multiple cases:
> {{GatherStats: false
> Filter Operator
> isSamplingPred: false
> predicate: col_int BETWEEN 100 AND 500 (type: boolean)
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
> Column stats: COMPLETE
> Select Operator
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
> Column stats: COMPLETE
> Group By Operator
> aggregations: count()
> minReductionHashAggr: 0.99
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: COMPLETE
> Reduce Output Operator
> bucketingVersion: 2
> null sort order:
> numBuckets: -1
> sort order:
> Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: COMPLETE}}
> Reporter: Konstantin Bereznyakov
> Assignee: Konstantin Bereznyakov
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.3.0
>
> Attachments: HIVE-29322 PR 6208.patch.txt,
> hive_unset_numeric_range_bug.q, hive_unset_numeric_range_bug.q.out
>
>
> The StatsUtils.getColStatistics() code (see, for example, the line
> [https://github.com/apache/hive/blob/47a1973e13799a62ea6b2da094eb6d2322ca7467/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L805])
> builds a new Range object from the getLowValue()/getHighValue() values even
> when the underlying Stats class instance does not have them assigned. This is
> typically not a problem for column types that correspond to Java classes;
> however, the int/float/double datatypes correspond to Java primitives in the
> Thrift classes, and the default values for these are 0 or 0.0 if not set
> (i.e., when min/max stats are unavailable).
> Most code paths for other data types have more advanced logic for min/max
> values for the column Range.
> The problem with defaulting to 0 for primitive numeric types is that the
> number of rows is severely underestimated when min/max values are unavailable.
> This underestimation can hurt query performance because the resulting plan
> uses too little parallelism.
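
To make the failure mode described above concrete, here is a minimal, self-contained Java sketch. It is not Hive code and not the actual fix: `LongStats`, `Range`, `naiveRange`, and `guardedRange` are hypothetical names chosen for illustration. `LongStats` only mimics the shape assumed for a Thrift-generated class, in which an optional primitive field is a plain `long` with a separate "is set" flag. The sketch shows how reading an unassigned primitive yields 0 and how checking the flag first avoids reporting a spurious [0, 0] range.

```java
import java.util.Optional;

public class UnsetRangeSketch {

    /** Hypothetical stand-in for a Thrift-generated stats class (not Hive's). */
    static final class LongStats {
        private long lowValue;        // Java default for an unassigned long is 0
        private long highValue;       // same: 0 until a setter is called
        private boolean lowValueSet;  // tracks whether a value was ever assigned
        private boolean highValueSet;

        void setLowValue(long v)  { lowValue = v;  lowValueSet = true; }
        void setHighValue(long v) { highValue = v; highValueSet = true; }
        long getLowValue()        { return lowValue; }
        long getHighValue()       { return highValue; }
        boolean isSetLowValue()   { return lowValueSet; }
        boolean isSetHighValue()  { return highValueSet; }
    }

    /** Hypothetical min/max holder used only by this sketch. */
    static final class Range {
        final long min;
        final long max;
        Range(long min, long max) { this.min = min; this.max = max; }
        @Override public String toString() { return "[" + min + ", " + max + "]"; }
    }

    /** Reads the getters unconditionally, so unset stats look like min=0, max=0. */
    static Range naiveRange(LongStats stats) {
        return new Range(stats.getLowValue(), stats.getHighValue());
    }

    /** Returns a range only when both bounds were actually assigned. */
    static Optional<Range> guardedRange(LongStats stats) {
        if (stats.isSetLowValue() && stats.isSetHighValue()) {
            return Optional.of(new Range(stats.getLowValue(), stats.getHighValue()));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        LongStats unset = new LongStats();
        System.out.println("naive, unset:     " + naiveRange(unset));
        System.out.println("guarded, unset:   "
                + guardedRange(unset).map(Range::toString).orElse("unknown"));

        LongStats populated = new LongStats();
        populated.setLowValue(100);
        populated.setHighValue(500);
        System.out.println("guarded, present: "
                + guardedRange(populated).map(Range::toString).orElse("unknown"));
    }
}
```

With the naive variant, a consumer cannot distinguish "the column really spans [0, 0]" from "no min/max was collected"; the guarded variant keeps the second case explicit so that an estimator can fall back to a range-independent heuristic.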
--
This message was sent by Atlassian Jira
(v8.20.10#820010)