Konstantin Bereznyakov created HIVE-29332:
---------------------------------------------
Summary: Statistics calculations incorrectly use 0 as min/max
value for primitive numeric column types when not set
Key: HIVE-29332
URL: https://issues.apache.org/jira/browse/HIVE-29332
Project: Hive
Issue Type: Bug
Components: CBO
Environment: Using the master branch (commit 47a1973), the test query file
[^hive_unset_numeric_range_bug.q] generates the following
output: [^hive_unset_numeric_range_bug.q.out]
{{As you can see, the DESCRIBE FORMATTED statement shows
min 0
max 0
for multiple columns}}
The EXPLAIN EXTENDED output confirms the problem: the estimated number of rows
is 1 in multiple cases:
{{GatherStats: false
Filter Operator
  isSamplingPred: false
  predicate: col_int BETWEEN 100 AND 500 (type: boolean)
  Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
  Select Operator
    Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: COMPLETE
    Group By Operator
      aggregations: count()
      minReductionHashAggr: 0.99
      mode: hash
      outputColumnNames: _col0
      Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
      Reduce Output Operator
        bucketingVersion: 2
        null sort order:
        numBuckets: -1
        sort order:
        Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE}}
Reporter: Konstantin Bereznyakov
Assignee: Konstantin Bereznyakov
Attachments: hive_unset_numeric_range_bug.q,
hive_unset_numeric_range_bug.q.out
The StatsUtils.getColStatistics() code (for example, the line
[https://github.com/apache/hive/blob/47a1973e13799a62ea6b2da094eb6d2322ca7467/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L805])
builds a new Range object from the getLowValue()/getHighValue() values even if
the underlying Stats class instance does not have them assigned. This is
typically not a problem for column types corresponding to Java classes;
however, the int/float/double data types correspond to Java primitives in the
Thrift classes, and the default values for these are 0 or 0.0 when not set
(i.e., when min/max stats are unavailable).
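The difference between reading the getters unconditionally and checking whether the bounds were ever assigned can be sketched as follows. This is a minimal, self-contained illustration, not the Hive code: LongStats and Range are hypothetical stand-ins for a Thrift-generated stats struct and for the column Range, and the isSetLowValue()/isSetHighValue() methods are modeled on the isSet-style accessors that Thrift typically generates for optional fields.

```java
import java.util.Optional;

public class UnsetRangeSketch {

    /** Hypothetical stand-in for a Thrift-generated stats struct with optional primitive fields. */
    static final class LongStats {
        private long lowValue;    // Java default is 0 when never assigned
        private long highValue;   // Java default is 0 when never assigned
        private boolean lowSet;
        private boolean highSet;

        void setLowValue(long v) { lowValue = v; lowSet = true; }
        void setHighValue(long v) { highValue = v; highSet = true; }
        long getLowValue() { return lowValue; }
        long getHighValue() { return highValue; }
        boolean isSetLowValue() { return lowSet; }
        boolean isSetHighValue() { return highSet; }
    }

    /** Simple min/max pair (hypothetical; not Hive's Range class). */
    static final class Range {
        final long min;
        final long max;
        Range(long min, long max) { this.min = min; this.max = max; }
    }

    /** Problematic pattern: unset stats silently become the range [0, 0]. */
    static Range naiveRange(LongStats s) {
        return new Range(s.getLowValue(), s.getHighValue());
    }

    /** Guarded pattern: no range is produced unless both bounds were assigned. */
    static Optional<Range> guardedRange(LongStats s) {
        if (!s.isSetLowValue() || !s.isSetHighValue()) {
            return Optional.empty();
        }
        return Optional.of(new Range(s.getLowValue(), s.getHighValue()));
    }

    public static void main(String[] args) {
        LongStats unset = new LongStats();
        Range r = naiveRange(unset);
        System.out.println("naive: [" + r.min + ", " + r.max + "]");               // naive: [0, 0]
        System.out.println("guarded present: " + guardedRange(unset).isPresent()); // guarded present: false
    }
}
```

With the guarded variant, a caller can tell "min/max unknown" apart from a genuine range of [0, 0] and choose a fallback estimate accordingly.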
Most code paths for other data types have more advanced logic for min/max
values for the column Range.
The problem with defaulting to 0 for primitive numeric types is severe
underestimation of the number of rows when min/max values are unavailable. This
underestimation could hurt query performance because the resulting plan uses
too little parallelism.
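To illustrate why a bogus [0, 0] range collapses the estimate for a predicate such as col_int BETWEEN 100 AND 500, here is a rough sketch using a simplified uniform-distribution model. It is not Hive's actual estimator: the method names, the row count of 10,000, the "true" range of [0, 1000], and the clamp to a minimum of 1 row are all assumptions made for this example.

```java
public class RangeSelectivitySketch {

    /**
     * Fraction of rows expected to satisfy "col BETWEEN lo AND hi", assuming values are
     * uniformly distributed over [min, max]. A simplified model, not Hive's estimator.
     */
    static double betweenSelectivity(long min, long max, long lo, long hi) {
        long overlapLo = Math.max(min, lo);
        long overlapHi = Math.min(max, hi);
        if (overlapLo > overlapHi) {
            return 0.0;   // predicate interval misses the column range entirely
        }
        if (max == min) {
            return 1.0;   // single-valued column that lies inside the interval
        }
        return (double) (overlapHi - overlapLo) / (double) (max - min);
    }

    /** Row estimate clamped to at least 1 (the clamp is an assumption of this sketch). */
    static long estimateRows(long numRows, double selectivity) {
        return Math.max(1L, Math.round(numRows * selectivity));
    }

    public static void main(String[] args) {
        long numRows = 10_000;
        // Bogus range [0, 0] produced by unset primitive fields.
        long bogus = estimateRows(numRows, betweenSelectivity(0, 0, 100, 500));
        // Hypothetical true range [0, 1000].
        long real = estimateRows(numRows, betweenSelectivity(0, 1000, 100, 500));
        System.out.println("bogus range estimate: " + bogus);   // bogus range estimate: 1
        System.out.println("real range estimate:  " + real);    // real range estimate:  4000
    }
}
```

Under these assumptions the predicate interval never overlaps [0, 0], so the estimate drops to the minimum of 1 row regardless of the table size.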