Konstantin Bereznyakov created HIVE-29332:
---------------------------------------------

             Summary: Statistics calculations incorrectly use 0 as min/max 
value for primitive numeric column types when not set
                 Key: HIVE-29332
                 URL: https://issues.apache.org/jira/browse/HIVE-29332
             Project: Hive
          Issue Type: Bug
          Components: CBO
         Environment: Using the master branch (commit 47a1973), the test query 
file [^hive_unset_numeric_range_bug.q] generates the following output: 
[^hive_unset_numeric_range_bug.q.out]


{{As you can see, the DESCRIBE FORMATTED statement shows
min                     0                   
max                     0  
for multiple columns}}

The EXPLAIN EXTENDED output confirms the problem: the estimated number of rows 
is 1 in multiple cases:


{{GatherStats: false
            Filter Operator
              isSamplingPred: false
              predicate: col_int BETWEEN 100 AND 500 (type: boolean)
              Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column 
stats: COMPLETE
              Select Operator
                Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
Column stats: COMPLETE
                Group By Operator
                  aggregations: count()
                  minReductionHashAggr: 0.99
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE
                  Reduce Output Operator
                    bucketingVersion: 2
                    null sort order: 
                    numBuckets: -1
                    sort order: 
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: COMPLETE}}
            Reporter: Konstantin Bereznyakov
            Assignee: Konstantin Bereznyakov
         Attachments: hive_unset_numeric_range_bug.q, 
hive_unset_numeric_range_bug.q.out

The StatsUtils.getColStatistics() code (for example, the line 
[https://github.com/apache/hive/blob/47a1973e13799a62ea6b2da094eb6d2322ca7467/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L805])
 builds a new Range object from getLowValue()/getHighValue() even if the 
underlying Stats class instance does not have those values assigned. This is 
typically not a problem for column types that correspond to Java classes; 
however, the int/float/double datatypes correspond to Java primitives in the 
Thrift classes, and these default to 0 or 0.0 when not set (i.e., when min/max 
stats are unavailable).
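
To make the failure mode concrete, here is a minimal sketch of the difference 
between reading the getters unconditionally and checking the "is set" flags 
first. LongStatsSketch, RangeSketch and UnsetRangeSketch are hypothetical 
stand-ins written for this illustration; they are not Hive or Thrift classes, 
and the guard shown is only one possible shape of a fix, not the actual patch.

```java
import java.util.Optional;

// Hypothetical stand-in for a Thrift-style stats struct with optional
// primitive fields. It is NOT Hive's real column statistics class.
final class LongStatsSketch {
    private long lowValue;    // a Java primitive defaults to 0 when never assigned
    private long highValue;
    private boolean lowSet;
    private boolean highSet;

    void setLowValue(long v)  { lowValue = v;  lowSet = true; }
    void setHighValue(long v) { highValue = v; highSet = true; }
    long getLowValue()        { return lowValue; }
    long getHighValue()       { return highValue; }
    boolean isSetLowValue()   { return lowSet; }
    boolean isSetHighValue()  { return highSet; }
}

// Hypothetical stand-in for a column value range.
final class RangeSketch {
    final long min;
    final long max;

    RangeSketch(long min, long max) {
        this.min = min;
        this.max = max;
    }
}

final class UnsetRangeSketch {
    // Unguarded: unset stats silently become the range [0, 0].
    static RangeSketch unguardedRange(LongStatsSketch stats) {
        return new RangeSketch(stats.getLowValue(), stats.getHighValue());
    }

    // Guarded (assumed shape of a fix): report "no range" when either bound is unset.
    static Optional<RangeSketch> guardedRange(LongStatsSketch stats) {
        if (!stats.isSetLowValue() || !stats.isSetHighValue()) {
            return Optional.empty();
        }
        return Optional.of(new RangeSketch(stats.getLowValue(), stats.getHighValue()));
    }
}
```

With a freshly constructed LongStatsSketch, unguardedRange() yields min = 0 and 
max = 0, which matches what DESCRIBE FORMATTED shows above, while 
guardedRange() yields an empty Optional that callers can treat as "range 
unknown".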

Most code paths for other data types have more advanced logic for min/max 
values for the column Range.

The problem with defaulting to 0 for primitive numeric types is severe 
underestimation of the number of rows when min/max values are unavailable. This 
underestimation can degrade performance because too little parallelism is 
used.
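
The effect on the row estimate can be illustrated with a deliberately 
simplified model. The sketch below assumes a uniform distribution over 
[min, max], clamps the estimate to at least 1 row, and uses an arbitrary 
one-third fallback when the range is unknown. All of these are assumptions made 
for illustration; BetweenEstimateSketch is a hypothetical class and this is not 
Hive's actual estimator.

```java
// Hypothetical, simplified row estimator for "col BETWEEN lo AND hi".
// NOT Hive's implementation; every heuristic here is an assumption.
final class BetweenEstimateSketch {
    // colMin/colMax are null when min/max statistics are unavailable.
    static long estimateRows(long numRows, Long colMin, Long colMax, long lo, long hi) {
        if (colMin == null || colMax == null) {
            // Assumed fallback: with no range information, guess one third of the rows.
            return Math.max(1L, numRows / 3);
        }
        long overlapLo = Math.max(colMin, lo);
        long overlapHi = Math.min(colMax, hi);
        if (overlapLo > overlapHi) {
            // The predicate interval lies outside the column range: clamp to 1 row.
            return 1L;
        }
        // Fraction of the (inclusive) column range covered by the predicate.
        double width = (double) (colMax - colMin) + 1.0;
        double overlap = (double) (overlapHi - overlapLo) + 1.0;
        return Math.max(1L, Math.round(numRows * overlap / width));
    }
}
```

Under this model, a table of 1000 rows whose unset range is read as [0, 0] 
gives an estimate of 1 row for col_int BETWEEN 100 AND 500, because the 
predicate interval does not overlap the range at all. Passing null bounds 
instead lets the estimator fall back to a coarser but far less extreme guess.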



