[ 
https://issues.apache.org/jira/browse/HIVE-29332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-29332:
----------------------------------
    Fix Version/s: 4.3.0
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

> Statistics calculations incorrectly use 0 as min/max value for primitive 
> numeric column types when not set
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29332
>                 URL: https://issues.apache.org/jira/browse/HIVE-29332
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>    Affects Versions: 4.3.0
>         Environment: Using the master branch (commit 47a1973), the test query 
> file [^hive_unset_numeric_range_bug.q] generates the following output: 
> [^hive_unset_numeric_range_bug.q.out]
> {{As you can see, the DESCRIBE FORMATTED statement shows
> min                     0                   
> max                     0  
> for multiple columns}}
> The results of EXPLAIN EXTENDED confirm the problem: the estimated number of 
> rows is 1 in multiple cases:
> {{GatherStats: false
>             Filter Operator
>               isSamplingPred: false
>               predicate: col_int BETWEEN 100 AND 500 (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
> Column stats: COMPLETE
>               Select Operator
>                 Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE 
> Column stats: COMPLETE
>                 Group By Operator
>                   aggregations: count()
>                   minReductionHashAggr: 0.99
>                   mode: hash
>                   outputColumnNames: _col0
>                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
> Column stats: COMPLETE
>                   Reduce Output Operator
>                     bucketingVersion: 2
>                     null sort order: 
>                     numBuckets: -1
>                     sort order: 
>                     Statistics: Num rows: 1 Data size: 8 Basic stats: 
> COMPLETE Column stats: COMPLETE}}
>            Reporter: Konstantin Bereznyakov
>            Assignee: Konstantin Bereznyakov
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 4.3.0
>
>         Attachments: HIVE-29322 PR 6208.patch.txt, 
> hive_unset_numeric_range_bug.q, hive_unset_numeric_range_bug.q.out
>
>
> StatsUtils.getColStatistics() builds a new Range object from 
> getLowValue()/getHighValue() even when the underlying Stats class instance 
> does not have those values assigned; see, for example, the line 
> [https://github.com/apache/hive/blob/47a1973e13799a62ea6b2da094eb6d2322ca7467/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L805].
> This is typically not a problem for column types that correspond to Java 
> classes. However, the int/float/double datatypes correspond to Java 
> primitives in the Thrift classes, and these default to 0 or 0.0 when not set 
> (i.e., when min/max stats are unavailable).
> Most code paths for other data types have more advanced logic for the min/max 
> values of the column Range.
> Defaulting to 0 for primitive numeric types causes severe underestimation of 
> the number of rows when min/max values are unavailable. This underestimation 
> can hurt performance because too little parallelism is used.
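
The failure mode described above can be illustrated with a minimal sketch. It does not use Hive or Thrift: LongStatsSketch and RangeSketch are hypothetical stand-ins written for this example, and the guarded variant is only one possible way to avoid the problem, not necessarily the change made in the actual fix. Thrift-generated Java classes generally expose isSet...() methods for optional fields, which is the pattern imitated here.

```java
/** Hypothetical stand-in for a Thrift-generated struct with primitive fields. */
final class LongStatsSketch {
    private long lowValue;   // a primitive long reads as 0 until assigned
    private long highValue;
    private boolean lowSet;
    private boolean highSet;

    void setLowValue(long v)  { lowValue = v;  lowSet = true; }
    void setHighValue(long v) { highValue = v; highSet = true; }
    long getLowValue()  { return lowValue; }
    long getHighValue() { return highValue; }
    boolean isSetLowValue()  { return lowSet; }
    boolean isSetHighValue() { return highSet; }
}

/** Hypothetical range holder; a null bound means "unknown". */
final class RangeSketch {
    final Long min;
    final Long max;

    RangeSketch(Long min, Long max) {
        this.min = min;
        this.max = max;
    }

    /** Reads the getters unconditionally, so unset bounds silently become 0. */
    static RangeSketch naive(LongStatsSketch s) {
        return new RangeSketch(Long.valueOf(s.getLowValue()),
                               Long.valueOf(s.getHighValue()));
    }

    /** Checks the isSet flags first and leaves unset bounds as null. */
    static RangeSketch guarded(LongStatsSketch s) {
        Long lo = s.isSetLowValue() ? Long.valueOf(s.getLowValue()) : null;
        Long hi = s.isSetHighValue() ? Long.valueOf(s.getHighValue()) : null;
        return new RangeSketch(lo, hi);
    }

    public static void main(String[] args) {
        LongStatsSketch unset = new LongStatsSketch(); // min/max never assigned
        RangeSketch n = naive(unset);
        RangeSketch g = guarded(unset);
        System.out.println("naive:   [" + n.min + ", " + n.max + "]");
        System.out.println("guarded: [" + g.min + ", " + g.max + "]");
    }
}
```

With the naive variant, a predicate such as col_int BETWEEN 100 AND 500 is compared against a range whose bounds are both 0, which is consistent with the row estimate of 1 shown in the EXPLAIN EXTENDED output. With the guarded variant, the estimator can tell that the bounds are unknown and fall back to a different heuristic.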



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
