Konstantin Bereznyakov created HIVE-29422:
---------------------------------------------

             Summary: Statistics: Negative dataSize after runtime scaling - 
scaleToRowCount() clones numNulls without scaling
                 Key: HIVE-29422
                 URL: https://issues.apache.org/jira/browse/HIVE-29422
             Project: Hive
          Issue Type: Bug
            Reporter: Konstantin Bereznyakov


Reproduced on current master today. The problem occurs when query execution 
runs into an OOM and the query is then re-executed using runtime statistics.

The following conditions are essential:
- The initial statistics estimate expects high selectivity, while the 
selectivity actually observed at runtime is much lower
- The column has a significant number of NULL values

When this happens, 
[Statistics.scaleToRowCount()|https://github.com/apache/hive/blob/bd816ff74369e9a8bf8fa6a1aa7eb06850c59fef/ql/src/java/org/apache/hadoop/hive/ql/plan/Statistics.java#L340]
 scales the column stats object down to the row count actually observed at 
runtime but copies numNulls unchanged. numNulls can then exceed the scaled row 
count, which later results in a negative data size estimate in 
[getDataSizeFromColumnStats()|https://github.com/apache/hive/blob/bd816ff74369e9a8bf8fa6a1aa7eb06850c59fef/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L1815]
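
The arithmetic behind the failure can be illustrated with a small standalone 
sketch. It is a simplified model, not the Hive code: the class ColStats, the 
helper names, the sample numbers, and the size formula (non-null rows times 
average column length) are all assumptions made for illustration, and the 
proportional scaling shown is only one possible fix.

```java
// Simplified, hypothetical model of the reported problem; none of these
// names are Hive APIs.
final class NumNullsScalingSketch {

    // Hypothetical stand-in for per-column statistics.
    static final class ColStats {
        final long numNulls;
        final double avgColLen;

        ColStats(long numNulls, double avgColLen) {
            this.numNulls = numNulls;
            this.avgColLen = avgColLen;
        }
    }

    // Assumed size estimate: non-null row count times average column length.
    static long dataSize(long numRows, ColStats cs) {
        long nonNullCount = numRows - cs.numNulls;
        return (long) (nonNullCount * cs.avgColLen);
    }

    // Models the reported behavior: the row count is scaled elsewhere,
    // but numNulls is copied over unchanged.
    static ColStats scaleKeepingNulls(ColStats cs) {
        return new ColStats(cs.numNulls, cs.avgColLen);
    }

    // One possible fix (an assumption, not the committed patch): scale
    // numNulls by the same ratio as the row count and clamp it.
    static ColStats scaleWithNulls(ColStats cs, long oldRows, long newRows) {
        long scaled = oldRows == 0
                ? 0
                : Math.round((double) cs.numNulls * newRows / oldRows);
        return new ColStats(Math.min(scaled, newRows), cs.avgColLen);
    }

    public static void main(String[] args) {
        long oldRows = 1_000_000L; // compile-time estimate
        long newRows = 1_000L;     // row count observed at runtime
        ColStats cs = new ColStats(400_000L, 8.0);

        // numNulls (400000) exceeds newRows (1000): prints -3192000
        System.out.println(dataSize(newRows, scaleKeepingNulls(cs)));

        // numNulls scaled to 400: prints 4800
        System.out.println(dataSize(newRows, scaleWithNulls(cs, oldRows, newRows)));
    }
}
```

With numNulls left at 400000 against a scaled row count of 1000, the non-null 
count goes negative and so does the size estimate. Scaling numNulls by the 
same ratio keeps it within the new row count.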



