[ 
https://issues.apache.org/jira/browse/HIVE-29422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Bereznyakov updated HIVE-29422:
------------------------------------------
    Description: 
Reproduced on current master today. The problem occurs when query 
execution runs into an OOM and the query is then re-executed using runtime statistics.

The following conditions are essential:
 - The initial statistics estimate expects high selectivity, while the 
selectivity actually observed at runtime is much lower
 - The column has a significant number of NULL records

When this happens, 
[Statistics.scaleToRowCount()|https://github.com/apache/hive/blob/bd816ff74369e9a8bf8fa6a1aa7eb06850c59fef/ql/src/java/org/apache/hadoop/hive/ql/plan/Statistics.java#L340]
 scales the column stats object down to the row count observed at runtime but 
keeps numNulls unchanged. numNulls can then exceed the scaled row count, which 
can later produce a negative data size estimate in 
[getDataSizeFromColumnStats()|https://github.com/apache/hive/blob/bd816ff74369e9a8bf8fa6a1aa7eb06850c59fef/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L1815].
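
The arithmetic behind this can be sketched as follows. This is a simplified illustration, not the Hive code: the ColStats class, the dataSize() formula (non-null rows times average column length), and both scaling helpers are hypothetical stand-ins, and the numbers are made up rather than taken from the attached test. The proportional scaling at the end is only one possible remedy, not necessarily the fix that will be adopted.

```java
public class NumNullsScalingSketch {

    /** Hypothetical stand-in for a column statistics object; not Hive's ColStatistics. */
    static final class ColStats {
        final long numNulls;
        final double avgColLen;

        ColStats(long numNulls, double avgColLen) {
            this.numNulls = numNulls;
            this.avgColLen = avgColLen;
        }
    }

    /** Assumed, simplified size estimate: non-null rows times average column length. */
    static long dataSize(long numRows, ColStats cs) {
        long nonNullCount = numRows - cs.numNulls; // goes negative if numNulls > numRows
        return Math.round(nonNullCount * cs.avgColLen);
    }

    /** Scaling that copies numNulls unchanged, as described in this report. */
    static ColStats scaleKeepingNulls(ColStats cs, long oldRows, long newRows) {
        return new ColStats(cs.numNulls, cs.avgColLen);
    }

    /** One possible remedy: scale numNulls by the same ratio and cap it at newRows. */
    static ColStats scaleWithNulls(ColStats cs, long oldRows, long newRows) {
        if (oldRows <= 0) {
            return new ColStats(Math.min(cs.numNulls, newRows), cs.avgColLen);
        }
        long scaled = Math.round((double) cs.numNulls * newRows / oldRows);
        return new ColStats(Math.min(scaled, newRows), cs.avgColLen);
    }

    public static void main(String[] args) {
        // Illustrative values: 1000 estimated rows, 256 of them NULL, 1 row at runtime.
        ColStats original = new ColStats(256, 1.0);
        long oldRows = 1000;
        long newRows = 1;

        System.out.println("numNulls kept:   "
            + dataSize(newRows, scaleKeepingNulls(original, oldRows, newRows)));
        System.out.println("numNulls scaled: "
            + dataSize(newRows, scaleWithNulls(original, oldRows, newRows)));
    }
}
```

With numNulls kept at its original value, the non-null count in this sketch is 1 - 256, so the estimate is negative; with numNulls scaled and capped, the estimate stays non-negative.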

 

The attached test file [^runtime_stats_negative_datasize.q] reproduces the problem; 
its output [^runtime_stats_negative_datasize.q.out] contains:

{{Statistics: Num rows: 1 Data size: -255 Basic stats: PARTIAL Column stats: COMPLETE}}

  was:
Reproduced on the current master today, the problem happens when query 
execution runs into an OOM then re-executes the query using runtime statistics.

The following conditions are essential
 - The initial statistics estimate expects high selectivity, while the actually 
determined runtime selectivity is much lower
 - The column has a significant # of NULL records

When this happens, 
[Statistics.scaleToRowCount()|https://github.com/apache/hive/blob/bd816ff74369e9a8bf8fa6a1aa7eb06850c59fef/ql/src/java/org/apache/hadoop/hive/ql/plan/Statistics.java#L340]
 scales the column stats object down to the actually calculated # of rows but 
keeps the value of numNulls. This could later result in a negative data size 
estimation in 
[getDataSizeFromColumnStats()|https://github.com/apache/hive/blob/bd816ff74369e9a8bf8fa6a1aa7eb06850c59fef/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L1815]


> Statistics: Negative dataSize after runtime scaling - scaleToRowCount() 
> clones numNulls without scaling
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29422
>                 URL: https://issues.apache.org/jira/browse/HIVE-29422
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Konstantin Bereznyakov
>            Priority: Major
>         Attachments: runtime_stats_negative_datasize.q, 
> runtime_stats_negative_datasize.q.out
>
>
> Reproduced on the current master today, the problem happens when query 
> execution runs into an OOM then re-executes the query using runtime 
> statistics.
> The following conditions are essential
>  - The initial statistics estimate expects high selectivity, while the 
> actually determined runtime selectivity is much lower
>  - The column has a significant # of NULL records
> When this happens, 
> [Statistics.scaleToRowCount()|https://github.com/apache/hive/blob/bd816ff74369e9a8bf8fa6a1aa7eb06850c59fef/ql/src/java/org/apache/hadoop/hive/ql/plan/Statistics.java#L340]
>  scales the column stats object down to the actually calculated # of rows but 
> keeps the value of numNulls. This could later result in a negative data size 
> estimation in 
> [getDataSizeFromColumnStats()|https://github.com/apache/hive/blob/bd816ff74369e9a8bf8fa6a1aa7eb06850c59fef/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L1815]
>  
> {{The test file [^runtime_stats_negative_datasize.q] confirms the problem 
> with its output [^runtime_stats_negative_datasize.q.out]
> Statistics: Num rows: 1 Data size: -255 Basic stats: PARTIAL Column stats: 
> COMPLETE
> }}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
