[
https://issues.apache.org/jira/browse/HIVE-29422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantin Bereznyakov updated HIVE-29422:
------------------------------------------
Description:
Reproduced on the current master today. The problem occurs when query
execution runs into an OOM and the query is then re-executed using runtime
statistics.
The following conditions are essential:
- The initial statistics estimate expects high selectivity, while the actually
determined runtime selectivity is much lower
- The column has a significant # of NULL records
When this happens,
[Statistics.scaleToRowCount()|https://github.com/apache/hive/blob/bd816ff74369e9a8bf8fa6a1aa7eb06850c59fef/ql/src/java/org/apache/hadoop/hive/ql/plan/Statistics.java#L340]
scales the column stats object down to the actually calculated # of rows but
keeps the value of numNulls. This could later result in a negative data size
estimation in
[getDataSizeFromColumnStats()|https://github.com/apache/hive/blob/bd816ff74369e9a8bf8fa6a1aa7eb06850c59fef/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L1815]
The test file [^runtime_stats_negative_datasize.q] confirms the problem with
its output [^runtime_stats_negative_datasize.q.out]
{quote}Statistics: Num rows: 1 Data size: -255 Basic stats: PARTIAL Column
stats: COMPLETE
{quote}
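The mechanism can be illustrated with a minimal, self-contained sketch. It is not Hive code: the ColStats class, the method names, and the size formula (non-null rows times average column length) are simplified assumptions chosen only to show how an unscaled numNulls can exceed the scaled row count and push the estimate below zero. The proportional scaling with clamping is one possible mitigation, not necessarily the fix that will be applied.

```java
public class NumNullsScalingSketch {

    /** Hypothetical, simplified stand-in for a column statistics object. */
    public static final class ColStats {
        final long numNulls;
        final double avgColLen;

        ColStats(long numNulls, double avgColLen) {
            this.numNulls = numNulls;
            this.avgColLen = avgColLen;
        }
    }

    /** Assumed size model: (numRows - numNulls) * avgColLen. Not Hive's exact formula. */
    public static long estimateDataSize(long numRows, ColStats cs) {
        long nonNullCount = numRows - cs.numNulls;
        return Math.round(nonNullCount * cs.avgColLen);
    }

    /** Models the reported behavior: the row count shrinks, numNulls is copied unchanged. */
    public static ColStats scaleKeepingNulls(ColStats cs) {
        return new ColStats(cs.numNulls, cs.avgColLen);
    }

    /** One possible mitigation: scale numNulls by the same ratio and clamp to [0, newRows]. */
    public static ColStats scaleWithNulls(ColStats cs, long oldRows, long newRows) {
        long upperBound = Math.max(0, newRows);
        if (oldRows <= 0) {
            // No meaningful ratio available; only make sure numNulls fits the new row count.
            return new ColStats(Math.min(cs.numNulls, upperBound), cs.avgColLen);
        }
        long scaled = Math.round(cs.numNulls * ((double) newRows / oldRows));
        long clamped = Math.max(0, Math.min(scaled, upperBound));
        return new ColStats(clamped, cs.avgColLen);
    }

    public static void main(String[] args) {
        // Illustrative numbers: 1000 rows estimated at compile time, 400 of them NULL,
        // while the runtime statistics report a single row.
        ColStats original = new ColStats(400, 8.0);
        long oldRows = 1000;
        long newRows = 1;

        long unscaled = estimateDataSize(newRows, scaleKeepingNulls(original));
        long scaled = estimateDataSize(newRows, scaleWithNulls(original, oldRows, newRows));

        System.out.println("numNulls kept as-is: dataSize = " + unscaled);
        System.out.println("numNulls scaled:     dataSize = " + scaled);
    }
}
```

With numNulls kept at 400 and the row count scaled to 1, the non-null count in this model becomes negative, and so does the data size. Scaling numNulls along with the row count keeps the estimate non-negative.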
was:
Reproduced on the current master today, the problem happens when query
execution runs into an OOM then re-executes the query using runtime statistics.
The following conditions are essential
- The initial statistics estimate expects high selectivity, while the actually
determined runtime selectivity is much lower
- The column has a significant # of NULL records
When this happens,
[Statistics.scaleToRowCount()|https://github.com/apache/hive/blob/bd816ff74369e9a8bf8fa6a1aa7eb06850c59fef/ql/src/java/org/apache/hadoop/hive/ql/plan/Statistics.java#L340]
scales the column stats object down to the actually calculated # of rows but
keeps the value of numNulls. This could later result in a negative data size
estimation in
[getDataSizeFromColumnStats()|https://github.com/apache/hive/blob/bd816ff74369e9a8bf8fa6a1aa7eb06850c59fef/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L1815]
The test file [^runtime_stats_negative_datasize.q] confirms the problem with
its output [^runtime_stats_negative_datasize.q.out]
{{Statistics: Num rows: 1 Data size: -255 Basic stats: PARTIAL Column stats:
COMPLETE
}}
> Statistics: Negative dataSize after runtime scaling - scaleToRowCount()
> clones numNulls without scaling
> -------------------------------------------------------------------------------------------------------
>
> Key: HIVE-29422
> URL: https://issues.apache.org/jira/browse/HIVE-29422
> Project: Hive
> Issue Type: Bug
> Reporter: Konstantin Bereznyakov
> Priority: Major
> Attachments: runtime_stats_negative_datasize.q,
> runtime_stats_negative_datasize.q.out
>
>
> Reproduced on the current master today. The problem occurs when query
> execution runs into an OOM and the query is then re-executed using runtime
> statistics.
> The following conditions are essential:
> - The initial statistics estimate expects high selectivity, while the
> actually determined runtime selectivity is much lower
> - The column has a significant # of NULL records
> When this happens,
> [Statistics.scaleToRowCount()|https://github.com/apache/hive/blob/bd816ff74369e9a8bf8fa6a1aa7eb06850c59fef/ql/src/java/org/apache/hadoop/hive/ql/plan/Statistics.java#L340]
> scales the column stats object down to the actually calculated # of rows but
> keeps the value of numNulls. This could later result in a negative data size
> estimation in
> [getDataSizeFromColumnStats()|https://github.com/apache/hive/blob/bd816ff74369e9a8bf8fa6a1aa7eb06850c59fef/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L1815]
>
> The test file [^runtime_stats_negative_datasize.q] confirms the problem with
> its output [^runtime_stats_negative_datasize.q.out]
> {quote}Statistics: Num rows: 1 Data size: -255 Basic stats: PARTIAL Column
> stats: COMPLETE
> {quote}