[ 
https://issues.apache.org/jira/browse/PARQUET-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16443868#comment-16443868
 ] 

ASF GitHub Bot commented on PARQUET-1217:
-----------------------------------------

gszadovszky opened a new pull request #465: PARQUET-1217: Incorrect handling of 
missing values in Statistics
URL: https://github.com/apache/parquet-mr/pull/465
 
 
   In parquet-format every field in Statistics is optional, but parquet-mr 
does not handle the following scenarios properly:
   - null_count is set but min/max or min_value/max_value are not: filtering 
may fail with an NPE or produce incorrect results
     fix: check whether min/max is set before comparing against those values
   - null_count is not set: filtering treats null_count as if it were 0, so 
incorrect filtering may occur
     fix: introduce a new method in the Statistics object that reports whether 
num_nulls is set, and call it before using num_nulls for filtering
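
The two fixes described above can be illustrated with a minimal, self-contained sketch. It does not use the parquet-mr classes: `ChunkStats`, `StatsFilterSketch`, the `UNSET` sentinel, and both `canDrop*` helpers are hypothetical names invented for this example, and the actual patch may be structured differently. The sketch only shows the idea of checking that a statistic is present before relying on it.

```java
/** Hypothetical stand-in for column chunk statistics (not the parquet-mr Statistics class). */
final class ChunkStats {
    /** Assumed sentinel for "null count not written", so that 0 keeps its real meaning. */
    static final long UNSET = -1L;

    private final Long min;      // null when min was not set in the file
    private final Long max;      // null when max was not set in the file
    private final long numNulls; // UNSET when null_count was not set in the file

    ChunkStats(Long min, Long max, long numNulls) {
        this.min = min;
        this.max = max;
        this.numNulls = numNulls;
    }

    boolean hasNonNullValue() { return min != null && max != null; }
    boolean isNumNullsSet()   { return numNulls >= 0; }
    long getNumNulls()        { return numNulls; }
    long getMin()             { return min; } // call only after hasNonNullValue()
    long getMax()             { return max; } // call only after hasNonNullValue()
}

/** Hypothetical filter helpers: "true" means the chunk can safely be skipped. */
final class StatsFilterSketch {
    private StatsFilterSketch() {}

    /** Predicate: column == value (value is non-null). */
    static boolean canDropEq(ChunkStats stats, long value, long valueCount) {
        // The chunk is provably all nulls only if the null count is actually known.
        if (stats.isNumNullsSet() && stats.getNumNulls() == valueCount) {
            return true;
        }
        // Without min/max nothing can be proven, so the chunk must be kept.
        if (!stats.hasNonNullValue()) {
            return false;
        }
        return value < stats.getMin() || value > stats.getMax();
    }

    /** Predicate: column IS NULL. */
    static boolean canDropIsNull(ChunkStats stats) {
        // An unset null count must not be read as "zero nulls".
        return stats.isNumNullsSet() && stats.getNumNulls() == 0;
    }
}
```

In this sketch, a chunk with a known null count but no min/max is kept rather than compared against missing bounds, and a chunk with an unset null count is never dropped for an `IS NULL` predicate.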
   
   Author: Gabor Szadovszky <[email protected]>
   
   Closes #458 from gszadovszky/PARQUET-1217 and squashes the following commits:
   
   9d14090 [Gabor Szadovszky] Updates according to rdblue's comments
   116d1d3 [Gabor Szadovszky] PARQUET-1217: Updates according to zi's comments
   c264b50 [Gabor Szadovszky] PARQUET-1217: fix handling of unset nullCount
   2ec2fb1 [Gabor Szadovszky] PARQUET-1217: Incorrect handling of missing 
values in Statistics
   
   This change is based on b82d96218bfd37f6df95a2e8d7675d091ab61970 but is not 
a clean cherry-pick.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Incorrect handling of missing values in Statistics
> --------------------------------------------------
>
>                 Key: PARQUET-1217
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1217
>             Project: Parquet
>          Issue Type: Bug
>    Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.9.0, 1.10.0
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>             Fix For: 1.10.0
>
>
> As per the parquet-format spec, the min/max values in statistics are 
> optional. Therefore, it is possible to have {{numNulls}} in {{Statistics}} 
> without having min/max values. In {{StatisticsFilter}} we rely on the 
> method 
> [StatisticsFilter.isAllNulls(ColumnChunkMetaData)|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java#L90]
>  to handle the case of {{null}} min/max values, which is not correct in 
> the described scenario. 
>  We should check {{Statistics.hasNonNullValue()}} every time before using the 
> actual min/max values.
> In addition, we don't check whether {{null_count}} is set when reading 
> from the parquet file. We simply use the value, which is {{0}} when it is 
> unset. On the parquet-mr side the {{Statistics}} object uses the value {{0}} 
> to signal that {{num_nulls}} is unset. This is incorrect: when searching 
> for null values, we may falsely drop a column chunk thinking it has no null 
> values when the field in the statistics was simply unset.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
