Zoltán Borók-Nagy created IMPALA-6538:
-----------------------------------------

             Summary: Fix read path when Parquet min(_value)/max(_value) 
statistics contain NaN
                 Key: IMPALA-6538
                 URL: https://issues.apache.org/jira/browse/IMPALA-6538
             Project: IMPALA
          Issue Type: Sub-task
            Reporter: Zoltán Borók-Nagy


(Below I write only min and max, but the same applies to min_value and 
max_value.)

When both min and max are NaN:
 * Written by Impala:
 ** the first element in the row group is NaN, but not all elements are (Impala 
writer bug)
 ** all elements are NaN
 * Written by Hive/Parquet-mr:
 ** all elements are NaN

Either min or max is NaN, but not both:
 * Written by Impala:
 ** this cannot happen currently
 * Written by Hive/Parquet-mr:
 ** only the max can be NaN (needs to be checked)

Therefore, if both min and max are NaN, we can't use the statistics for 
filtering.

If only the max is NaN, we still have a valid lower bound.
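
The two rules above can be sketched as a small helper. This is only an illustration in Python, not Impala's actual code; the name usable_bounds is hypothetical, and treating a NaN min on its own as "no lower bound" is an assumption, since that case is not expected to occur.

```python
import math
from typing import Optional, Tuple


def usable_bounds(stat_min: float, stat_max: float) -> Tuple[Optional[float], Optional[float]]:
    """Return (lower, upper); None marks a bound that cannot be trusted.

    Hypothetical helper, not part of Impala.
    """
    # Both NaN: the statistics carry no usable range at all.
    if math.isnan(stat_min) and math.isnan(stat_max):
        return (None, None)
    # Only one side is NaN: keep the other side as a valid bound.
    lower = None if math.isnan(stat_min) else stat_min
    upper = None if math.isnan(stat_max) else stat_max
    return (lower, upper)
```

For example, usable_bounds(1.0, float("nan")) keeps 1.0 as the lower bound and drops the upper bound.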

 

A workaround could be to replace the NaNs with infinities, i.e. max => Inf, 
min => -Inf.
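
A minimal sketch of that workaround, assuming the statistics are available as plain floats. The function names are hypothetical and the check for 'x < value' is a simplified stand-in for a real row group filter.

```python
import math


def widen_nan_bounds(stat_min: float, stat_max: float):
    """Replace a NaN min with -Inf and a NaN max with +Inf (hypothetical helper)."""
    new_min = -math.inf if math.isnan(stat_min) else stat_min
    new_max = math.inf if math.isnan(stat_max) else stat_max
    return new_min, new_max


def may_contain_less_than(stat_min: float, stat_max: float, value: float) -> bool:
    """Return True if the row group may hold a row with x < value."""
    lower, _ = widen_nan_bounds(stat_min, stat_max)
    # Without widening, 'NaN < value' is False and the row group
    # would be skipped even though it may contain matching rows.
    return lower < value
```

With both statistics NaN the bounds become (-Inf, Inf), so no row group is skipped. With a valid min of 5.0 and a NaN max, a filter on 'x < 3' can still skip the row group.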

Based on my experiments, min/max statistics are not applied to predicates that 
can be true for NaN, e.g. 'NOT x < 3'
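
To illustrate why such predicates are risky for statistics-based filtering, here is a small Python sketch using IEEE 754 comparison semantics. It is not Impala's evaluator, and the assumption that the statistics are computed over the non-NaN values only is made up for the example.

```python
import math

# A row group with one NaN value.
values = [1.0, 2.0, float("nan")]

# Assumption for this example: statistics ignore NaN values.
non_nan = [v for v in values if not math.isnan(v)]
stat_min, stat_max = min(non_nan), max(non_nan)


def not_less_than_3(x: float) -> bool:
    """Evaluate 'NOT x < 3'; any comparison with NaN is False, so this is True for NaN."""
    return not (x < 3)


# Rows that really satisfy the predicate: only the NaN row.
matching_rows = [v for v in values if not_less_than_3(v)]

# A naive filter would reason "max < 3, so no row can satisfy NOT x < 3"
# and skip the row group, losing the NaN row.
naive_skip = stat_max < 3
```

Here naive_skip is True although matching_rows is not empty, which is why applying min/max statistics to such a predicate would be incorrect.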



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
