Zoltan Ivanfi created IMPALA-6527:
-------------------------------------

             Summary: NaN values lead to incorrect filtering under certain 
circumstances
                 Key: IMPALA-6527
                 URL: https://issues.apache.org/jira/browse/IMPALA-6527
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
            Reporter: Zoltan Ivanfi


h1. Summary

If the first number in a row group written by Impala is NaN, then Impala writes 
incorrect statistics in the metadata. As a result, queries that filter on the 
affected column can return incorrect results.

h1. Reproduction

First, create a Parquet table with a double column:
{noformat}
create table test_nan(val double) stored as parquet;
{noformat}
Insert two values in a single statement, the first of which is a NaN:
{noformat}
insert into test_nan values (cast('NaN' as double)), (42);
{noformat}
Check that both values are actually present in the table:
{noformat}
select * from test_nan;
+-----+
| val |
+-----+
| NaN |
| 42  |
+-----+
Fetched 2 row(s) in 0.13s
{noformat}
Filter using a condition that should match the regular number:
{noformat}
select * from test_nan where val > 0;
Fetched 0 row(s) in 0.13s
{noformat}
*Expectation*: The row with the regular number should be returned.
*Actual result*: No rows are returned.

h1. Explanation

Parquet files contain statistics metadata including the fields {{min}} and 
{{max}} or {{min_value}} and {{max_value}} (depending on the Impala version). 
If the first number is a NaN, the minimum and maximum values that Impala writes 
in the metadata are both NaN. Based on this metadata, Impala concludes that the 
row group cannot contain any value that matches the condition, so it discards 
the row group without checking the individual entries. The underlying problem 
is that the statistics were written incorrectly in the first place. (This can 
be and has been checked by using {{parquet-tools meta}} on the Parquet file.)
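
The effect on filtering can be illustrated with a minimal sketch. It assumes that a reader deciding whether a row group may satisfy {{val > 0}} simply compares the stored maximum against the threshold; the class and method names are hypothetical and this is not Impala's actual code.

```java
public class RowGroupSkipSketch {
    // Hypothetical helper (an assumption for illustration): decides from the
    // max statistic alone whether a row group may contain a value > threshold.
    static boolean mayContainGreaterThan(double statsMax, double threshold) {
        // If statsMax is NaN, this comparison is false for every threshold.
        return statsMax > threshold;
    }

    public static void main(String[] args) {
        // Correct statistics (max = 42): the row group has to be scanned.
        System.out.println(mayContainGreaterThan(42.0, 0.0));       // true
        // Statistics poisoned by NaN: the row group is skipped entirely.
        System.out.println(mayContainGreaterThan(Double.NaN, 0.0)); // false
    }
}
```

Under this assumption a NaN maximum makes the check fail for any threshold, which would explain why the row containing 42 is never read.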

What follows are just my assumptions without checking the actual code: While 
writing data, Impala keeps track of the smallest and largest value encountered 
so far. Let's call them min_so_far and max_so_far, respectively.

Initially, the first (non-NULL) value is set as both the min_so_far and 
max_so_far. Then each new value is compared against min_so_far and max_so_far, 
updating each one if necessary. In pseudo-code:
{code:java}
if (new_value < min_so_far) {
  min_so_far = new_value;
}
if (new_value > max_so_far) {
  max_so_far = new_value;
}
{code}
The problem is that any ordered comparison (such as < or >) involving NaN 
evaluates to false, so if NaN is already in min_so_far, then no value can ever 
replace it and NaN will be stuck there.

On the positive side, min_so_far can only become NaN if the first value in the 
row group is NaN. If the first value is not NaN, then NaN can never replace 
min_so_far, since the comparison will always return false when it involves a 
NaN.
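
The asymmetry described above can be reproduced with a small, self-contained sketch. It only mirrors the assumed min/max tracking logic; {{NaiveMinMaxSketch}} and {{naiveMinMax}} are hypothetical names, and the input is assumed to be non-empty.

```java
public class NaiveMinMaxSketch {
    // Hypothetical tracker following the assumed logic: seed both statistics
    // with the first value, then update them with plain < and > comparisons.
    static double[] naiveMinMax(double[] values) {
        double minSoFar = values[0];
        double maxSoFar = values[0];
        for (int i = 1; i < values.length; i++) {
            // Both comparisons are false whenever either operand is NaN.
            if (values[i] < minSoFar) minSoFar = values[i];
            if (values[i] > maxSoFar) maxSoFar = values[i];
        }
        return new double[] {minSoFar, maxSoFar};
    }

    public static void main(String[] args) {
        double[] nanFirst = naiveMinMax(new double[] {Double.NaN, 42.0});
        double[] nanLater = naiveMinMax(new double[] {42.0, Double.NaN});
        System.out.println(nanFirst[0] + " " + nanFirst[1]); // NaN NaN
        System.out.println(nanLater[0] + " " + nanLater[1]); // 42.0 42.0
    }
}
```

With NaN as the first value both statistics stay NaN, while a NaN that arrives later is silently ignored and the statistics remain usable.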



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
