[jira] [Created] (ARROW-12264) [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down

Antoine Pitrou (Jira) Wed, 07 Apr 2021 09:26:04 -0700

Antoine Pitrou created ARROW-12264:
--------------------------------------

             Summary: [C++][Dataset] Handle NaNs correctly in Parquet predicate 
push-down
                 Key: ARROW-12264
                 URL: https://issues.apache.org/jira/browse/ARROW-12264
             Project: Apache Arrow
          Issue Type: Task
          Components: C++, Parquet
            Reporter: Antoine Pitrou
             Fix For: 5.0.0



The Parquet spec (in parquet.thrift) says the following about handling of 
floating-point statistics:
{code}
   * (*) Because the sorting order is not specified properly for floating
   *     point values (relations vs. total ordering) the following
   *     compatibility rules should be applied when reading statistics:
   *     - If the min is a NaN, it should be ignored.
   *     - If the max is a NaN, it should be ignored.
   *     - If the min is +0, the row group may contain -0 values as well.
   *     - If the max is -0, the row group may contain +0 values as well.
   *     - When looking for NaN values, min and max should be ignored.
{code}
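
As a rough illustration of the first four rules, here is a minimal sketch of how raw min/max statistics could be turned into bounds that are safe to use. The names {{UsableBounds}} and {{SanitizeStatistics}} are hypothetical and are not part of Arrow or parquet-cpp. The last rule (looking for NaN values) is not covered here.

```cpp
#include <cmath>
#include <optional>

// Hypothetical container: an empty optional means "no usable bound".
struct UsableBounds {
  std::optional<double> min;
  std::optional<double> max;
};

// Apply the compatibility rules quoted above to raw statistics.
UsableBounds SanitizeStatistics(double raw_min, double raw_max) {
  UsableBounds out;
  if (!std::isnan(raw_min)) {
    // A min of +0 may hide -0 values, so normalize any zero to -0.
    out.min = (raw_min == 0.0) ? -0.0 : raw_min;
  }
  if (!std::isnan(raw_max)) {
    // A max of -0 may hide +0 values, so normalize any zero to +0.
    out.max = (raw_max == 0.0) ? 0.0 : raw_max;
  }
  return out;
}
```

A NaN bound comes back as an empty optional, meaning that side of the range cannot be used for pruning.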

It appears that the dataset code uses the following filter expression when 
doing Parquet predicate push-down (in {{file_parquet.cc}}):
{code:c++}
    return and_(greater_equal(field_expr, literal(min)),
                less_equal(field_expr, literal(max)));
{code}

Any ordered comparison involving NaN is false, so a NaN value will fail that 
filter and yet may be found in the given Parquet column chunk.

We may instead need a "greater_equal_or_nan" comparison (and presumably a 
matching one for less_equal) that returns true if either operand is NaN.
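
To make the problem concrete, here is a minimal scalar sketch, assuming plain doubles in place of dataset expressions. {{InRangeNaive}} mirrors the and_(greater_equal, less_equal) filter above. {{GreaterEqualOrNan}}, {{LessEqualOrNan}} and {{MayContain}} are hypothetical names that show one possible reading of the proposed semantics, not existing Arrow functions.

```cpp
#include <cmath>

// Mirrors the current filter: min <= x && x <= max.
// Every ordered comparison with NaN is false, so a NaN x is rejected.
bool InRangeNaive(double x, double min, double max) {
  return x >= min && x <= max;
}

// Hypothetical comparison: true if either operand is NaN, else a >= b.
bool GreaterEqualOrNan(double a, double b) {
  return std::isnan(a) || std::isnan(b) || a >= b;
}

// Hypothetical counterpart for the upper bound.
bool LessEqualOrNan(double a, double b) {
  return std::isnan(a) || std::isnan(b) || a <= b;
}

// Conservative check: may a chunk with these statistics contain x?
bool MayContain(double x, double min, double max) {
  return GreaterEqualOrNan(x, min) && LessEqualOrNan(x, max);
}
```

With this reading, a NaN value or a NaN bound never causes a chunk to be skipped, while ordinary out-of-range values are still pruned.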



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
