[jira] [Updated] (ARROW-12264) [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down

Jira Wed, 20 Apr 2022 03:35:05 -0700


     [ https://issues.apache.org/jira/browse/ARROW-12264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Raúl Cumplido updated ARROW-12264:
----------------------------------
    Fix Version/s: 9.0.0
                       (was: 8.0.0)

> [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down
> -------------------------------------------------------------------
>
>                 Key: ARROW-12264
>                 URL: https://issues.apache.org/jira/browse/ARROW-12264
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: C++, Parquet
>            Reporter: Antoine Pitrou
>            Priority: Major
>             Fix For: 9.0.0
>
>
> The Parquet spec (in parquet.thrift) says the following about handling of 
> floating-point statistics:
> {code}
>    * (*) Because the sorting order is not specified properly for floating
>    *     point values (relations vs. total ordering) the following
>    *     compatibility rules should be applied when reading statistics:
>    *     - If the min is a NaN, it should be ignored.
>    *     - If the max is a NaN, it should be ignored.
>    *     - If the min is +0, the row group may contain -0 values as well.
>    *     - If the max is -0, the row group may contain +0 values as well.
>    *     - When looking for NaN values, min and max should be ignored.
> {code}
> It appears that the dataset code uses the following filter expression when 
> doing Parquet predicate push-down (in {{file_parquet.cc}}):
> {code:c++}
>     return and_(greater_equal(field_expr, literal(min)),
>                 less_equal(field_expr, literal(max)));
> {code}
> A NaN value fails both comparisons and is therefore rejected by that filter, 
> yet NaNs may still be present in the given Parquet column chunk.
> We may instead need a "greater_equal_or_nan" comparison that returns true if 
> either operand is NaN.
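
The problem described above can be sketched in isolation, outside of the Arrow expression machinery. The following is only an illustration under stated assumptions: `FloatStats`, `NaiveMayContain` and `MayContain` are hypothetical names, not Arrow or Parquet APIs, and the checks operate on plain doubles rather than on dataset expressions. It contrasts the plain min/max range check with one that follows the parquet.thrift compatibility rules quoted in the issue.

```cpp
#include <cmath>

// Hypothetical stand-in for the min/max statistics of one column chunk.
struct FloatStats {
  double min;
  double max;
};

// Mirrors the shape of the existing filter: min <= value && value <= max.
// Every ordered comparison involving NaN is false, so a NaN value is
// always rejected, even if the chunk actually contains NaNs.
inline bool NaiveMayContain(const FloatStats& stats, double value) {
  return value >= stats.min && value <= stats.max;
}

// A NaN-aware variant following the rules quoted from parquet.thrift.
inline bool MayContain(const FloatStats& stats, double value) {
  // When looking for NaN values, min and max should be ignored.
  if (std::isnan(value)) return true;
  // A NaN min or max should be ignored, so it excludes nothing.
  if (!std::isnan(stats.min) && value < stats.min) return false;
  if (!std::isnan(stats.max) && value > stats.max) return false;
  // IEEE 754 comparisons treat +0 and -0 as equal, so the two
  // signed-zero rules hold here without any special handling.
  return true;
}
```

With statistics of min 1.0 and max 5.0, `NaiveMayContain` rejects a NaN probe while `MayContain` conservatively keeps the chunk. A real fix would presumably have to express the same "true if either operand is NaN" semantics inside the dataset filter expression.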



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
