To supply some context: Impala has had a number of issues
<https://issues.apache.org/jira/issues/?jql=project%3Dimpala%20and%20summary%20%20~%20NaN>
around NaN/infinity:

The closest precedent related to the current issue seems to be IMPALA-6295
<https://issues.apache.org/jira/browse/IMPALA-6295>: "Inconsistent handling
of 'nan' and 'inf' with min/max analytic fns": the discussion there offers
notable points on:
1. How Impala handles similar problems in different (but related) areas,
2. How other database products (Hive, PostgeSQL, etc.) handle similar
issues around NaNs/infinity (or infinities, in the case of IEEE-754).

Thanks,

    - LaszloG


On Thu, Feb 15, 2018 at 5:10 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:

> Dear Parquet and Impala Developers,
>
> We have exposed min/max statistics to extensive compatibility testing and
> found troubling inconsistencies regarding float and double values. Under
> certain (fortunately rather extreme) circumstances, this can lead to
> predicate pushdown incorrectly discarding row groups that contain matching
> rows.
>
> The root of the problem seems to be that Impala (and probably parquet-cpp
> as well) uses C++ comparison operators for floating point numbers and those
> do not provide a total ordering. This is actually in line with IEEE 754,
> according to which -0 is neither less nor more than +0 and comparing NaN to
> anything always returns false. This, however is not suitable for statistics
> and can lead to serious consequences that you can read more about in
> IMPALA-6527 <https://issues.apache.org/jira/browse/IMPALA-6527>.
>
> The IEEE 754 standard and the Java API, on the other hand, both provide a
> total ordering, but I'm not sure whether the two are the same. The java
> implementation looks relatively simple - both easy to understand and
> effective to execute. You can check it here
> <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/
> classes/java/lang/Double.java#l999>.
> The IEEE 754 total ordering, on the other hand looks rather complicated to
> the extent that I can not decide whether the Java implementation adheres to
> it. I couldn't find the whole standard online, but I found an excerpt about
> the totalOrder predicate here
> <https://github.com/rust-lang/rust/issues/5585>. Additionally, I have also
> found that IEEE 754-2008 defines min and max operations as described here
> <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> that
> strangely *do not* adhere to a total ordering.
>
> I checked the specification in parquet-format but all I could find about
> floating point numbers is the following:
>
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
>
> I suggest extending the specification to explicitly require implementations
> to follow a specific comparison logic for these types. The candidates are:
>
>    - The Java implementation
>    <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/
> classes/java/lang/Double.java#l999>
>    which looks easy and efficient to implement in any language.
>    - The IEEE 754 totalOrder <https://github.com/rust-lang/
> rust/issues/5585>
>    predicate which honestly looks rather scary.
>    - The IEEE 754-2008 min and max
>    <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max>
> operations
>    which may be hard to use for comparison.
>
> I'm curious to hear your opinions.
>
> Thanks,
>
> Zoltan
>

Reply via email to