We could also consider treating NaN similar to NULL and having a separate
piece of information with a count of NaN values (or just a bit indicating
presence/absence of NaN). I'm not sure if that is easier or harder to
implement than a total order.

On Thu, Feb 15, 2018 at 9:12 AM, Laszlo Gaal <laszlo.g...@cloudera.com>
wrote:

> To supply some context: Impala has had a number of issues
> <https://issues.apache.org/jira/issues/?jql=project%
> 3Dimpala%20and%20summary%20%20~%20NaN>
> around NaN/infinity:
>
> The closest precedent related to the current issue seems to be IMPALA-6295
> <https://issues.apache.org/jira/browse/IMPALA-6295>: "Inconsistent
> handling
> of 'nan' and 'inf' with min/max analytic fns": the discussion there offers
> notable points on:
> 1. How Impala handles similar problems in different (but related) areas,
> 2. How other database products (Hive, PostgeSQL, etc.) handle similar
> issues around NaNs/infinity (or infinities, in the case of IEEE-754).
>
> Thanks,
>
>     - LaszloG
>
>
> On Thu, Feb 15, 2018 at 5:10 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>
> > Dear Parquet and Impala Developers,
> >
> > We have exposed min/max statistics to extensive compatibility testing and
> > found troubling inconsistencies regarding float and double values. Under
> > certain (fortunately rather extreme) circumstances, this can lead to
> > predicate pushdown incorrectly discarding row groups that contain
> matching
> > rows.
> >
> > The root of the problem seems to be that Impala (and probably parquet-cpp
> > as well) uses C++ comparison operators for floating point numbers and
> those
> > do not provide a total ordering. This is actually in line with IEEE 754,
> > according to which -0 is neither less nor more than +0 and comparing NaN
> to
> > anything always returns false. This, however is not suitable for
> statistics
> > and can lead to serious consequences that you can read more about in
> > IMPALA-6527 <https://issues.apache.org/jira/browse/IMPALA-6527>.
> >
> > The IEEE 754 standard and the Java API, on the other hand, both provide a
> > total ordering, but I'm not sure whether the two are the same. The java
> > implementation looks relatively simple - both easy to understand and
> > effective to execute. You can check it here
> > <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/
> > classes/java/lang/Double.java#l999>.
> > The IEEE 754 total ordering, on the other hand looks rather complicated
> to
> > the extent that I can not decide whether the Java implementation adheres
> to
> > it. I couldn't find the whole standard online, but I found an excerpt
> about
> > the totalOrder predicate here
> > <https://github.com/rust-lang/rust/issues/5585>. Additionally, I have
> also
> > found that IEEE 754-2008 defines min and max operations as described here
> > <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> that
> > strangely *do not* adhere to a total ordering.
> >
> > I checked the specification in parquet-format but all I could find about
> > floating point numbers is the following:
> >
> >    *   FLOAT - signed comparison of the represented value
> >    *   DOUBLE - signed comparison of the represented value
> >
> > I suggest extending the specification to explicitly require
> implementations
> > to follow a specific comparison logic for these types. The candidates
> are:
> >
> >    - The Java implementation
> >    <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/
> 687fd7c7986d/src/share/
> > classes/java/lang/Double.java#l999>
> >    which looks easy and efficient to implement in any language.
> >    - The IEEE 754 totalOrder <https://github.com/rust-lang/
> > rust/issues/5585>
> >    predicate which honestly looks rather scary.
> >    - The IEEE 754-2008 min and max
> >    <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max>
> > operations
> >    which may be hard to use for comparison.
> >
> > I'm curious to hear your opinions.
> >
> > Thanks,
> >
> > Zoltan
> >
>

Reply via email to