Hi,

I don't think it would be worth keeping a separate NaN count, but we could
ignore NaNs when calculating min/max stats regardless. However, NaN is not
the only value that prevents a total ordering. Signed zeros cause a similar
problem: a reader could fail to find +0.0 values if a component that
considers -0.0 and +0.0 equal writes -0.0 to the max_value field.
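
To illustrate the signed-zero case, here is a minimal Python sketch. The
helper names (min_max_ignoring_nan, total_order_key) are made up for this
example and are not part of any Parquet or Impala API. It assumes a writer
that uses the plain comparison operators and a reader that uses a
bit-pattern total order in which -0.0 sorts before +0.0.

```python
import math
import struct


def min_max_ignoring_nan(values):
    """Return (min, max) over the non-NaN values, or None if there are none.

    Uses the ordinary comparison operators, which treat -0.0 and +0.0 as equal.
    """
    non_nan = [v for v in values if not math.isnan(v)]
    if not non_nan:
        return None
    return min(non_nan), max(non_nan)


def total_order_key(x):
    """Integer key for a hypothetical reader that orders -0.0 before +0.0."""
    bits = struct.unpack(">q", struct.pack(">d", x))[0]
    # For negative doubles, flip the 63 magnitude bits so that the signed
    # integer order matches the numeric order.
    return bits ^ 0x7FFFFFFFFFFFFFFF if bits < 0 else bits


# Hypothetical column chunk containing both zeros and a NaN.
lo, hi = min_max_ignoring_nan([-0.0, float("nan"), 0.0])

# The writer keeps the first zero it saw, so max_value ends up as -0.0.
max_is_negative_zero = math.copysign(1.0, hi) == -1.0

# A total-order reader looking for +0.0 would conclude that +0.0 > max_value
# and skip the row group, even though it contains +0.0.
reader_would_skip = total_order_key(0.0) > total_order_key(hi)
```

Ignoring NaN alone therefore does not make the stats safe: the writer and
the reader also have to agree on how the two zeros are ordered.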

One more thing to consider is how to deal with existing data when we update
the specs. One possibility is to specify legacy rules, like "if the stats
contain NaN and the file was written by Impala, it should be ignored", but
that would complicate the specs and be a burden on implementors. In fact,
min_value and max_value were introduced because we did not want to define
similar legacy rules for min and max. So should we deprecate min_value and
max_value as well and introduce yet_another_min and yet_another_max fields
instead (with nicer names, naturally)?

Br,

Zoltan

On Thu, Feb 15, 2018 at 8:01 PM Tim Armstrong <tarmstr...@cloudera.com>
wrote:

> We could also consider treating NaN similar to NULL and having a separate
> piece of information with a count of NaN values (or just a bit indicating
> presence/absence of NaN). I'm not sure if that is easier or harder to
> implement than a total order.
>
> On Thu, Feb 15, 2018 at 9:12 AM, Laszlo Gaal <laszlo.g...@cloudera.com>
> wrote:
>
> > To supply some context: Impala has had a number of issues
> > <https://issues.apache.org/jira/issues/?jql=project%3Dimpala%20and%20summary%20%20~%20NaN>
> > around NaN/infinity:
> >
> > The closest precedent related to the current issue seems to be IMPALA-6295
> > <https://issues.apache.org/jira/browse/IMPALA-6295>: "Inconsistent handling
> > of 'nan' and 'inf' with min/max analytic fns": the discussion there offers
> > notable points on:
> > 1. How Impala handles similar problems in different (but related) areas,
> > 2. How other database products (Hive, PostgreSQL, etc.) handle similar
> > issues around NaNs/infinity (or infinities, in the case of IEEE-754).
> >
> > Thanks,
> >
> >     - LaszloG
> >
> >
> > On Thu, Feb 15, 2018 at 5:10 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:
> >
> > > Dear Parquet and Impala Developers,
> > >
> > > We have exposed min/max statistics to extensive compatibility testing and
> > > found troubling inconsistencies regarding float and double values. Under
> > > certain (fortunately rather extreme) circumstances, this can lead to
> > > predicate pushdown incorrectly discarding row groups that contain matching
> > > rows.
> > >
> > > The root of the problem seems to be that Impala (and probably parquet-cpp
> > > as well) uses C++ comparison operators for floating point numbers and those
> > > do not provide a total ordering. This is actually in line with IEEE 754,
> > > according to which -0 is neither less nor more than +0 and comparing NaN to
> > > anything with <, <=, >, >= or == always returns false. This, however, is
> > > not suitable for statistics and can lead to serious consequences that you
> > > can read more about in
> > > IMPALA-6527 <https://issues.apache.org/jira/browse/IMPALA-6527>.
> > >
> > > The IEEE 754 standard and the Java API, on the other hand, both provide a
> > > total ordering, but I'm not sure whether the two are the same. The Java
> > > implementation looks relatively simple: both easy to understand and
> > > efficient to execute. You can check it here
> > > <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999>.
> > > The IEEE 754 total ordering, on the other hand, looks rather complicated, to
> > > the extent that I cannot decide whether the Java implementation adheres to
> > > it. I couldn't find the whole standard online, but I found an excerpt about
> > > the totalOrder predicate here
> > > <https://github.com/rust-lang/rust/issues/5585>. I have also found that
> > > IEEE 754-2008 defines min and max operations as described here
> > > <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> that
> > > strangely *do not* adhere to a total ordering.
> > >
> > > I checked the specification in parquet-format but all I could find about
> > > floating point numbers is the following:
> > >
> > >    *   FLOAT - signed comparison of the represented value
> > >    *   DOUBLE - signed comparison of the represented value
> > >
> > > I suggest extending the specification to explicitly require implementations
> > > to follow a specific comparison logic for these types. The candidates are:
> > >
> > >    - The Java implementation
> > >    <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999>
> > >    which looks easy and efficient to implement in any language.
> > >    - The IEEE 754 totalOrder
> > >    <https://github.com/rust-lang/rust/issues/5585>
> > >    predicate which honestly looks rather scary.
> > >    - The IEEE 754-2008 min and max
> > >    <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max>
> > >    operations which may be hard to use for comparison.
> > >
> > > I'm curious to hear your opinions.
> > >
> > > Thanks,
> > >
> > > Zoltan
> > >
> >
>
