To supply some context: Impala has had a number of issues <https://issues.apache.org/jira/issues/?jql=project%3Dimpala%20and%20summary%20%20~%20NaN> around NaN/infinity:
The closest precedent related to the current issue seems to be IMPALA-6295 <https://issues.apache.org/jira/browse/IMPALA-6295>: "Inconsistent handling of 'nan' and 'inf' with min/max analytic fns": the discussion there offers notable points on: 1. How Impala handles similar problems in different (but related) areas, 2. How other database products (Hive, PostgeSQL, etc.) handle similar issues around NaNs/infinity (or infinities, in the case of IEEE-754). Thanks, - LaszloG On Thu, Feb 15, 2018 at 5:10 PM, Zoltan Ivanfi <z...@cloudera.com> wrote: > Dear Parquet and Impala Developers, > > We have exposed min/max statistics to extensive compatibility testing and > found troubling inconsistencies regarding float and double values. Under > certain (fortunately rather extreme) circumstances, this can lead to > predicate pushdown incorrectly discarding row groups that contain matching > rows. > > The root of the problem seems to be that Impala (and probably parquet-cpp > as well) uses C++ comparison operators for floating point numbers and those > do not provide a total ordering. This is actually in line with IEEE 754, > according to which -0 is neither less nor more than +0 and comparing NaN to > anything always returns false. This, however is not suitable for statistics > and can lead to serious consequences that you can read more about in > IMPALA-6527 <https://issues.apache.org/jira/browse/IMPALA-6527>. > > The IEEE 754 standard and the Java API, on the other hand, both provide a > total ordering, but I'm not sure whether the two are the same. The java > implementation looks relatively simple - both easy to understand and > effective to execute. You can check it here > <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/ > classes/java/lang/Double.java#l999>. > The IEEE 754 total ordering, on the other hand looks rather complicated to > the extent that I can not decide whether the Java implementation adheres to > it. I couldn't find the whole standard online, but I found an excerpt about > the totalOrder predicate here > <https://github.com/rust-lang/rust/issues/5585>. Additionally, I have also > found that IEEE 754-2008 defines min and max operations as described here > <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> that > strangely *do not* adhere to a total ordering. > > I checked the specification in parquet-format but all I could find about > floating point numbers is the following: > > * FLOAT - signed comparison of the represented value > * DOUBLE - signed comparison of the represented value > > I suggest extending the specification to explicitly require implementations > to follow a specific comparison logic for these types. The candidates are: > > - The Java implementation > <http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/ > classes/java/lang/Double.java#l999> > which looks easy and efficient to implement in any language. > - The IEEE 754 totalOrder <https://github.com/rust-lang/ > rust/issues/5585> > predicate which honestly looks rather scary. > - The IEEE 754-2008 min and max > <https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max> > operations > which may be hard to use for comparison. > > I'm curious to hear your opinions. > > Thanks, > > Zoltan >