[
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zoltan Ivanfi updated PARQUET-1222:
-----------------------------------
Description:
Currently parquet-format specifies the sort order for floating point numbers as
follows:
{code:java}
* FLOAT - signed comparison of the represented value
* DOUBLE - signed comparison of the represented value
{code}
The problem is that the comparison of floating point numbers is only a partial
ordering with strange behaviour in specific corner cases. For example,
according to IEEE 754, -0 is neither less nor more than +0 and comparing NaN to
anything always returns false. This ordering is not suitable for statistics.
Additionally, the Java implementation already uses a different (total) ordering
that handles these cases correctly but differently than the C++
implementations, which leads to interoperability problems.
We should explicitly require implementations to follow a specific comparison
logic for these types. The candidates are:
* The [Java
implementation|http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999]
which looks easy and efficient to implement in any language.
* The [IEEE 754 totalOrder
predicate|https://github.com/rust-lang/rust/issues/5585] which is rather
complicated to the extent that it is hard to tell whether the Java
implementation adheres to it, so in effect this option may actually be the same
as the one above.
* The [IEEE 754-2008 min and max
operations|https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max] which
may be hard to use for comparison, so components could not use the same sorting
order to achieve the smallest possible min/max ranges (although a regular sort
would probably result in an almost optimal value order).
* We could simply require NaNs to be ignored for calculating min/max. However,
we should also explicitly address -0/+0 values in this case, which probably
leads to the option above.
An additional problem is how to deal with existing data:
* One possibility is to specify legacy rules, like "if the min or max is NaN,
it should be ignored" or that "-0 and +0 should be considered equal for min/max
purposes".
* Another alternative is to deprecate `min_value` and `max_value` and
introduce `yet_another_min` and `yet_another_max` fields instead (with nicer
names, naturally). This could be combined with some legacy rules for the old
field.
* Probably the best solution would be to deprecate TypeDefinedOrder for
doubles and floats and introduce a new TotalOrder. The legacy rule "if the min
or max is NaN, it should be ignored" should apply to TypeDefinedOrder while the
new TotalOrder would not have such restrictions. The default for writing
doubles and floats would be the new TotalOrder.
was:
Currently parquet-format specifies the sort order for floating point numbers as
follows:
{code}
* FLOAT - signed comparison of the represented value
* DOUBLE - signed comparison of the represented value
{code}
The problem is that the comparison of floating point numbers is only a partial
ordering with strange behaviour in specific corner cases. For example,
according to IEEE 754, -0 is neither less nor more than \+0 and comparing NaN
to anything always returns false. This ordering is not suitable for statistics.
Additionally, the Java implementation already uses a different (total) ordering
that handles these cases correctly but differently than the C\+\+
implementations, which leads to interoperability problems.
We should explicitly require implementations to follow a specific comparison
logic for these types. The candidates are:
* The [Java
implementation|http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999]
which looks easy and efficient to implement in any language.
* The [IEEE 754 totalOrder
predicate|https://github.com/rust-lang/rust/issues/5585] which is rather
complicated to the extent that it is hard to tell whether the Java
implementation adheres to it.
* The [IEEE 754-2008 min and max
operations|https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max] which
may be hard to use for comparison, so components could not use sorting to
achieve the smallest possible min/max ranges.
An additional problem is dealing with existing data. One possibility is to
specify legacy rules, like "if the stats contain NaN and the file was written
by Impala, it should be ignored", but that would complicate the specs and be a
burden on implementors. In fact, `min_value` and `max_value` were introduced
because we did not want to define similar legacy rules for `min` and `max`.
Another alternative is to deprecate `min_value` and `max_value` as well and
introduce `yet_another_min` and `yet_another_max` fields instead (with nicer
names, naturally).
> Definition of float and double sort order is ambigious
> ------------------------------------------------------
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Zoltan Ivanfi
> Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers
> as follows:
> {code:java}
> * FLOAT - signed comparison of the represented value
> * DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a
> partial ordering with strange behaviour in specific corner cases. For
> example, according to IEEE 754, -0 is neither less nor more than +0 and
> comparing NaN to anything always returns false. This ordering is not suitable
> for statistics. Additionally, the Java implementation already uses a
> different (total) ordering that handles these cases correctly but differently
> than the C++ implementations, which leads to interoperability problems.
> We should explicitly require implementations to follow a specific comparison
> logic for these types. The candidates are:
> * The [Java
> implementation|http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999]
> which looks easy and efficient to implement in any language.
> * The [IEEE 754 totalOrder
> predicate|https://github.com/rust-lang/rust/issues/5585] which is rather
> complicated to the extent that it is hard to tell whether the Java
> implementation adheres to it, so in effect this option may actually be the
> same as the one above.
> * The [IEEE 754-2008 min and max
> operations|https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max] which
> may be hard to use for comparison, so components could not use the same
> sorting order to achieve the smallest possible min/max ranges (although a
> regular sort would probably result in an almost optimal value order).
> * We could simply require NaNs to be ignored for calculating min/max.
> However, we should also explicitly address -0/+0 values in this case, which
> probably leads to the option above.
> An additional problem is how to deal with existing data:
> * One possibility is to specify legacy rules, like "if the min or max is
> NaN, it should be ignored" or that "-0 and +0 should be considered equal for
> min/max purposes".
> * Another alternative is to deprecate `min_value` and `max_value` and
> introduce `yet_another_min` and `yet_another_max` fields instead (with nicer
> names, naturally). This could be combined with some legacy rules for the old
> field.
> * Probably the best solution would be to deprecate TypeDefinedOrder for
> doubles and floats and introduce a new TotalOrder. The legacy rule "if the
> min or max is NaN, it should be ignored" should apply to TypeDefinedOrder
> while the new TotalOrder would not have such restrictions. The default for
> writing doubles and floats would be the new TotalOrder.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)