[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1222:
--------------------------------------
    Fix Version/s: format-2.5.0

> Definition of float and double sort order is ambiguous
> ------------------------------------------------------
>
>                 Key: PARQUET-1222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Zoltan Ivanfi
>            Priority: Critical
>             Fix For: format-2.5.0
>
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a
> partial ordering with strange behaviour in specific corner cases. For
> example, according to IEEE 754, -0 is neither less nor more than +0, and
> any ordered comparison involving NaN returns false. This ordering is not
> suitable for statistics. Additionally, the Java implementation already uses
> a different (total) ordering that handles these cases correctly but
> differently than the C++ implementations, which leads to interoperability
> problems.
> We should explicitly require implementations to follow a specific comparison 
> logic for these types. The candidates are:
>  * The [Java 
> implementation|http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999]
>  which looks easy and efficient to implement in any language.
>  * The [IEEE 754 totalOrder 
> predicate|https://github.com/rust-lang/rust/issues/5585] which is rather 
> complicated to the extent that it is hard to tell whether the Java 
> implementation adheres to it, so in effect this option may actually be the 
> same as the one above.
>  * The [IEEE 754-2008 min and max 
> operations|https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max] which 
> may be hard to use for comparison, so components could not use the same 
> sorting order to achieve the smallest possible min/max ranges (although a 
> regular sort would probably result in an almost optimal value order).
>  * We could simply require NaNs to be ignored for calculating min/max. 
> However, we should also explicitly address -0/+0 values in this case, which 
> probably leads to the option above.
> An additional problem is how to deal with existing data:
>  * One possibility is to specify legacy rules, like "if the min or max is 
> NaN, it should be ignored" or that "-0 and +0 should be considered equal for 
> min/max purposes".
>  * Another alternative is to deprecate `min_value` and `max_value` and 
> introduce `yet_another_min` and `yet_another_max` fields instead (with nicer 
> names, naturally). This could be combined with some legacy rules for the old 
> field.
>  * Probably the best solution would be to deprecate TypeDefinedOrder for 
> doubles and floats and introduce a new TotalOrder. The legacy rule "if the 
> min or max is NaN, it should be ignored" should apply to TypeDefinedOrder 
> while the new TotalOrder would not have such restrictions. The default for 
> writing doubles and floats would be the new TotalOrder.

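To make the difference between the two comparison behaviours described in the issue concrete, here is a minimal Java sketch. It contrasts the primitive comparison operators (which follow IEEE 754 and form only a partial order) with java.lang.Double.compare (the total order the issue refers to as the Java implementation). The class and method names (FloatOrderDemo, ieeeLess, totalLess) are made up for illustration and are not part of any Parquet code base.

```java
class FloatOrderDemo {
    // IEEE 754 style comparison through Java's primitive operator.
    // This is only a partial order: NaN is unordered, and -0.0 equals +0.0.
    static boolean ieeeLess(double a, double b) {
        return a < b;
    }

    // Total order as defined by java.lang.Double.compare:
    // -0.0 sorts before +0.0, and NaN sorts after every other value
    // (including positive infinity) and is equal to itself.
    static boolean totalLess(double a, double b) {
        return Double.compare(a, b) < 0;
    }

    public static void main(String[] args) {
        double nan = Double.NaN;

        // Partial order: the corner cases from the issue description.
        System.out.println(ieeeLess(-0.0, 0.0));   // false
        System.out.println(ieeeLess(0.0, -0.0));   // false
        System.out.println(ieeeLess(nan, 1.0));    // false
        System.out.println(ieeeLess(1.0, nan));    // false

        // Total order: every pair of values is ordered.
        System.out.println(totalLess(-0.0, 0.0));                      // true
        System.out.println(totalLess(Double.POSITIVE_INFINITY, nan));  // true
        System.out.println(Double.compare(nan, nan) == 0);             // true
    }
}
```

A writer that computes min/max with the first method and a reader that interprets them with the second can disagree on exactly these inputs, which is the interoperability problem the issue describes.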

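The legacy rules quoted in the issue ("if the min or max is NaN, it should be ignored" and "-0 and +0 should be considered equal for min/max purposes") could be applied on the read side roughly as follows. This is only a sketch under assumptions: the class LegacyFloatStats and its methods are hypothetical, are not taken from parquet-mr or parquet-cpp, and ignore everything about statistics other than a single min/max pair.

```java
final class LegacyFloatStats {
    // Legacy rule: statistics whose min or max is NaN carry no usable
    // information and should be ignored.
    static boolean usable(double min, double max) {
        return !Double.isNaN(min) && !Double.isNaN(max);
    }

    // Conservative filter: returns false only when the value certainly
    // cannot occur in the data described by [min, max].
    static boolean mightContain(double min, double max, double value) {
        if (!usable(min, max)) {
            return true;  // stats ignored, so nothing can be ruled out
        }
        if (Double.isNaN(value)) {
            return true;  // NaNs may have been skipped when stats were computed
        }
        // Primitive comparisons treat -0.0 and +0.0 as equal, which matches
        // the second legacy rule.
        return min <= value && value <= max;
    }

    public static void main(String[] args) {
        System.out.println(mightContain(Double.NaN, 5.0, 100.0)); // true
        System.out.println(mightContain(1.0, 5.0, 100.0));        // false
        System.out.println(mightContain(0.0, 5.0, -0.0));         // true
    }
}
```

The design choice here is to err on the side of reading too much data: whenever the legacy statistics are ambiguous, the filter declines to skip anything.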

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
