[ https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Ivanfi updated PARQUET-1222:
-----------------------------------
    Description: 
Currently parquet-format specifies the sort order for floating point numbers as 
follows:
{code:java}
   *   FLOAT - signed comparison of the represented value
   *   DOUBLE - signed comparison of the represented value
{code}
The problem is that the comparison of floating point numbers is only a partial
ordering with strange behaviour in specific corner cases. For example,
according to IEEE 754, -0 is neither less than nor greater than +0, and any
comparison with NaN using <, > or == returns false. This ordering is not
suitable for statistics. Additionally, the Java implementation already uses a
different (total) ordering that handles these cases correctly but differently
from the C++ implementations, which leads to interoperability problems.
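
The corner cases above can be reproduced with a minimal Java sketch. The class
and method names are hypothetical and are not part of any Parquet library.
Java's standard Double.compare is shown only as one example of a total
ordering; that it matches what the Java Parquet implementation does is an
assumption, not something stated in this issue.

```java
final class FloatComparisonDemo {
    // IEEE 754 "less than", as exposed by Java's primitive < operator.
    static boolean ieeeLess(double a, double b) {
        return a < b;
    }

    // IEEE 754 equality, as exposed by Java's primitive == operator.
    static boolean ieeeEqual(double a, double b) {
        return a == b;
    }

    public static void main(String[] args) {
        // -0 is neither less than nor greater than +0; the two compare equal.
        System.out.println(ieeeLess(-0.0, 0.0));   // false
        System.out.println(ieeeLess(0.0, -0.0));   // false
        System.out.println(ieeeEqual(-0.0, 0.0));  // true

        // NaN is unordered: <, > and == all return false, even against itself.
        System.out.println(ieeeLess(Double.NaN, 1.0));          // false
        System.out.println(ieeeLess(1.0, Double.NaN));          // false
        System.out.println(ieeeEqual(Double.NaN, Double.NaN));  // false

        // Double.compare is a total order: -0.0 sorts before +0.0 and NaN
        // sorts after every other value, including +Infinity.
        System.out.println(Double.compare(-0.0, 0.0) < 0);                            // true
        System.out.println(Double.compare(Double.NaN, Double.POSITIVE_INFINITY) > 0); // true
    }
}
```

A min/max computed with the primitive operators therefore depends on the order
in which values are seen, while one computed with a total order does not.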

TypeDefinedOrder for doubles and floats should be deprecated and a new 
TotalFloatingPointOrder should be introduced. The default for writing doubles 
and floats would be the new TotalFloatingPointOrder. The proposed ordering is 
the following:
 * -∞
 * negative numbers in their natural order
 * -0 and +0 in the same equivalence class (!)
 * positive numbers in their natural order
 * +∞
 * all NaN values, including the negative ones (!), in the same equivalence
class (!)

This ordering should be efficient and easy to implement in all programming
languages.
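
One way the proposed ordering could be implemented for doubles is sketched
below in Java. The class and method names are hypothetical; this is an
illustration of the ordering listed above, not the parquet-format definition
or any existing implementation.

```java
final class TotalFloatingPointOrderSketch {
    /**
     * Compares two doubles under the proposed ordering:
     * -Infinity < negatives < {-0, +0} < positives < +Infinity < {all NaNs}.
     * Returns a negative value, zero, or a positive value.
     */
    static int compare(double a, double b) {
        boolean aIsNaN = Double.isNaN(a);
        boolean bIsNaN = Double.isNaN(b);
        if (aIsNaN || bIsNaN) {
            // All NaNs, whatever their sign bit or payload, form one
            // equivalence class that sorts after +Infinity.
            if (aIsNaN && bIsNaN) {
                return 0;
            }
            return aIsNaN ? 1 : -1;
        }
        // For non-NaN values the IEEE 754 operators give the natural order,
        // with the infinities at the two ends.
        if (a < b) {
            return -1;
        }
        if (a > b) {
            return 1;
        }
        // Equal values, including -0.0 versus +0.0, which share a class.
        return 0;
    }
}
```

Because -0 and +0 (and all NaNs) compare as equal, this is a total order over
equivalence classes rather than over bit patterns. A float version would look
the same with Float.isNaN.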

For reading existing stats created using TypeDefinedOrder, the following 
compatibility rules should be applied:
* When looking for NaN values, min and max should be ignored.
* If the min is a NaN, it should be ignored.
* If the max is a NaN, it should be ignored.
* If the min is +0, the row group may contain -0 values as well.
* If the max is -0, the row group may contain +0 values as well.
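
The compatibility rules above can be sketched as a row-group pruning check in
Java. The class, method and parameter names are hypothetical and do not
correspond to any Parquet reader API; the sketch assumes an equality predicate
and that min and max are both present in the legacy statistics.

```java
final class LegacyFloatStatsSketch {
    /**
     * Returns true if a row group whose statistics were written using
     * TypeDefinedOrder might contain a value equal to target, and false only
     * if the statistics safely rule it out.
     */
    static boolean mightContain(double min, double max, double target) {
        // When looking for NaN values, min and max are ignored.
        if (Double.isNaN(target)) {
            return true;
        }
        // A NaN min or max is ignored, leaving that side unbounded.
        boolean minUsable = !Double.isNaN(min);
        boolean maxUsable = !Double.isNaN(max);

        // The primitive operators treat -0.0 and +0.0 as equal, so a min of
        // +0 does not exclude -0 and a max of -0 does not exclude +0.
        if (minUsable && target < min) {
            return false;
        }
        if (maxUsable && target > max) {
            return false;
        }
        return true;
    }
}
```

For example, mightContain(0.0, 5.0, -0.0) keeps the row group, and
mightContain(Double.NaN, 2.0, 3.0) still prunes it because the max is usable.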

  was:
Currently parquet-format specifies the sort order for floating point numbers as 
follows:
{code:java}
   *   FLOAT - signed comparison of the represented value
   *   DOUBLE - signed comparison of the represented value
{code}
The problem is that the comparison of floating point numbers is only a partial 
ordering with strange behaviour in specific corner cases. For example, 
according to IEEE 754, -0 is neither less nor more than +0 and comparing NaN to 
anything always returns false. This ordering is not suitable for statistics. 
Additionally, the Java implementation already uses a different (total) ordering 
that handles these cases correctly but differently than the C++ 
implementations, which leads to interoperability problems.

We should explicitly require implementations to follow a specific comparison 
logic for these types. The proposed ordering is the following:
 * -∞
 * negative numbers in their natural order
 * -0 and +0 in the same equivalence class (!)
 * positive numbers in their natural order
 * +∞
 * all NaN values, including the negative ones (!), in the same equivalence 
class (!)

This ordering should be effective and easy to implement in all programming 
languages.


> Definition of float and double sort order is ambiguous
> ------------------------------------------------------
>
>                 Key: PARQUET-1222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Zoltan Ivanfi
>            Priority: Critical
>             Fix For: format-2.5.0
>
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a
> partial ordering with strange behaviour in specific corner cases. For
> example, according to IEEE 754, -0 is neither less than nor greater than +0,
> and any comparison with NaN using <, > or == returns false. This ordering is
> not suitable for statistics. Additionally, the Java implementation already
> uses a different (total) ordering that handles these cases correctly but
> differently from the C++ implementations, which leads to interoperability
> problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. The proposed ordering is 
> the following:
>  * -∞
>  * negative numbers in their natural order
>  * -0 and +0 in the same equivalence class (!)
>  * positive numbers in their natural order
>  * +∞
>  * all NaN values, including the negative ones (!), in the same equivalence
> class (!)
> This ordering should be efficient and easy to implement in all programming
> languages.
> For reading existing stats created using TypeDefinedOrder, the following 
> compatibility rules should be applied:
> * When looking for NaN values, min and max should be ignored.
> * If the min is a NaN, it should be ignored.
> * If the max is a NaN, it should be ignored.
> * If the min is +0, the row group may contain -0 values as well.
> * If the max is -0, the row group may contain +0 values as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
