[jira] [Updated] (PARQUET-1222) Definition of float and double sort order is ambiguous

Zoltan Ivanfi (JIRA) Wed, 21 Feb 2018 07:49:38 -0800

     [ https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Zoltan Ivanfi updated PARQUET-1222:
-----------------------------------
    Description: 
Currently parquet-format specifies the sort order for floating point numbers as 
follows:
{code:java}
   *   FLOAT - signed comparison of the represented value
   *   DOUBLE - signed comparison of the represented value
{code}
The problem is that the comparison of floating point numbers is only a partial 
ordering with strange behaviour in specific corner cases. For example, 
according to IEEE 754, -0 is neither less nor more than \+0 and every ordered 
comparison involving NaN returns false. This ordering is not suitable for statistics. 
Additionally, the Java implementation already uses a different (total) ordering 
that handles these cases correctly but differently than the C\+\+ 
implementations, which leads to interoperability problems.
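
The corner cases described above can be reproduced in a few lines. The following is only an illustrative Python sketch, not code from any Parquet implementation; the helper naive_min is a hypothetical name showing how a simple running minimum gives different results depending on where a NaN appears in the data:

```python
import math

nan = float("nan")
neg_zero, pos_zero = -0.0, 0.0

# -0 and +0 compare equal, so neither is less than the other...
zeros_unordered = not (neg_zero < pos_zero) and not (neg_zero > pos_zero)
# ...even though they carry different sign bits.
zeros_differ_in_sign = (
    math.copysign(1.0, neg_zero) != math.copysign(1.0, pos_zero)
)

# Ordered and equality comparisons involving NaN are all false,
# including NaN compared with itself.
nan_comparisons = [nan < 1.0, nan > 1.0, nan == nan, 1.0 < nan, 1.0 > nan]


def naive_min(values):
    """Running minimum using plain '<' (hypothetical helper)."""
    result = values[0]
    for v in values[1:]:
        if v < result:
            result = v
    return result


# Same values, different order, different "minimum":
min_nan_first = naive_min([nan, 1.0, 2.0])  # NaN sticks, nothing is "less"
min_nan_last = naive_min([1.0, 2.0, nan])   # NaN is silently skipped
```

This order dependence is what makes min/max statistics computed with the plain comparison unreliable.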

TypeDefinedOrder for doubles and floats should be deprecated and a new 
TotalFloatingPointOrder should be introduced. The default for writing doubles 
and floats would be the new TotalFloatingPointOrder. The proposed ordering is 
the following:
 * -∞
 * negative numbers in their natural order
 * -0 and +0 in the same equivalence class \(!)
 * positive numbers in their natural order
 * +∞
 * all NaN values, including the negative ones \(!), in the same equivalence 
class \(!)

This ordering should be effective and easy to implement in all programming 
languages. A visual representation of the ordering of some example values:

!ordering.png|width=640px!
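
As a rough illustration of the proposed ordering, here is a minimal comparator sketch in Python. The function name total_order_compare is an assumption made for this example and is not part of parquet-format or of any existing implementation:

```python
import functools
import math


def total_order_compare(a: float, b: float) -> int:
    """Return -1, 0 or 1 following the ordering proposed above (sketch)."""
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan or b_nan:
        # All NaNs (any sign, any payload) form one equivalence class
        # that sorts after +infinity.
        return int(a_nan) - int(b_nan)
    # For non-NaN values the IEEE 754 comparison already gives
    # -inf < negatives < zeros < positives < +inf, with -0 == +0.
    return (a > b) - (a < b)


values = [float("nan"), math.inf, 0.0, -1.5, -math.inf, -0.0, 2.0]
ordered = sorted(values, key=functools.cmp_to_key(total_order_compare))
```

Because -0 and \+0 compare as equal here, their relative position after sorting depends only on the stability of the sort, which matches the "same equivalence class" wording of the proposal.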

For reading existing stats created using TypeDefinedOrder, the following 
compatibility rules should be applied:
* When looking for NaN values, min and max should be ignored.
* If the min is a NaN, it should be ignored.
* If the max is a NaN, it should be ignored.
* If the min is \+0, the row group may contain -0 values as well.
* If the max is -0, the row group may contain \+0 values as well.
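
One possible reading of these compatibility rules, sketched in Python. The names normalize_legacy_stats and may_contain are hypothetical and exist only for this example; a real reader would apply the same checks inside its own statistics-filtering code:

```python
import math
from typing import Optional, Tuple


def normalize_legacy_stats(
    min_value: float, max_value: float
) -> Tuple[Optional[float], Optional[float]]:
    """Apply the rules above to min/max written with TypeDefinedOrder.

    None means "bound is unusable and must be ignored".
    """
    lo = None if math.isnan(min_value) else min_value
    hi = None if math.isnan(max_value) else max_value
    if lo is not None and lo == 0.0:
        lo = -0.0  # a min of +0 may hide -0 values
    if hi is not None and hi == 0.0:
        hi = 0.0   # a max of -0 may hide +0 values
    return lo, hi


def may_contain(value: float, min_value: float, max_value: float) -> bool:
    """Return False only if the row group can be skipped for this value."""
    if math.isnan(value):
        return True  # min and max say nothing reliable about NaN
    lo, hi = normalize_legacy_stats(min_value, max_value)
    if lo is not None and value < lo:
        return False
    if hi is not None and value > hi:
        return False
    return True
```

The explicit widening of the zero bounds has no effect on the numeric comparison used in may_contain, but it matters for readers that compare bit patterns or distinguish the sign of zero.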

  was:
Currently parquet-format specifies the sort order for floating point numbers as 
follows:
{code:java}
   *   FLOAT - signed comparison of the represented value
   *   DOUBLE - signed comparison of the represented value
{code}
The problem is that the comparison of floating point numbers is only a partial 
ordering with strange behaviour in specific corner cases. For example, 
according to IEEE 754, -0 is neither less nor more than \+0 and every ordered 
comparison involving NaN returns false. This ordering is not suitable for statistics. 
Additionally, the Java implementation already uses a different (total) ordering 
that handles these cases correctly but differently than the C\+\+ 
implementations, which leads to interoperability problems.

TypeDefinedOrder for doubles and floats should be deprecated and a new 
TotalFloatingPointOrder should be introduced. The default for writing doubles 
and floats would be the new TotalFloatingPointOrder. The proposed ordering is 
the following:
 * -∞
 * negative numbers in their natural order
 * -0 and +0 in the same equivalence class \(!)
 * positive numbers in their natural order
 * +∞
 * all NaN values, including the negative ones \(!), in the same equivalence 
class \(!)

This ordering should be effective and easy to implement in all programming 
languages. A visual representation of the ordering of some example values:

!ordering.png|width=500px!

For reading existing stats created using TypeDefinedOrder, the following 
compatibility rules should be applied:
* When looking for NaN values, min and max should be ignored.
* If the min is a NaN, it should be ignored.
* If the max is a NaN, it should be ignored.
* If the min is \+0, the row group may contain -0 values as well.
* If the max is -0, the row group may contain \+0 values as well.


> Definition of float and double sort order is ambiguous
> ------------------------------------------------------
>
>                 Key: PARQUET-1222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Zoltan Ivanfi
>            Priority: Critical
>             Fix For: format-2.5.0
>
>         Attachments: ordering.png
>
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and every 
> ordered comparison involving NaN returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
>
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. The proposed ordering is 
> the following:
>  * -∞
>  * negative numbers in their natural order
>  * -0 and +0 in the same equivalence class \(!)
>  * positive numbers in their natural order
>  * +∞
>  * all NaN values, including the negative ones \(!), in the same equivalence 
> class \(!)
>
> This ordering should be effective and easy to implement in all programming 
> languages. A visual representation of the ordering of some example values:
>
> !ordering.png|width=640px!
>
> For reading existing stats created using TypeDefinedOrder, the following 
> compatibility rules should be applied:
> * When looking for NaN values, min and max should be ignored.
> * If the min is a NaN, it should be ignored.
> * If the max is a NaN, it should be ignored.
> * If the min is \+0, the row group may contain -0 values as well.
> * If the max is -0, the row group may contain \+0 values as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)