[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

Gabor Szadovszky (Jira) Sun, 09 Oct 2022 23:31:07 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614907#comment-17614907
 ]


Gabor Szadovszky commented on PARQUET-1222:
-------------------------------------------

[~emkornfield],

There are a couple of docs in the parquet-format repo. The related ones are 
[about logical 
types|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] 
and the main one that contains the description of the [primitive 
types|https://github.com/apache/parquet-format/blob/master/README.md#types]. 
Unfortunately, the latter one does not contain anything about sorting order.
So, I think, we need to do the following:
* Define the sorting order for the primitive types or reference the logical 
types description for it. (In most cases it would be referencing since the 
ordering depends on the related logical types e.g. signed/unsigned sorting of 
integral types)
* After defining the sorting order of the primitive floating point numbers 
based on what we've discussed above, reference it from the new half-precision 
FP logical type.
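
To illustrate why the ordering of an integral primitive type depends on its logical type, here is a minimal Python sketch. The helper name {{as_unsigned32}} and the sample values are only for illustration and are not taken from parquet-format; the point is that the same stored 32-bit values sort differently, and so yield different min/max statistics, under a signed and an unsigned interpretation.

```python
import struct


def as_unsigned32(value: int) -> int:
    """Reinterpret the bits of a signed 32-bit integer as an unsigned one.

    Hypothetical helper for this illustration only.
    """
    return struct.unpack("<I", struct.pack("<i", value))[0]


# The same physical 32-bit values, written here as signed integers.
stored = [-1, 0, 1, -2147483648, 2147483647]

# Signed comparison of the stored values.
signed_order = sorted(stored)

# Unsigned comparison of the very same bit patterns.
unsigned_order = sorted(stored, key=as_unsigned32)

print(signed_order)
print(unsigned_order)
```

Under the signed interpretation the maximum is 2147483647, while under the unsigned interpretation it is the bit pattern stored as -1 (4294967295), so the statistics are only meaningful once the logical type fixes the comparison.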

(Another unfortunate thing is that we have some specification-like docs at the 
[parquet site|https://parquet.apache.org] as well. I think we should either 
propagate the parquet-format docs there automatically or simply link to them 
from the site. But that is clearly a different topic.)

> Specify a well-defined sorting order for float and double types
> ---------------------------------------------------------------
>
>                 Key: PARQUET-1222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Zoltan Ivanfi
>            Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than +0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C++ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.
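
The corner cases described in the quoted issue can be reproduced with a short Python sketch. It first shows how NaN and signed zeros behave under plain IEEE 754 comparison, then shows one possible total order built from the bit pattern of a double. The helper {{total_order_key}} is hypothetical and is not the ordering proposed for Parquet or used by any Parquet implementation; it only demonstrates that a total order is cheap to compute.

```python
import math
import struct

# Force a NaN with a cleared sign bit so its position below is deterministic.
nan = math.copysign(float("nan"), 1.0)

# IEEE 754 comparison: anything compared with NaN is false,
# and -0.0 is neither less nor greater than +0.0.
assert not (nan < 1.0) and not (nan > 1.0) and not (nan == nan)
assert not (-0.0 < 0.0) and not (-0.0 > 0.0)

# Consequence for statistics: the result depends on the input order.
min_nan_first = min([nan, 1.0])   # NaN is kept, since 1.0 < NaN is false
min_nan_last = min([1.0, nan])    # 1.0 is kept, since NaN < 1.0 is false


def total_order_key(x: float) -> int:
    """Map a double to an integer whose natural order is a total order.

    Hypothetical helper: negative values have all bits flipped, and
    non-negative values get the sign bit set, so that
    -inf < ... < -0.0 < +0.0 < ... < +inf < NaN (sign bit clear).
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    if bits & (1 << 63):
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits | (1 << 63)


values = [1.0, nan, 0.0, -0.0, -1.0, float("inf")]
ordered = sorted(values, key=total_order_key)
print(ordered)
```

With this key, -0.0 sorts strictly before +0.0 and the NaN sorts after +inf, so min and max no longer depend on the order in which values are seen. A real specification would still have to decide how NaNs with different sign bits or payloads are treated.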



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
