[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611398#comment-17611398
 ] 

Gabor Szadovszky commented on PARQUET-1222:
-------------------------------------------

[~emkornfield], I think we do not need to handle NaN values with a boolean to 
fix this issue. NaN is somewhat similar to null values, so we could even count 
them instead of having a boolean, but this question is not tightly related to 
this topic.
What do you think about elevating the current suggestion in the thrift file to 
specification level for writing/reading FP min/max values?
{quote}Because the sorting order is not specified properly for floating point 
values (relations vs. total ordering) the following compatibility rules should 
be applied when reading statistics:
* If the min is a NaN, it should be ignored.
* If the max is a NaN, it should be ignored.
* If the min is +0, the row group may contain -0 values as well.
* If the max is -0, the row group may contain +0 values as well.
* When looking for NaN values, min and max should be ignored.{quote}
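To illustrate, here is a minimal Python sketch of the reading rules quoted above. The function names usable_bounds and may_contain are hypothetical and do not correspond to any existing Parquet API; "ignored" is modelled as None, meaning the bound is unknown.

```python
import math
from typing import Optional, Tuple


def usable_bounds(stat_min: float, stat_max: float) -> Tuple[Optional[float], Optional[float]]:
    """Return (lower, upper) bounds that are safe to use; None means 'ignore this bound'.

    Hypothetical helper, not part of any Parquet library.
    """
    # A NaN min or max carries no usable information and is ignored.
    lower = None if math.isnan(stat_min) else stat_min
    upper = None if math.isnan(stat_max) else stat_max

    # A +0 min may hide -0 values, so widen the lower bound to -0.
    if lower is not None and lower == 0.0 and math.copysign(1.0, lower) > 0:
        lower = -0.0
    # A -0 max may hide +0 values, so widen the upper bound to +0.
    if upper is not None and upper == 0.0 and math.copysign(1.0, upper) < 0:
        upper = 0.0
    return lower, upper


def may_contain(value: float, stat_min: float, stat_max: float) -> bool:
    """Return False only if the statistics prove that value is absent."""
    # Min and max say nothing about NaN, so never prune when looking for NaN.
    if math.isnan(value):
        return True
    lower, upper = usable_bounds(stat_min, stat_max)
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True
```

With plain IEEE 754 comparison -0 and +0 already compare equal, so the widening of the zero bounds only matters for readers that compare with a sign-aware (total) ordering.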
For writing, we shall skip NaN values and use -0 for min and +0 for max 
whenever a zero is to be taken into account.
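
The writing rule could look roughly like the following Python sketch. The name write_min_max is hypothetical, and returning None for "no statistics" is an assumption made only for this example.

```python
import math
from typing import Iterable, Optional, Tuple


def write_min_max(values: Iterable[float]) -> Optional[Tuple[float, float]]:
    """Compute (min, max) statistics for a column chunk of floating point values.

    Hypothetical helper; returns None when there is no non-NaN value.
    """
    lo: Optional[float] = None
    hi: Optional[float] = None
    for v in values:
        # NaN values are skipped and never become min or max.
        if math.isnan(v):
            continue
        if lo is None or v < lo:
            lo = v
        if hi is None or v > hi:
            hi = v
    if lo is None or hi is None:
        return None
    # Whenever a zero ends up as a bound, write -0 for min and +0 for max,
    # regardless of the sign of the zero that was actually seen.
    if lo == 0.0:
        lo = -0.0
    if hi == 0.0:
        hi = 0.0
    return lo, hi
```

For example, a chunk containing only +0, 1 and NaN would get min = -0 and max = 1, so a reader never has to guess which zero was present.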

With this solution we cannot do anything clever when searching for a NaN, 
but that can be fixed separately. We also need to double-check whether we 
really ignore the min/max stats when searching for a NaN.

I think it is a good idea to discuss such topics on the mailing list. However, 
we should also time-box the discussion and go forward with a proposed solution 
if there is no interest on the mailing list. (Personally, I do not follow the 
dev list anymore.)


> Specify a well-defined sorting order for float and double types
> ---------------------------------------------------------------
>
>                 Key: PARQUET-1222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Zoltan Ivanfi
>            Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than +0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C++ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)