[
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611398#comment-17611398
]
Gabor Szadovszky commented on PARQUET-1222:
-------------------------------------------
[~emkornfield], I think we do not need to handle NaN values with a boolean to
fix this issue. NaN values are similar to null values, so we could even count
them instead of storing a boolean, but that question is not closely tied to
this topic.
What do you think about elevating the current suggestion in the thrift file to
specification level for writing/reading FP min/max values?
{quote}Because the sorting order is not specified properly for floating point
values (relations vs. total ordering) the following compatibility rules should
be applied when reading statistics:
* If the min is a NaN, it should be ignored.
* If the max is a NaN, it should be ignored.
* If the min is +0, the row group may contain -0 values as well.
* If the max is -0, the row group may contain +0 values as well.
* When looking for NaN values, min and max should be ignored.{quote}
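To make the reading rules concrete, here is a minimal sketch of how a reader could apply them when deciding whether a row group can be skipped. The class and method names ({{FloatStatsReader}}, {{effectiveMin}}, {{effectiveMax}}, {{mightContain}}) are hypothetical and are not taken from parquet-mr or any other implementation.

```java
import java.util.OptionalDouble;

/** Hypothetical helper illustrating the reading rules; not an existing Parquet API. */
final class FloatStatsReader {
    private FloatStatsReader() {}

    /** Returns a usable lower bound, or empty if the stored min must be ignored. */
    static OptionalDouble effectiveMin(double storedMin) {
        if (Double.isNaN(storedMin)) {
            return OptionalDouble.empty();      // rule: a NaN min is ignored
        }
        if (storedMin == 0.0) {                 // true for both +0.0 and -0.0
            return OptionalDouble.of(-0.0);     // rule: a +0 min may hide -0 values
        }
        return OptionalDouble.of(storedMin);
    }

    /** Returns a usable upper bound, or empty if the stored max must be ignored. */
    static OptionalDouble effectiveMax(double storedMax) {
        if (Double.isNaN(storedMax)) {
            return OptionalDouble.empty();      // rule: a NaN max is ignored
        }
        if (storedMax == 0.0) {                 // true for both +0.0 and -0.0
            return OptionalDouble.of(0.0);      // rule: a -0 max may hide +0 values
        }
        return OptionalDouble.of(storedMax);
    }

    /** Returns false only when the statistics prove the value is absent. */
    static boolean mightContain(double value, double storedMin, double storedMax) {
        if (Double.isNaN(value)) {
            return true;                        // rule: min/max say nothing about NaN
        }
        OptionalDouble min = effectiveMin(storedMin);
        OptionalDouble max = effectiveMax(storedMax);
        // Primitive comparisons treat -0.0 and +0.0 as equal, so zeros never
        // exclude each other here.
        if (min.isPresent() && value < min.getAsDouble()) {
            return false;
        }
        if (max.isPresent() && value > max.getAsDouble()) {
            return false;
        }
        return true;
    }
}
```

Widening the zero bounds in {{effectiveMin}} and {{effectiveMax}} matters mostly for callers that compare with a total ordering such as {{Double.compare}}, where -0.0 sorts before +0.0.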
For writing, we should skip NaN values and, whenever a zero has to be taken
into account, use -0 for min and +0 for max.
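The writing side could look roughly like the sketch below. It is only an illustration of the rule above under my own assumptions: {{FloatStatsWriter}} and its methods are hypothetical names, not the existing statistics classes of any implementation.

```java
/** Hypothetical accumulator illustrating the writing rule; not an existing Parquet API. */
final class FloatStatsWriter {
    private double min = Double.NaN;
    private double max = Double.NaN;
    private boolean hasValue = false;

    /** Folds one value into the running min/max, skipping NaN entirely. */
    void update(double v) {
        if (Double.isNaN(v)) {
            return;                     // NaN never contributes to min/max
        }
        if (!hasValue) {
            min = v;
            max = v;
            hasValue = true;
            return;
        }
        if (v < min) {
            min = v;
        }
        if (v > max) {
            max = v;
        }
    }

    /** False when no non-NaN value was seen; min/max should then not be written. */
    boolean hasMinMax() {
        return hasValue;
    }

    /** Min to write: any zero is normalized to -0.0. */
    double min() {
        return min == 0.0 ? -0.0 : min;
    }

    /** Max to write: any zero is normalized to +0.0. */
    double max() {
        return max == 0.0 ? 0.0 : max;
    }
}
```

Normalizing the zeros only when the values are read out keeps {{update}} free of sign checks, because {{==}} already treats -0.0 and +0.0 as equal.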
With this solution we cannot do anything clever when searching for NaN, but
that can be fixed separately. We also need to double-check that we really
ignore the min/max stats when searching for NaN.
I think it is a good idea to discuss such topics on the mailing list. However,
we should also time-box the discussion and go forward with a proposed solution
if there is no interest on the mailing list. (Personally, I do not follow the
dev list anymore.)
> Specify a well-defined sorting order for float and double types
> ---------------------------------------------------------------
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Zoltan Ivanfi
> Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers
> as follows:
> {code:java}
> * FLOAT - signed comparison of the represented value
> * DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a
> partial ordering with strange behaviour in specific corner cases. For
> example, according to IEEE 754, -0 is neither less nor more than +0 and
> comparing NaN to anything always returns false. This ordering is not suitable
> for statistics. Additionally, the Java implementation already uses a
> different (total) ordering that handles these cases correctly but differently
> than the C++ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new
> TotalFloatingPointOrder should be introduced. The default for writing doubles
> and floats would be the new TotalFloatingPointOrder. This ordering should be
> effective and easy to implement in all programming languages.