[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316180#comment-17316180
 ] 

Gabor Szadovszky commented on PARQUET-1222:
-------------------------------------------

[~apitrou], I guess what you've described is the write path of the statistics. 
Because you cannot control other writers, I would suggest following the [spec 
for the read 
path|https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L892-L899].
Meanwhile, I've done some investigation in the parquet-mr code and the format, 
and there are several issues related to this topic.
* We created the ColumnOrder object and the related field in the format to 
specify the ordering of the columns and to prepare for a potential solution 
to this (and similar) issues. We reference this field in the Statistics 
object used for row-group level stats, but we do not reference it in 
the column indexes. So, for column indexes it is not clear which sort order 
we want to use or how to handle cases like this. How is it implemented in 
parquet-cpp?
* Based on the referenced workaround, we handle the special floating point 
values at row-group level in parquet-mr, but only on the read path. On the 
write path we still write these values.
* For column indexes we handle these values, but only on the write path and not 
on the read path. 

So, we have a couple of issues around this topic, and it would be great to 
have a final, well-defined solution for it.
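
To make the read-path suggestion concrete, here is a rough Java sketch. It assumes the compatibility rules in the linked parquet.thrift comment are: ignore a NaN min or max, treat a zero min as possibly covering -0, treat a zero max as possibly covering +0, and ignore min/max entirely when looking for NaN. The class and method names are hypothetical and are not taken from parquet-mr or parquet-cpp.

```java
import java.util.OptionalDouble;

// Hypothetical helper, not part of any Parquet implementation.
final class FloatStatsReadPath {

    // Bounds that are safe to use for filtering; an empty bound means "unknown".
    static final class Bounds {
        final OptionalDouble min;
        final OptionalDouble max;

        Bounds(OptionalDouble min, OptionalDouble max) {
            this.min = min;
            this.max = max;
        }
    }

    // Applies the assumed read-path rules to the min/max stored by some writer.
    static Bounds sanitize(double min, double max) {
        // A NaN bound carries no information, so drop it.
        // A zero min is widened to -0.0 and a zero max to +0.0, because the
        // writer may have treated the two zeros as equal.
        OptionalDouble lo = Double.isNaN(min)
                ? OptionalDouble.empty()
                : OptionalDouble.of(min == 0.0 ? -0.0 : min);
        OptionalDouble hi = Double.isNaN(max)
                ? OptionalDouble.empty()
                : OptionalDouble.of(max == 0.0 ? 0.0 : max);
        return new Bounds(lo, hi);
    }

    // Returns false only when the bounds prove the value cannot be present.
    static boolean mightContain(Bounds b, double value) {
        if (Double.isNaN(value)) {
            return true; // min/max say nothing about NaN values
        }
        if (b.min.isPresent() && value < b.min.getAsDouble()) {
            return false;
        }
        if (b.max.isPresent() && value > b.max.getAsDouble()) {
            return false;
        }
        return true;
    }
}
```

The zero widening makes no difference to the primitive comparisons in mightContain, since -0.0 and 0.0 compare as equal there. It matters once the bounds are passed to a comparator that distinguishes the two zeros, such as Double.compare.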

> Specify a well-defined sorting order for float and double types
> ---------------------------------------------------------------
>
>                 Key: PARQUET-1222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Zoltan Ivanfi
>            Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than +0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C++ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.
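
The corner cases in the issue description can be reproduced with a few lines of Java. This is only an illustration: it contrasts the IEEE 754 relational operators with java.lang.Double.compare, which is one example of a total order. Whether Double.compare matches the ordering parquet-mr uses or the proposed TotalFloatingPointOrder is not stated in this thread, and the class and method names below are hypothetical.

```java
// Illustrative only; the class and method names are hypothetical.
final class FloatOrderingDemo {

    // IEEE 754 relational comparison, as exposed by Java's primitive operator.
    // Any comparison involving NaN is false, and -0.0 is not less than +0.0.
    static boolean ieeeLess(double a, double b) {
        return a < b;
    }

    // Total order defined by Double.compare: -0.0 sorts before +0.0, and NaN
    // is equal to itself and greater than every other value.
    static int totalCompare(double a, double b) {
        return Double.compare(a, b);
    }

    public static void main(String[] args) {
        // Partial order: neither zero is less than the other, yet they are "equal".
        System.out.println(ieeeLess(-0.0, 0.0));        // false
        System.out.println(ieeeLess(0.0, -0.0));        // false
        System.out.println(-0.0 == 0.0);                // true

        // Partial order: NaN is unordered with respect to everything.
        System.out.println(ieeeLess(Double.NaN, 1.0));  // false
        System.out.println(ieeeLess(1.0, Double.NaN));  // false

        // Total order: every pair has a defined result.
        System.out.println(totalCompare(-0.0, 0.0) < 0);                             // true
        System.out.println(totalCompare(Double.NaN, Double.POSITIVE_INFINITY) > 0);  // true
    }
}
```

A min/max computed with the relational operators can therefore differ from one computed with a total order whenever the data contains NaN or both zeros, which is the kind of interoperability problem the description refers to.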



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
