[ https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316180#comment-17316180 ]
Gabor Szadovszky commented on PARQUET-1222:
-------------------------------------------

[~apitrou], I guess what you've described is the write path of the statistics. Because you cannot control other writers, I would suggest following the [spec for the read path|https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L892-L899].

Meanwhile, I've done some investigation in the parquet-mr code and the format, and there are several issues related to this topic.
* We created the ColumnOrder object and the related field in the format to specify the ordering of the columns and to prepare for a potential solution to this (and similar) issues. This field is referenced in the Statistics object used for row-group level stats, but not in the column indexes. So for column indexes it is not clear which sort order we want to use or how to handle cases like this. How is it implemented in parquet-cpp?
* Based on the referenced workaround, parquet-mr handles the special floating point values at row-group level, but only on the read path. On the write path we still write these values.
* For column indexes we handle these values, but only on the write path and not on the read path.

So we have a couple of issues around this topic, and it would be great to have a final and well-defined solution for it.
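
To make the "special floating point values" concrete, here is a minimal Java sketch. It is not parquet-mr code: the class name and all helper names are hypothetical, and the reader-side helpers reflect my understanding of the compatibility notes in the linked part of parquet.thrift (ignore NaN bounds, and assume a zero bound may hide the other signed zero). It contrasts the IEEE 754 comparison with the total order of Double.compare, then shows how a reader might sanitize min/max.

```java
public class FloatOrderingDemo {

    // IEEE 754 comparison as exposed by Java's primitive operator.
    // This is only a partial order: -0.0 and +0.0 compare equal,
    // and every comparison involving NaN is false.
    static boolean ieeeLess(double a, double b) {
        return a < b;
    }

    // Java's total order: -0.0 sorts before +0.0, and NaN sorts
    // after every other value, including positive infinity.
    static int totalCompare(double a, double b) {
        return Double.compare(a, b);
    }

    // Hypothetical reader-side check: if either bound is NaN,
    // the min/max pair cannot be used for filtering.
    static boolean boundsUsable(double min, double max) {
        return !Double.isNaN(min) && !Double.isNaN(max);
    }

    // Hypothetical reader-side widening: a zero min may hide -0.0 values,
    // so treat it as -0.0. (min == 0.0 is true for both signed zeros.)
    static double widenMin(double min) {
        return min == 0.0 ? -0.0 : min;
    }

    // Likewise, a zero max may hide +0.0 values, so treat it as +0.0.
    static double widenMax(double max) {
        return max == 0.0 ? 0.0 : max;
    }

    public static void main(String[] args) {
        System.out.println("-0.0 < +0.0 (IEEE):        " + ieeeLess(-0.0, 0.0));
        System.out.println("compare(-0.0, +0.0) < 0:   " + (totalCompare(-0.0, 0.0) < 0));
        System.out.println("NaN < 1.0 (IEEE):          " + ieeeLess(Double.NaN, 1.0));
        System.out.println("1.0 < NaN (IEEE):          " + ieeeLess(1.0, Double.NaN));
        System.out.println("compare(NaN, +Inf) > 0:    "
                + (totalCompare(Double.NaN, Double.POSITIVE_INFINITY) > 0));
        System.out.println("bounds usable (NaN, 1.0):  " + boundsUsable(Double.NaN, 1.0));
    }
}
```

A writer that tracks min/max with the primitive operators and one that uses Double.compare can therefore produce different statistics for the same data, which is why the reader has to be defensive regardless of what any single writer does.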

> Specify a well-defined sorting order for float and double types
> ---------------------------------------------------------------
>
>                 Key: PARQUET-1222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Zoltan Ivanfi
>            Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers
> as follows:
> {code:java}
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a
> partial ordering with strange behaviour in specific corner cases. For
> example, according to IEEE 754, -0 is neither less nor more than +0 and
> comparing NaN to anything always returns false. This ordering is not suitable
> for statistics. Additionally, the Java implementation already uses a
> different (total) ordering that handles these cases correctly but differently
> than the C++ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new
> TotalFloatingPointOrder should be introduced. The default for writing doubles
> and floats would be the new TotalFloatingPointOrder. This ordering should be
> effective and easy to implement in all programming languages.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)