[ https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabor Szadovszky updated PARQUET-1222: -------------------------------------- Fix Version/s: format-2.5.0 > Definition of float and double sort order is ambigious > ------------------------------------------------------ > > Key: PARQUET-1222 > URL: https://issues.apache.org/jira/browse/PARQUET-1222 > Project: Parquet > Issue Type: Bug > Components: parquet-format > Reporter: Zoltan Ivanfi > Priority: Critical > Fix For: format-2.5.0 > > > Currently parquet-format specifies the sort order for floating point numbers > as follows: > {code:java} > * FLOAT - signed comparison of the represented value > * DOUBLE - signed comparison of the represented value > {code} > The problem is that the comparison of floating point numbers is only a > partial ordering with strange behaviour in specific corner cases. For > example, according to IEEE 754, -0 is neither less nor more than \+0 and > comparing NaN to anything always returns false. This ordering is not suitable > for statistics. Additionally, the Java implementation already uses a > different (total) ordering that handles these cases correctly but differently > than the C\+\+ implementations, which leads to interoperability problems. > We should explicitly require implementations to follow a specific comparison > logic for these types. The candidates are: > * The [Java > implementation|http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/Double.java#l999] > which looks easy and efficient to implement in any language. > * The [IEEE 754 totalOrder > predicate|https://github.com/rust-lang/rust/issues/5585] which is rather > complicated to the extent that it is hard to tell whether the Java > implementation adheres to it, so in effect this option may actually be the > same as the one above. > * The [IEEE 754-2008 min and max > operations|https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max] which > may be hard to use for comparison, so components could not use the same > sorting order to achieve the smallest possible min/max ranges (although a > regular sort would probably result in an almost optimal value order). > * We could simply require NaNs to be ignored for calculating min/max. > However, we should also explicitly address -0/+0 values in this case, which > probably leads to the option above. > An additional problem is how to deal with existing data: > * One possibility is to specify legacy rules, like "if the min or max is > NaN, it should be ignored" or that "-0 and +0 should be considered equal for > min/max purposes". > * Another alternative is to deprecate `min_value` and `max_value` and > introduce `yet_another_min` and `yet_another_max` fields instead (with nicer > names, naturally). This could be combined with some legacy rules for the old > field. > * Probably the best solution would be to deprecate TypeDefinedOrder for > doubles and floats and introduce a new TotalOrder. The legacy rule "if the > min or max is NaN, it should be ignored" should apply to TypeDefinedOrder > while the new TotalOrder would not have such restrictions. The default for > writing doubles and floats would be the new TotalOrder. -- This message was sent by Atlassian JIRA (v7.6.3#76005)