[
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644148#comment-17644148
]
ASF GitHub Bot commented on PARQUET-1222:
-----------------------------------------
emkornfield commented on code in PR #185:
URL: https://github.com/apache/parquet-format/pull/185#discussion_r1041775669
##########
README.md:
##########
@@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].
[logical-types]: LogicalTypes.md
+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:
+
+1. Each logical type has a specified comparison order. If a column is
+ annotated with an unknown logical type, statistics may not be used
+ for pruning data. The sort order for logical types is documented in
+ the [LogicalTypes.md][logical-types] page.
+2. For primitives the following sort orders apply:
+
+ * BOOLEAN - false, true
+ * INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values
are
+ not totally ordered due to special case like NaN. They require special
+ handling when reading statistics. The details are documented in
parquet.thrift in the
+ `ColumnOrder` union. They are summarized
Review Comment:
done.
> Specify a well-defined sorting order for float and double types
> ---------------------------------------------------------------
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Zoltan Ivanfi
> Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers
> as follows:
> {code:java}
> * FLOAT - signed comparison of the represented value
> * DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a
> partial ordering with strange behaviour in specific corner cases. For
> example, according to IEEE 754, -0 is neither less nor more than \+0 and
> comparing NaN to anything always returns false. This ordering is not suitable
> for statistics. Additionally, the Java implementation already uses a
> different (total) ordering that handles these cases correctly but differently
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new
> TotalFloatingPointOrder should be introduced. The default for writing doubles
> and floats would be the new TotalFloatingPointOrder. This ordering should be
> effective and easy to implement in all programming languages.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)