[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643902#comment-17643902
 ] 

ASF GitHub Bot commented on PARQUET-1222:
-----------------------------------------

pitrou commented on code in PR #185:
URL: https://github.com/apache/parquet-format/pull/185#discussion_r1041067303


##########
README.md:
##########
@@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].
 
 [logical-types]: LogicalTypes.md
 
+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:

Review Comment:
   ```suggestion
   Parquet stores min/max statistics at several levels (such as Column Chunk,
   Column Index and Data Page). Comparison for values of a type obey the
   following rules:
   ```



##########
README.md:
##########
@@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].
 
 [logical-types]: LogicalTypes.md
 
+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:
+
+1.  Each logical type has a specified comparison order. If a column is
+    annotated with an unknown logical type, statistics may not be used
+    for pruning data. The sort order for logical types is documented in
+    the [LogicalTypes.md][logical-types] page.
+2.  For primitives the following sort orders apply:
+
+    * BOOLEAN - false, true
+    * INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values 
are
+      not totally ordered due to special case like NaN. They require special
+      handling when reading statistics. The details are documented in 
parquet.thrift in the
+      `ColumnOrder` union. They are summarized 

Review Comment:
   Need to synchronize this with the final wording from `parquet.thrift`.



##########
README.md:
##########
@@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].
 
 [logical-types]: LogicalTypes.md
 
+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:
+
+1.  Each logical type has a specified comparison order. If a column is
+    annotated with an unknown logical type, statistics may not be used
+    for pruning data. The sort order for logical types is documented in
+    the [LogicalTypes.md][logical-types] page.
+2.  For primitives the following sort orders apply:

Review Comment:
   ```suggestion
   2.  For primitive types, the following rules apply:
   ```



##########
src/main/thrift/parquet.thrift:
##########
@@ -902,6 +902,13 @@ union ColumnOrder {
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
    *     - When looking for NaN values, min and max should be ignored.
+   * 
+   *     When writing statistics the following rules should be followed:
+   *     - NaNs should not be written to min or max statistics fields.
+   *     - Only -0 should be written into min statistics fields (if only 
+   *       +0 is present in the column it should be converted to -0.0).
+   *     - Only +0 should be written into a max statistics fields (if 
+   *       only -0 is present it must be convereted to +0).

Review Comment:
   Suggestion to make wording clearer.
   ```suggestion
      *     - If the computed max value is zero (whether negative or positive),
      *       `+0.0` should be written into the max statistics field.
      *     - If the computed min value is zero (whether negative or positive),
      *       `-0.0` should be written into the min statistics field.
   ```



##########
README.md:
##########
@@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].
 
 [logical-types]: LogicalTypes.md
 
+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:
+
+1.  Each logical type has a specified comparison order. If a column is
+    annotated with an unknown logical type, statistics may not be used
+    for pruning data. The sort order for logical types is documented in
+    the [LogicalTypes.md][logical-types] page.
+2.  For primitives the following sort orders apply:
+
+    * BOOLEAN - false, true
+    * INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values 
are

Review Comment:
   I would suggest making a separate list item for floating-point:
   ```suggestion
       * INT32, INT64 - Signed comparison.
       * FLOAT, DOUBLE - Signed comparison with special handling of NaNs
         and signed zeros. The details are documented in...
   ```





> Specify a well-defined sorting order for float and double types
> ---------------------------------------------------------------
>
>                 Key: PARQUET-1222
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1222
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-format
>            Reporter: Zoltan Ivanfi
>            Priority: Critical
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>    *   FLOAT - signed comparison of the represented value
>    *   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to