This is an automated email from the ASF dual-hosted git repository.

gangwu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/master by this push:
     new 3ce0760  MINOR: Small documentation fixes and deduplication (#491)
3ce0760 is described below

commit 3ce0760933b875bc8a11f5be0b883cd107b95b43
Author: Jan Finis <[email protected]>
AuthorDate: Fri Apr 18 04:16:54 2025 +0200

    MINOR: Small documentation fixes and deduplication (#491)
---
 README.md                      | 37 ++++---------------------------------
 src/main/thrift/parquet.thrift |  5 +++--
 2 files changed, 7 insertions(+), 35 deletions(-)

diff --git a/README.md b/README.md
index df0ac73..ae7272f 100644
--- a/README.md
+++ b/README.md
@@ -155,40 +155,11 @@ documented in [LogicalTypes.md][logical-types].
 [logical-types]: LogicalTypes.md
 
 ### Sort Order
-
 Parquet stores min/max statistics at several levels (such as Column Chunk,
-Column Index and Data Page). Comparison for values of a type obey the
-following rules:
-
-1.  Each logical type has a specified comparison order. If a column is
-    annotated with an unknown logical type, statistics may not be used
-    for pruning data. The sort order for logical types is documented in
-    the [LogicalTypes.md][logical-types] page.
-2.  For primitive types, the following rules apply:
-
-    * BOOLEAN - false, true
-    * INT32, INT64 - Signed comparison.
-    * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
-      signed zeros.   The details are documented in the
-      [Thrift definition](src/main/thrift/parquet.thrift) in the
-      `ColumnOrder` union. They are summarized here but the Thrift definition
-      is considered authoritative:
-      * NaNs should not be written to min or max statistics fields.
-      * If the computed max value is zero (whether negative or positive),
-        `+0.0` should be written into the max statistics field.
-      * If the computed min value is zero (whether negative or positive),
-        `-0.0` should be written into the min statistics field.
-
-      For backwards compatibility when reading files:
-      * If the min is a NaN, it should be ignored.
-      * If the max is a NaN, it should be ignored.
-      * If the min is +0, the row group may contain -0 values as well.
-      * If the max is -0, the row group may contain +0 values as well.
-      * When looking for NaN values, min and max should be ignored.
-
-    * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
-      comparison.
-
+Column Index, and Data Page). These statistics are according to a sort order,
+which is defined for each column in the file footer. Parquet supports common
+sort orders for logical and primitve types. The details are documented in the
+[Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 
 ## Nested Encoding
 To encode nested columns, Parquet uses the Dremel encoding with definition and
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index ff32717..59ec5f1 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -313,12 +313,12 @@ struct Statistics {
 
 /** Empty structs to use as logical type annotations */
 struct StringType {}  // allowed for BYTE_ARRAY, must be encoded with UTF-8
-struct UUIDType {}    // allowed for FIXED[16], must encoded raw UUID bytes
+struct UUIDType {}    // allowed for FIXED[16], must be encoded as raw UUID bytes
 struct MapType {}     // see LogicalTypes.md
 struct ListType {}    // see LogicalTypes.md
 struct EnumType {}    // allowed for BYTE_ARRAY, must be encoded with UTF-8
 struct DateType {}    // allowed for INT32
-struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
+struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 bytes (see LogicalTypes.md)
 
 /**
  * Logical type to annotate a column that is always null.
@@ -1057,6 +1057,7 @@ union ColumnOrder {
    *   UINT64 - unsigned comparison
    *   DECIMAL - signed comparison of the represented value
    *   DATE - signed comparison
+   *   FLOAT16 - signed comparison of the represented value (*)
    *   TIME_MILLIS - signed comparison
    *   TIME_MICROS - signed comparison
    *   TIMESTAMP_MILLIS - signed comparison
