This is an automated email from the ASF dual-hosted git repository.
gangwu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 3ce0760 MINOR: Small documentation fixes and deduplication (#491)
3ce0760 is described below
commit 3ce0760933b875bc8a11f5be0b883cd107b95b43
Author: Jan Finis <[email protected]>
AuthorDate: Fri Apr 18 04:16:54 2025 +0200
MINOR: Small documentation fixes and deduplication (#491)
---
README.md | 37 ++++---------------------------------
src/main/thrift/parquet.thrift | 5 +++--
2 files changed, 7 insertions(+), 35 deletions(-)
diff --git a/README.md b/README.md
index df0ac73..ae7272f 100644
--- a/README.md
+++ b/README.md
@@ -155,40 +155,11 @@ documented in [LogicalTypes.md][logical-types].
[logical-types]: LogicalTypes.md
### Sort Order
-
Parquet stores min/max statistics at several levels (such as Column Chunk,
-Column Index and Data Page). Comparison for values of a type obey the
-following rules:
-
-1. Each logical type has a specified comparison order. If a column is
- annotated with an unknown logical type, statistics may not be used
- for pruning data. The sort order for logical types is documented in
- the [LogicalTypes.md][logical-types] page.
-2. For primitive types, the following rules apply:
-
- * BOOLEAN - false, true
- * INT32, INT64 - Signed comparison.
- * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
- signed zeros. The details are documented in the
- [Thrift definition](src/main/thrift/parquet.thrift) in the
- `ColumnOrder` union. They are summarized here but the Thrift definition
- is considered authoritative:
- * NaNs should not be written to min or max statistics fields.
- * If the computed max value is zero (whether negative or positive),
- `+0.0` should be written into the max statistics field.
- * If the computed min value is zero (whether negative or positive),
- `-0.0` should be written into the min statistics field.
-
- For backwards compatibility when reading files:
- * If the min is a NaN, it should be ignored.
- * If the max is a NaN, it should be ignored.
- * If the min is +0, the row group may contain -0 values as well.
- * If the max is -0, the row group may contain +0 values as well.
- * When looking for NaN values, min and max should be ignored.
-
- * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
- comparison.
-
+Column Index, and Data Page). These statistics are according to a sort order,
+which is defined for each column in the file footer. Parquet supports common
+sort orders for logical and primitive types. The details are documented in the
+[Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
## Nested Encoding
To encode nested columns, Parquet uses the Dremel encoding with definition and
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index ff32717..59ec5f1 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -313,12 +313,12 @@ struct Statistics {
/** Empty structs to use as logical type annotations */
struct StringType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8
-struct UUIDType {} // allowed for FIXED[16], must encoded raw UUID bytes
+struct UUIDType {} // allowed for FIXED[16], must be encoded as raw UUID bytes
struct MapType {} // see LogicalTypes.md
struct ListType {} // see LogicalTypes.md
struct EnumType {} // allowed for BYTE_ARRAY, must be encoded with UTF-8
struct DateType {} // allowed for INT32
-struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
+struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 bytes (see LogicalTypes.md)
/**
* Logical type to annotate a column that is always null.
@@ -1057,6 +1057,7 @@ union ColumnOrder {
* UINT64 - unsigned comparison
* DECIMAL - signed comparison of the represented value
* DATE - signed comparison
+ * FLOAT16 - signed comparison of the represented value (*)
* TIME_MILLIS - signed comparison
* TIME_MICROS - signed comparison
* TIMESTAMP_MILLIS - signed comparison