This is an automated email from the ASF dual-hosted git repository.
emkornfield pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 4b1c72c GH-534: Clarify versioning and V2 (#535)
4b1c72c is described below
commit 4b1c72c837bec5b792b2514f0057533030fcedf8
Author: emkornfield <[email protected]>
AuthorDate: Fri Dec 19 08:37:30 2025 -0800
GH-534: Clarify versioning and V2 (#535)
Clarify versioning and no restrictions on encodings.
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
---
Encodings.md | 7 +++++--
src/main/thrift/parquet.thrift | 16 ++++++++++++++--
2 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/Encodings.md b/Encodings.md
index 62b4eb9..e620e9a 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -22,6 +22,9 @@ Parquet encoding definitions
This file contains the specification of all supported encodings.
+Unless otherwise stated in page or encoding documentation, any encoding can be
+used with any page type.
+
<a name="PLAIN"></a>
### Plain: (PLAIN = 0)
@@ -59,8 +62,8 @@ Dictionary page format: the entries in the dictionary using the [plain](#PLAIN)
Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32),
followed by the values encoded using RLE/Bit packed described above (with the given bit width).
-Using the PLAIN_DICTIONARY enum value is deprecated in the Parquet 2.0 specification. Prefer using RLE_DICTIONARY
-in a data page and PLAIN in a dictionary page for Parquet 2.0+ files.
+Using the `PLAIN_DICTIONARY` enum value is deprecated, use `RLE_DICTIONARY`
+in a data page and `PLAIN` in a dictionary page for new Parquet files.
<a name="RLE"></a>
### Run Length Encoding / Bit-Packing Hybrid (RLE = 3)
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index e99c461..7ff9b9f 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -712,9 +712,14 @@ struct DictionaryPageHeader {
}
/**
- * New page format allowing reading levels without decompressing the data
+ * Alternate page format allowing reading levels without decompressing the data
* Repetition and definition levels are uncompressed
* The remaining section containing the data is compressed if is_compressed is true
+ *
+ * Implementation note - this header is not necessarily a strict improvement over
+ * `DataPageHeader` (in particular the original header might provide better compression
+ * in some scenarios). Page indexes require pages to start and end at row boundaries,
+ * regardless of which page header is used.
**/
struct DataPageHeaderV2 {
/** Number of values, including NULLs, in this data page. **/
@@ -1255,7 +1260,14 @@ union EncryptionAlgorithm {
* Description for file metadata
*/
struct FileMetaData {
- /** Version of this file **/
+ /** Version of this file
+ *
+ * As of December 2025, there is no agreed upon consensus of what constitutes
+ * version 2 of the file. For maximum compatibility with readers, writers should
+ * always populate "1" for version. For maximum compatibility with writers,
+ * readers should accept "1" and "2" interchangeably. All other versions are
+ * reserved for potential future use-cases.
+ */
1: required i32 version
/** Parquet schema for this file. This schema contains metadata for all the columns.