alamb commented on code in PR #535:
URL: https://github.com/apache/parquet-format/pull/535#discussion_r2594291019
##########
src/main/thrift/parquet.thrift:
##########
@@ -1255,7 +1259,12 @@ union EncryptionAlgorithm {
* Description for file metadata
*/
struct FileMetaData {
- /** Version of this file **/
+ /** Version of this file
+ *
+ * Deprecated. Readers should determine if they support reading based on
+ * specific metadata (e.g. encoding enum) rather then relying on this field
+ * to make this determination.
+ */
Review Comment:
Rather than potentially (implicitly) changing the spec, I suggest we add
some langague to help people understand how to interpret this field and the
tradeoffs between using different versions.
For example, maybe we could say something like
> As of December 2025, there is no agreed upon set of features that
constitute version 2, so for maximum compatibility, writers should populate "1"
for version and accept "1" and "2" interchangeably.
##########
src/main/thrift/parquet.thrift:
##########
@@ -715,6 +715,13 @@ struct DictionaryPageHeader {
* New page format allowing reading levels without decompressing the data
* Repetition and definition levels are uncompressed
* The remaining section containing the data is compressed if is_compressed is
true
+ *
+ * Implementation note - this header is not necessarily a strict improvement
over
+ * `DataPageHeader` (in particular the original header might provide better
compression
+ * in some scenarios). Page indexes require pages start and end at row
boundaries regardless of which
+ * page header is used.
+ *
+ * As of December 2025, most known Parquet readers can read pages using this
header.
Review Comment:
Maybe we can also change the content of
> * New page format allowing reading levels without decompressing the data
To another word to make it clearer that there is no agreed upon expectation
that all writers will eventually use this new header. Perhaps:
> * Alternate page format allowing reading levels without decompressing the
data
##########
src/main/thrift/parquet.thrift:
##########
@@ -715,6 +715,13 @@ struct DictionaryPageHeader {
* New page format allowing reading levels without decompressing the data
* Repetition and definition levels are uncompressed
* The remaining section containing the data is compressed if is_compressed is
true
+ *
+ * Implementation note - this header is not necessarily a strict improvement
over
+ * `DataPageHeader` (in particular the original header might provide better
compression
+ * in some scenarios). Page indexes require pages start and end at row
boundaries regardless of which
+ * page header is used.
+ *
+ * As of December 2025, most known Parquet readers can read pages using this
header.
Review Comment:
Perhaps we could refer readers to
https://parquet.apache.org/docs/file-format/implementationstatus/ for the most
up to date information
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]