danielcweeks commented on code in PR #535:
URL: https://github.com/apache/parquet-format/pull/535#discussion_r2582449315
##########
src/main/thrift/parquet.thrift:
##########
@@ -1255,7 +1259,12 @@ union EncryptionAlgorithm {
* Description for file metadata
*/
struct FileMetaData {
- /** Version of this file **/
+ /** Version of this file
+ *
+ * Deprecated. Readers should determine if they support reading based on
+ * specific metadata (e.g. encoding enum) rather then relying on this field
+ * to make this determination.
+ */
Review Comment:
I disagree with this. I don't think we should abandon versioning, but
rather be more explicit about breaking changes and what is included with
a version update. Regardless, this needs more discussion with the community
and a clear path forward for how we support breaking changes.
##########
src/main/thrift/parquet.thrift:
##########
@@ -715,6 +715,10 @@ struct DictionaryPageHeader {
* New page format allowing reading levels without decompressing the data
* Repetition and definition levels are uncompressed
* The remaining section containing the data is compressed if is_compressed is true
+ *
+ * N.B. this page header is not necessarily strictly better then DataPageHeader.
+ * Page indexes already require that rows are aligned on page boundaries, and compressing
+ * repetition and definition levels can still be effective in some cases.
Review Comment:
Are you saying this is deprecated? Why do we need this comment? It's not
clear what you're trying to achieve here. (Nit: prefer not to use
abbreviations like N.B.)
##########
Encodings.md:
##########
@@ -22,6 +22,11 @@ Parquet encoding definitions
This file contains the specification of all supported encodings.
+Some Parquet implementations distinguish encodings as "v1" and "v2". From
+a specification perspective this distinction is considered meaningless. Writers may use any
+encoding with both data page v1 and data page v2. Readers should lazily evaluate if they can
+read a file (e.g. only error when required to a read a page with an unknown encoding).
+
Review Comment:
I feel like we're redefining what `version` means to be scoped only to
encodings and then saying that it's not necessary. It seems like we want to
either separate encodings from versioning (e.g. any encoding that is understood
by a client should be considered supported regardless of when it was
introduced) or be more explicit about associating new encodings with a version
(along with other possible breaking structural/representational changes).
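
To make the first option concrete, here is a minimal Python sketch of a
reader that ignores the file-level version and only fails when it is asked
to decode a page whose encoding it does not understand. The names
`SUPPORTED_ENCODINGS`, `Page`, `read_page`, and `read_columns` are
hypothetical and not taken from any Parquet implementation; only the
encoding names mirror the `Encoding` enum in parquet.thrift.

```python
from dataclasses import dataclass

# Hypothetical: the encodings this particular reader can decode.
# The names mirror the Encoding enum; the choice of set is illustrative.
SUPPORTED_ENCODINGS = {"PLAIN", "RLE", "RLE_DICTIONARY", "DELTA_BINARY_PACKED"}


class UnsupportedEncodingError(Exception):
    """Raised only when a page that must be decoded uses an unknown encoding."""


@dataclass
class Page:
    column: str
    encoding: str


def read_page(page: Page) -> str:
    # The support check happens per page, at decode time, not per file.
    if page.encoding not in SUPPORTED_ENCODINGS:
        raise UnsupportedEncodingError(
            f"column {page.column!r} uses unsupported encoding {page.encoding!r}"
        )
    # Stand-in for real decoding.
    return f"decoded {page.column} ({page.encoding})"


def read_columns(pages: list[Page], wanted: set[str]) -> list[str]:
    # Pages of columns that were not requested are never inspected, so an
    # unknown encoding elsewhere in the file does not block this read.
    return [read_page(p) for p in pages if p.column in wanted]


pages = [Page("a", "PLAIN"), Page("b", "BYTE_STREAM_SPLIT")]
result = read_columns(pages, {"a"})  # succeeds; column "b" is never touched
```

Under this approach a file containing a newer encoding stays partially
readable. The second option would instead reject the whole file up front
based on a declared version, which is simpler to reason about but stricter.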
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]