Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

via GitHub Thu, 16 May 2024 05:32:15 -0700


gszadovszky commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1603224854



##########
README.md:
##########
@@ -113,6 +119,55 @@ chunks they are interested in.  The columns chunks should 
then be read sequentia
 
  ![File 
Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
 
+### Parquet 3
+
+Parquet 3 files have the following overall structure:
+
+```
+4-byte magic number "PAR1"
+4-byte magic number "PAR3"
+8-byte offset of File Metadata v3
+8-byte length of File Metadata v3

Review Comment:
   Doesn't it mean we need to keep all row groups in memory until we start 
closing the file to fill these values? That's why we originally had the footer 
length at the end of the file.



##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +971,44 @@ struct ColumnChunk {
   9: optional binary encrypted_column_metadata
 }
 
+struct ColumnChunkV3 {
+  /** File where column data is stored. **/
+  1: optional string file_path

Review Comment:
   Do we want to keep this concept of having the metadata in a separate file? I 
did not see it working anywhere.



##########
README.md:
##########
@@ -113,6 +119,55 @@ chunks they are interested in.  The columns chunks should 
then be read sequentia
 
  ![File 
Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
 
+### Parquet 3
+
+Parquet 3 files have the following overall structure:
+
+```
+4-byte magic number "PAR1"
+4-byte magic number "PAR3"
+8-byte offset of File Metadata v3
+8-byte length of File Metadata v3

Review Comment:
   Wondering if there might be any systems/tools that takes advantage of the 
first page starts just after the magic number. Probably not a serious concern, 
though.



##########
README.md:
##########
@@ -113,6 +119,55 @@ chunks they are interested in.  The columns chunks should 
then be read sequentia
 
  ![File 
Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
 
+### Parquet 3
+
+Parquet 3 files have the following overall structure:
+
+```
+4-byte magic number "PAR1"
+4-byte magic number "PAR3"
+8-byte offset of File Metadata v3
+8-byte length of File Metadata v3

Review Comment:
   An alternative way would be storing these between `File Matadata v3` and 
`File Metadata`



##########
src/main/thrift/parquet.thrift:
##########
@@ -835,6 +864,63 @@ struct ColumnMetaData {
   16: optional SizeStatistics size_statistics;
 }
 
+struct ColumnChunkMetaDataV3 {
+  /** REMOVED from v1: type (redundant with SchemaElementV3) */
+  /** REMOVED from v1: encodings (unnecessary and non-trivial to get right) */
+  /** REMOVED from v1: path_in_schema (unnecessary and wasteful) */
+  /** REMOVED from v1: index_page_offset (unused in practice?) */
+  /** REMOVED from v1: statistics (use ColumnIndex and/or page-level 
statistics instead) */

Review Comment:
   So we cannot filter on row groups easily/quickly anymore. Would be nice to 
see some benchmarks what we loose here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

Reply via email to