Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

via GitHub Tue, 21 May 2024 23:09:12 -0700


ggershinsky commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1609311732



##########
README.md:
##########
@@ -107,12 +113,97 @@ start locations.  More details on what is contained in 
the metadata can be found
 in the Thrift definition.
 
 Metadata is written after the data to allow for single pass writing.
+This is especially useful when writing to backends such as S3.
 
 Readers are expected to first read the file metadata to find all the column
 chunks they are interested in.  The columns chunks should then be read 
sequentially.
 
  ![File 
Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
 
+### Parquet 3
+
+Parquet 3 files have the following overall structure:
+
+```
+4-byte magic number "PAR1"
+4-byte magic number "PAR3"
+
+<Column 1 Chunk 1 + Column Metadata>
+<Column 2 Chunk 1 + Column Metadata>
+...
+<Column N Chunk 1 + Column Metadata>
+<Column 1 Chunk 2 + Column Metadata>
+<Column 2 Chunk 2 + Column Metadata>
+...
+<Column N Chunk 2 + Column Metadata>
+...
+<Column 1 Chunk M + Column Metadata>
+<Column 2 Chunk M + Column Metadata>
+...
+<Column N Chunk M + Column Metadata>
+
+<File-level Column 1 Metadata v3>

Review Comment:
   If we'll have a file-level column metadata, we can optimize the storage of 
key_metadata of encrypted columns - by keeping the column key_metadata here, 
instead of the ColumnChunk structures. 
   Note: this is a storage-only optimization [O(N) instead of O(NxM)]; the 
reader processes only one key_metadata object per column [O(N)] already today.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

Reply via email to