Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

via GitHub Thu, 06 Jun 2024 06:58:40 -0700


adamreeve commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1629579263



##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +971,44 @@ struct ColumnChunk {
   9: optional binary encrypted_column_metadata
 }
 
+struct ColumnChunkV3 {
+  /** File where column data is stored. **/
+  1: optional string file_path

Review Comment:
   @alkis, no, there is no `_metadata` column in the schema. There is a file 
named `_metadata` at the root of the dataset directory that contains a copy of 
the metadata from all N Parquet files in the dataset. The only difference is 
that each copy in this file has its `file_path` field set to the path of the 
Parquet file that holds the corresponding data. The `_metadata` file contains 
no data pages itself, but it can be used like an index to determine which file 
to read data from based on the metadata.
   
   Also, because the Parquet files containing the data still store their own 
metadata (with the `file_path` fields not set), they can be read independently 
of the `_metadata` file. The dataset therefore remains readable by 
implementations that don't support `_metadata` files, while implementations 
that do support them can use the file as an optimisation.
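
   As a rough illustration of the idea, here is a toy Python model, not the 
actual Thrift structures or any Parquet library API. `ColumnChunkMeta`, 
`build_sidecar`, `files_to_read`, the file names, and the use of min/max values 
as the metadata being consulted are all hypothetical and chosen only for this 
sketch.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ColumnChunkMeta:
    """Toy stand-in for column chunk metadata (hypothetical, not the Thrift struct)."""
    column: str
    min_value: int  # assumed statistic, used here only to show index-style lookup
    max_value: int
    file_path: Optional[str] = None  # not set in a data file's own footer


def build_sidecar(footers: dict[str, list[ColumnChunkMeta]]) -> list[ColumnChunkMeta]:
    """Copy every footer into one list, setting file_path on each copy."""
    sidecar = []
    for path, chunks in footers.items():
        for chunk in chunks:
            # The originals are left untouched; only the copy gets a file_path.
            sidecar.append(replace(chunk, file_path=path))
    return sidecar


def files_to_read(sidecar: list[ColumnChunkMeta], column: str, value: int) -> list[str]:
    """Return the paths whose [min, max] range for `column` could contain `value`."""
    return sorted({
        c.file_path
        for c in sidecar
        if c.column == column and c.min_value <= value <= c.max_value
    })


# Footers as stored in the individual data files (file_path not set).
footers = {
    "part-0.parquet": [ColumnChunkMeta("id", 0, 99)],
    "part-1.parquet": [ColumnChunkMeta("id", 100, 199)],
}

# Contents of the hypothetical `_metadata` sidecar: same metadata, file_path set.
sidecar = build_sidecar(footers)
print(files_to_read(sidecar, "id", 150))  # → ['part-1.parquet']
```

   The per-file footers are never modified, which is why each data file stays 
readable on its own; a reader that ignores the sidecar simply opens every file.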



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


