Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

via GitHub Thu, 06 Jun 2024 05:10:55 -0700


adamreeve commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1629398130



##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +971,44 @@ struct ColumnChunk {
   9: optional binary encrypted_column_metadata
 }
 
+struct ColumnChunkV3 {
+  /** File where column data is stored. **/
+  1: optional string file_path

Review Comment:
   I think Audrius has answered your question, @JFinis (Audrius and I are 
colleagues). I just want to add that I don't think what we're doing is 
particularly unusual: it is a feature of the Arrow Dataset library, so 
presumably there are other users of this API (see 
[pyarrow.dataset.parquet_dataset](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.parquet_dataset.html)).
   
   The benefits of using this `_metadata` index file should also carry over 
to cloud object stores, since it reduces the number of objects/files that 
have to be read. One factor that may not be clear is that this `_metadata` 
file contains metadata for multiple (usually many) Parquet files, which is 
why it helps to have it as a separate file. If there were only a single 
Parquet file in the dataset, the `_metadata` file would provide no benefit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

