adamreeve commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1629398130
##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +971,44 @@ struct ColumnChunk {
9: optional binary encrypted_column_metadata
}
+struct ColumnChunkV3 {
+ /** File where column data is stored. **/
+ 1: optional string file_path
Review Comment:
I think Audrius has answered your question, @JFinis (Audrius and I are work
colleagues). I just want to add that I don't think what we're doing is
particularly unusual: this is a feature of the Arrow Dataset library, so
presumably there are other users of this API (see
[pyarrow.dataset.parquet_dataset](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.parquet_dataset.html)).
The benefits of using this `_metadata` index file should carry over to
cloud object stores too, since it reduces the number of objects/files that
have to be read. One factor that may not be clear is that this `_metadata`
file contains metadata for multiple (usually many) Parquet files, which is why
it helps to have it as a separate file. If there were only a single Parquet
file in the dataset, the `_metadata` file would provide no benefit.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]