jorisvandenbossche commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1630186403
##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +971,44 @@ struct ColumnChunk {
9: optional binary encrypted_column_metadata
}
+struct ColumnChunkV3 {
+ /** File where column data is stored. **/
+ 1: optional string file_path
Review Comment:
For context: AFAIK, the `_metadata` summary file was a practice that
originated in Spark (and was supported by parquet-mr) and was later adopted by
other projects, Dask among them. We then implemented support for it in
Arrow C++ / PyArrow, mostly based on the Dask usage (Dask being a downstream
user of PyArrow). In the meantime, though, Spark disabled writing those files
by default a long time ago, and Dask also stopped doing this 2 years ago
(https://github.com/dask/dask/issues/8901).
Another Parquet dev mailing list thread with some discussion about this:
https://lists.apache.org/thread/142yj57c68s2ob5wkrs80xsjoksm7rb7
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]