jorisvandenbossche commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1630186403


##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +971,44 @@ struct ColumnChunk {
   9: optional binary encrypted_column_metadata
 }
 
+struct ColumnChunkV3 {
+  /** File where column data is stored. **/
+  1: optional string file_path

Review Comment:
   For context, AFAIK the `_metadata` summary file was a practice originally 
used in Spark (and supported by Parquet-mr), and, inspired by that, it was 
also adopted in Dask, for example. We then implemented support for this in 
Arrow C++ / PyArrow, mostly based on the Dask usage (Dask being a downstream 
user of pyarrow). In the meantime, though, Spark disabled writing those files 
by default a long time ago (https://issues.apache.org/jira/browse/SPARK-15719), 
and Dask also stopped doing this 2 years ago 
(https://github.com/dask/dask/issues/8901).
   
   Another Parquet dev mailing list thread with some discussion about this: 
https://lists.apache.org/thread/142yj57c68s2ob5wkrs80xsjoksm7rb7



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


