AudriusButkevicius commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1623924516
##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +971,44 @@ struct ColumnChunk {
9: optional binary encrypted_column_metadata
}
+struct ColumnChunkV3 {
+ /** File where column data is stored. **/
+ 1: optional string file_path
Review Comment:
Correct me if I am wrong, but the main purpose of this field is to gather the
row group metadata into a single file, so that when reading a large dataset you
don't need to open every file and read its metadata just to decide whether any
row groups in that file need reading.
Opening every file is particularly expensive on networked filesystems and for
very large datasets, so this feature is very useful for minimising I/O and
latency when reading.
My use case is reading the same large dataset from a compute farm (thousands
of machines). Walking the directory tree and opening each file to read its
metadata from every farm node really upsets the network filesystem.
I fail to see how this is not useful, or why it is a PITA, given that it is
fully optional, but I'm happy for it to go as long as there is an alternative
that reduces I/O in the same way.
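To make the access pattern concrete, here is a rough sketch of what I mean. It
is not a real Parquet reader: `RowGroupEntry`, `files_to_open` and the file
names are hypothetical stand-ins for what a single consolidated metadata file
would give a reader, assuming each entry carries a `file_path` plus min/max
statistics for the filter column.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class RowGroupEntry:
    """One row group as listed in a hypothetical consolidated metadata file."""
    file_path: str  # plays the role of ColumnChunk.file_path
    min_value: int  # assumed min statistic for the filter column
    max_value: int  # assumed max statistic for the filter column


def files_to_open(index: List[RowGroupEntry], lo: int, hi: int) -> Set[str]:
    """Return the data files with at least one row group overlapping [lo, hi]."""
    return {e.file_path for e in index if e.max_value >= lo and e.min_value <= hi}


# Stand-in for the contents of one consolidated metadata file (made-up values).
index = [
    RowGroupEntry("part-0.parquet", 0, 99),
    RowGroupEntry("part-0.parquet", 100, 199),
    RowGroupEntry("part-1.parquet", 200, 299),
    RowGroupEntry("part-2.parquet", 300, 399),
]

# A reader filtering on 150 <= value <= 250 never has to touch part-2.parquet.
print(sorted(files_to_open(index, 150, 250)))
```

Each node reads one metadata file and then opens only the data files that can
contain matching rows, instead of walking the tree and opening every footer.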
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]