Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

via GitHub Mon, 03 Jun 2024 00:48:15 -0700


AudriusButkevicius commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1623924516



##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +971,44 @@ struct ColumnChunk {
   9: optional binary encrypted_column_metadata
 }
 
+struct ColumnChunkV3 {
+  /** File where column data is stored. **/
+  1: optional string file_path

Review Comment:
   Correct me if I am wrong, but the main purpose of `file_path` is to gather 
the row group metadata for a dataset into a single file, so that when reading 
a large dataset you don't need to open every file and read its metadata just 
to decide whether any row groups in that file need reading.
   
   Opening every file is particularly expensive on networked filesystems with 
very large datasets, so this feature is very useful for minimising I/O and 
read latency.
   
   My use case is reading the same large dataset from a compute farm 
(thousands of machines). Having each farm node walk the directory tree and 
open every file to read its metadata really upsets the network filesystem.
   
   I fail to see how this is not useful, or why it is a PITA, given that it is 
fully optional. That said, I'm happy for it to go as long as there is an 
alternative that reduces I/O in the same way.
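   
   To make the access pattern concrete, here is a minimal sketch of how a 
reader might use such an aggregated metadata file to decide which data files 
to open. Everything in it is hypothetical (the `RowGroupEntry` type, the 
`files_to_open` helper, the file names and the statistics); it models only the 
idea of pruning by `file_path` and is not the Parquet API.

```python
from dataclasses import dataclass


@dataclass
class RowGroupEntry:
    """One row group as recorded in a hypothetical aggregated metadata file."""
    file_path: str  # stands in for the optional file_path on the column chunk
    min_value: int  # illustrative min statistic for the filtered column
    max_value: int  # illustrative max statistic for the filtered column


def files_to_open(index, lo, hi):
    """Return data files with at least one row group overlapping [lo, hi]."""
    selected = []
    for entry in index:
        overlaps = entry.max_value >= lo and entry.min_value <= hi
        if overlaps and entry.file_path not in selected:
            selected.append(entry.file_path)
    return selected


# Made-up contents of the single metadata file, read once per node.
index = [
    RowGroupEntry("part-0.parquet", 0, 99),
    RowGroupEntry("part-0.parquet", 100, 199),
    RowGroupEntry("part-1.parquet", 200, 299),
    RowGroupEntry("part-2.parquet", 300, 399),
]

# Only files whose row groups can match the predicate need to be opened.
print(files_to_open(index, 150, 250))
```

   With this layout each node reads one metadata file instead of walking the 
directory tree, and here it skips `part-2.parquet` entirely.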



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


