Re: [PR] DRAFT: Parquet 3 metadata with decoupled column metadata [parquet-format]

via GitHub Thu, 13 Jun 2024 16:38:34 -0700


adamreeve commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1639049036



##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +971,44 @@ struct ColumnChunk {
   9: optional binary encrypted_column_metadata
 }
 
+struct ColumnChunkV3 {
+  /** File where column data is stored. **/
+  1: optional string file_path

Review Comment:
   >> And the benefits of using this _metadata index file should translate to 
cloud object stores too by reducing the number of objects/files to be read.
   
   > not really
   
   Sorry, I couldn't really follow this argument; that sounds like a 
Hadoop-specific problem. To me, a cloud object store means something like S3. 
For our use case, we're mostly concerned with reducing the number of objects 
that need to be read to satisfy queries that filter data and don't need every 
file in a dataset, as we have many concurrent jobs running and adding load on 
storage.
   
   > Another Parquet dev mailing list thread with some discussion about this: 
https://lists.apache.org/thread/142yj57c68s2ob5wkrs80xsjoksm7rb7
   
   Much of the discussion there seems to concern issues users can run into 
when doing things like overwriting Parquet files or using heterogeneous 
schemas, which this feature was not designed for. But it sounds like others 
have also found this feature useful. I think this quote from Patrick Woody 
matches our experience: "outstandingly useful when you have well laid out data 
with a sort-order".
   
   > @adamreeve I see so the parquet file is one with all the metadata and all 
the data is in files pointed to by this singleton.
   
   Yes, exactly.
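
To make the read pattern concrete, here is a minimal sketch of how a reader might use such a metadata-only file to decide which data objects to fetch. It assumes per-chunk min/max statistics are available next to `file_path`; `ChunkMeta`, `files_to_read` and the `part-N.parquet` names are hypothetical and not part of any Parquet API.

```python
from dataclasses import dataclass


@dataclass
class ChunkMeta:
    """Hypothetical stand-in for one column chunk entry in a `_metadata` footer."""
    file_path: str   # data file that holds this chunk, relative to `_metadata`
    column: str
    min_value: int   # assumed min/max statistics for the chunk
    max_value: int


def files_to_read(chunks, column, lo, hi):
    """Return the data files whose [min, max] range for `column` overlaps [lo, hi]."""
    needed = []
    for c in chunks:
        overlaps = c.max_value >= lo and c.min_value <= hi
        if c.column == column and overlaps and c.file_path not in needed:
            needed.append(c.file_path)
    return needed


# Illustrative index for data sorted on "ts" and split across three objects.
index = [
    ChunkMeta("part-0.parquet", "ts", 0, 99),
    ChunkMeta("part-1.parquet", "ts", 100, 199),
    ChunkMeta("part-2.parquet", "ts", 200, 299),
]

print(files_to_read(index, "ts", 120, 180))  # ['part-1.parquet']
```

With sorted data, a narrow filter touches one data object plus the single `_metadata` object, rather than the footer of every file in the dataset.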
   
   The `_metadata` file format could have been designed so that the `file_path` 
field wasn't needed in the column chunk metadata. But it's there now and 
provides value to users while adding minimal overhead to those not using it 
(missing fields require zero space in serialized messages if I've understood 
the [Thrift Compact 
Protocol](https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md)
 correctly).
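
As a rough check of that reading of the spec, the sketch below hand-encodes a two-field struct in the compact protocol's field-header format. It is not the Thrift library; `encode_struct` is a hypothetical helper, and the struct (field 1 an optional string `file_path`, field 2 an i64) is only an illustrative stand-in for the real column chunk metadata.

```python
def varint(n):
    """Encode a non-negative integer as a ULEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def zigzag(n):
    """Map a signed 64-bit integer to an unsigned one (zigzag encoding)."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


BINARY, I64 = 8, 6  # compact-protocol type ids for binary/string and i64


def encode_struct(fields):
    """Encode (field_id, type_id, value) tuples sorted by id; None means unset."""
    out = bytearray()
    last_id = 0
    for field_id, type_id, value in fields:
        if value is None:
            continue  # an unset optional field writes nothing at all
        delta = field_id - last_id
        if 0 < delta <= 15:
            out.append((delta << 4) | type_id)  # short form: id delta + type
        else:
            out.append(type_id)                 # long form: type, then field id
            out += varint(zigzag(field_id))
        if type_id == BINARY:
            data = value.encode("utf-8")
            out += varint(len(data)) + data
        else:
            out += varint(zigzag(value))
        last_id = field_id
    out.append(0x00)  # stop byte ends the struct
    return bytes(out)


without_path = encode_struct([(1, BINARY, None), (2, I64, 4)])
with_path = encode_struct([(1, BINARY, "a.parquet"), (2, I64, 4)])
print(len(without_path), len(with_path))  # 3 14
```

The unset field leaves no trace in the output: the next field's header just carries an id delta of 2 instead of 1. Setting the path costs one header byte, one length byte, and the string itself.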



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

