adamreeve commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1639049036
##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +971,44 @@ struct ColumnChunk {
9: optional binary encrypted_column_metadata
}
+struct ColumnChunkV3 {
+ /** File where column data is stored. **/
+ 1: optional string file_path
Review Comment:
>> And the benefits of using this _metadata index file should translate to
cloud object stores too by reducing the number of objects/files to be read.
> not really
Sorry, I couldn't really follow this argument; it sounds like a
Hadoop-specific problem. To me, a cloud object store means something like S3.
For our use case, we're mostly concerned with reducing the number of objects
that need to be read to satisfy queries that filter data and don't need every
file in a dataset, because we have many concurrent jobs running and adding
load on storage.
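A minimal sketch of the kind of pruning meant here, assuming a hypothetical in-memory index of per-file min/max statistics such as a reader might build from a single `_metadata` object. The `ChunkStats` class, the `files_to_read` helper, and the sample values are illustrative only and are not part of any Parquet API:

```python
from dataclasses import dataclass


@dataclass
class ChunkStats:
    """Hypothetical summary of one column chunk listed in the index."""
    file_path: str   # data file that holds the column chunk
    min_value: int   # minimum of the filter column in that chunk
    max_value: int   # maximum of the filter column in that chunk


def files_to_read(index, lo, hi):
    """Return the data files whose [min, max] range overlaps [lo, hi]."""
    return sorted(
        {c.file_path for c in index if c.max_value >= lo and c.min_value <= hi}
    )


# Made-up statistics for data sorted on the filter column.
index = [
    ChunkStats("part-0.parquet", 0, 99),
    ChunkStats("part-1.parquet", 100, 199),
    ChunkStats("part-2.parquet", 200, 299),
]

# Only one data object has to be fetched for this filter.
selected = files_to_read(index, 120, 150)
```

With data laid out in sort order, the ranges barely overlap, so a selective filter touches one index object plus a small number of data objects instead of every file footer.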
> Another Parquet dev mailing list thread with some discussion about this:
https://lists.apache.org/thread/142yj57c68s2ob5wkrs80xsjoksm7rb7
Much of the discussion there seems to be related to issues users can run
into when doing things like overwriting Parquet files or having heterogeneous
schemas, which this feature was not designed for. But it sounds like others
have also found this feature useful. I think this quote from Patrick Woody
matches our experience: "outstandingly useful when you have well laid out data
with a sort-order".
> @adamreeve I see so the parquet file is one with all the metadata and all
the data is in files pointed to by this singleton.
Yes, exactly.
The `_metadata` file format could have been designed so that the `file_path`
field wasn't needed in the column chunk metadata. But it's there now and
provides value to users while adding minimal overhead to those not using it
(missing fields require zero space in serialized messages if I've understood
the [Thrift Compact
Protocol](https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md)
correctly).
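To illustrate the size claim, here is a hand-rolled sketch of Thrift Compact Protocol struct encoding, based on my reading of the linked spec rather than on the Thrift library itself. It encodes a simplified, hypothetical two-field struct (`1: optional string file_path`, `2: i64 file_offset`), not the real `ColumnChunk` definition, and the helper names are made up for this example:

```python
T_I64 = 6      # compact-protocol type id for i64
T_BINARY = 8   # compact-protocol type id for binary/string


def varint(n: int) -> bytes:
    """Encode a non-negative integer as a ULEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one (zigzag encoding)."""
    return (n << 1) ^ (n >> 63)


def encode_chunk(file_offset: int, file_path=None) -> bytes:
    """Encode the simplified struct; an unset file_path writes nothing."""
    out = bytearray()
    last_id = 0
    if file_path is not None:
        # Short-form field header: (field id delta << 4) | type id.
        out.append(((1 - last_id) << 4) | T_BINARY)
        data = file_path.encode("utf-8")
        out += varint(len(data)) + data
        last_id = 1
    out.append(((2 - last_id) << 4) | T_I64)
    out += varint(zigzag(file_offset))
    out.append(0x00)  # stop field terminates the struct
    return bytes(out)


without_path = encode_chunk(4)
with_path = encode_chunk(4, "a.parquet")
```

In this sketch the unset field contributes no bytes at all: the following field header just carries a larger field id delta. When `file_path` is set, it costs one header byte, a length varint, and the string bytes.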
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]