fjetter commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1604657513
##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +971,44 @@ struct ColumnChunk {
9: optional binary encrypted_column_metadata
}
+struct ColumnChunkV3 {
+ /** File where column data is stored. **/
+ 1: optional string file_path
Review Comment:
Dask does support this, but it is causing quite a few headaches. It has
been the default for a while, but it turned out to be impractical for any
real-world dataset, and we currently support it mostly for backwards
compatibility. Especially when statistics are included, the metadata file
became impractical to use due to its size, so this new metadata proposal will
likely not change much about its usefulness.
I can see how this could be used to build something useful, but the way Dask
is using it right now is not reason enough to keep it (cc @rjzamora in case
you have a different perspective).
Also, I'm not even sure whether pyarrow exposes an API to store the metadata
exclusively in an external file. From what I understand, the external copy is
always a duplicate, so this field may not actually be required to support
what we are doing.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]