adamreeve commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1629001537


##########
src/main/thrift/parquet.thrift:
##########
@@ -885,6 +971,44 @@ struct ColumnChunk {
   9: optional binary encrypted_column_metadata
 }
 
+struct ColumnChunkV3 {
+  /** File where column data is stored. **/
+  1: optional string file_path

Review Comment:
   We at G-Research would very much like to see support for `_metadata` files 
(and therefore the `file_path` field) remain in Parquet V3. Using a `_metadata` 
file gives us a significant performance benefit by reducing the number of 
file system operations required for certain queries when using the Arrow 
Dataset library. This is especially important on a network file system. 
In a benchmark comparing query performance with and without a `_metadata` 
file, I found that queries were nearly 5x faster when using a Dataset built from 
the `_metadata` file. This was a [synthetic 
benchmark](https://github.com/adamreeve/dataset-metadata-benchmark/tree/caac9d5115cfcda309939cf7deafb78f36b41d37)
 run against a local instance of minio rather than our production data and 
infrastructure, but it demonstrates the performance benefit a `_metadata` file 
can provide.
   
   We understand that the main reason for deprecating this field is that people 
can use alternative open table formats like Apache Iceberg or Delta Lake which 
provide similar performance benefits, but these add a lot of complexity and 
include many extra features that we don't always need. We also use .NET and are 
working on creating bindings to the Arrow Dataset library for .NET, and don't 
want to also have to develop a .NET Iceberg implementation.
   
   From what I can see, there also doesn't seem to be much harm in keeping 
this field, as it can safely be ignored by Parquet readers that don't use 
`_metadata` files.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


