alamb opened a new issue, #5770:
URL: https://github.com/apache/arrow-rs/issues/5770

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Several proposals aim to remedy perceived issues with Parquet, generally 
by introducing new formats, for example [Lance 
V2](https://blog.lancedb.com/lance-v2/) and 
[Nimble](https://github.com/facebookincubator/nimble).
   
   One of the technical challenges raised about Parquet is that the 
[metadata](https://docs.rs/parquet/latest/parquet/file/metadata/index.html)  is 
encoded such that the entire footer must be read and decoded prior to reading 
any data. 
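
   To make the footer layout concrete: a Parquet file with a plaintext 
footer ends with a 4-byte little-endian metadata length followed by the magic 
bytes `PAR1`. The following is a minimal sketch, using only the Rust standard 
library, of reading that length from the last 8 bytes of a file. The function 
names `footer_len_from_tail` and `footer_len` are made up for illustration and 
are not part of the `parquet` crate, and the sketch does not handle files with 
encrypted footers.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Magic bytes that end a Parquet file with a plaintext footer.
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

/// Decode the footer metadata length from the last 8 bytes of a file.
/// Returns `None` if the trailing magic bytes are not `PAR1`.
/// (Hypothetical helper, not an API of the `parquet` crate.)
fn footer_len_from_tail(tail: &[u8; 8]) -> Option<u32> {
    if tail[4..] != PARQUET_MAGIC[..] {
        return None;
    }
    // The first 4 bytes are the metadata length, little-endian.
    Some(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}

/// Seek to the last 8 bytes of `reader` and decode the footer length.
/// Fails with an I/O error if the input is shorter than 8 bytes.
fn footer_len<R: Read + Seek>(reader: &mut R) -> io::Result<Option<u32>> {
    reader.seek(SeekFrom::End(-8))?;
    let mut tail = [0u8; 8];
    reader.read_exact(&mut tail)?;
    Ok(footer_len_from_tail(&tail))
}
```

   Calling `footer_len` on an opened `std::fs::File` gives the encoded 
metadata size in bytes without decoding the metadata itself, which is the 
quantity that has to be fetched before any data can be read.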
   
   As the number of columns increases, the argument goes, the size of the 
Parquet metadata grows beyond the ~8MB sweet spot for a single object store 
request, and decoding it requires substantial CPU.
   
   However, my theory is that the reason Parquet metadata is typically so 
large for schemas with many columns is the embedded min/max 
[statistical](https://docs.rs/parquet/latest/parquet/file/metadata/struct.ColumnChunkMetaData.html#method.statistics)
 values for columns / pages.
   
   
   **Describe the solution you'd like**
   I would like to gather data on parquet footer metadata size as a function of:
   1. The number of columns
   2. The number of row groups
   3. Whether statistics are enabled or disabled
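
   The three variables above form a measurement grid. A rough sketch of how 
that grid could be walked is below; `Sample` and `sweep` are hypothetical 
names, and the `measure` closure is a stand-in for whatever actually writes a 
file with the given shape and returns its footer size in bytes. No real 
measurements are implied here.

```rust
/// One point in the (hypothetical) experiment grid.
#[derive(Debug, Clone, PartialEq)]
struct Sample {
    columns: usize,
    row_groups: usize,
    statistics: bool,
    footer_bytes: u64,
}

/// Call `measure(columns, row_groups, statistics)` for every combination of
/// the three variables and collect the results in iteration order.
fn sweep<F>(columns: &[usize], row_groups: &[usize], mut measure: F) -> Vec<Sample>
where
    F: FnMut(usize, usize, bool) -> u64,
{
    let mut samples = Vec::new();
    for &c in columns {
        for &g in row_groups {
            // Statistics disabled first, then enabled.
            for s in [false, true] {
                samples.push(Sample {
                    columns: c,
                    row_groups: g,
                    statistics: s,
                    footer_bytes: measure(c, g, s),
                });
            }
        }
    }
    samples
}
```

   Keeping the measurement behind a closure means the same grid can be reused 
for footer size on disk and, separately, for decoded in-memory size.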
   
   And then report this in a blog post, with a conclusion about how well 
Parquet can handle large schemas.
   
   Bonus points if we can also measure in-memory size (though this will of 
course vary from implementation to implementation).
   
   **Describe alternatives you've considered**
   
   
   **Additional context**
   Related discussion with @wesm on twitter: 
https://twitter.com/wesmckinn/status/1790884370603024826
   
   Cited issue https://github.com/apache/arrow/issues/39676
   

