alamb opened a new issue, #5770: URL: https://github.com/apache/arrow-rs/issues/5770
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

There are several proposals for remedying perceived issues with Parquet, which generally propose new formats, for example [Lance V2](https://blog.lancedb.com/lance-v2/) and [Nimble](https://github.com/facebookincubator/nimble).

One of the technical challenges raised about Parquet is that the [metadata](https://docs.rs/parquet/latest/parquet/file/metadata/index.html) is encoded such that the entire footer must be read and decoded prior to reading any data. As the number of columns increases, the argument goes, the size of the Parquet metadata grows beyond the ~8MB sweet spot for a single object store request, and decoding it requires substantial CPU.

However, my theory is that the reason Parquet metadata is typically so large for schemas with many columns is the embedded min/max [statistical](https://docs.rs/parquet/latest/parquet/file/metadata/struct.ColumnChunkMetaData.html#method.statistics) values for columns / pages.

**Describe the solution you'd like**

I would like to gather data on Parquet footer metadata size as a function of:

1. The number of columns
2. The number of row groups
3. Whether statistics are enabled / disabled

And then report this in a blog post with some sort of conclusion about how well Parquet can handle large schemas.

Bonus points if we can also measure in-memory size (though this will of course vary from implementation to implementation).

**Describe alternatives you've considered**

**Additional context**

Related discussion with @wesm on Twitter: https://twitter.com/wesmckinn/status/1790884370603024826

Cited issue: https://github.com/apache/arrow/issues/39676
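
As a possible starting point for gathering this data, the encoded footer size can be read from the end of a file without decoding the metadata itself: a Parquet file ends with the Thrift-encoded file metadata, then a 4-byte little-endian length of that metadata, then the magic bytes `PAR1`. The sketch below uses only the Python standard library. The function name `footer_metadata_size` is hypothetical and not part of any Parquet library, and the sketch assumes a plain (non-encrypted) footer. It reports only the encoded size on disk, not decode cost or in-memory size.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"
# Smallest possible layout: leading magic (4) + footer length (4) + trailing magic (4).
MIN_FILE_SIZE = 12


def footer_metadata_size(f):
    """Return the size in bytes of the encoded footer metadata.

    `f` is any seekable binary file object. Hypothetical helper for
    illustration; it assumes an unencrypted footer ending in b"PAR1".
    """
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    if file_size < MIN_FILE_SIZE:
        raise ValueError("file is too small to be a Parquet file")

    # The last 8 bytes are: <4-byte little-endian metadata length><b"PAR1">.
    f.seek(-8, io.SEEK_END)
    tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic at end of file")

    (metadata_len,) = struct.unpack("<I", tail[:4])
    if metadata_len > file_size - MIN_FILE_SIZE:
        raise ValueError("footer length exceeds file size")
    return metadata_len
```

Running this over files written with varying column counts, row group counts, and statistics settings would give the footer size for each combination, for example via `with open(path, "rb") as f: footer_metadata_size(f)`.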
