tustvold commented on issue #4328:
URL: https://github.com/apache/arrow-rs/issues/4328#issuecomment-2102419130

Sorry I am a bit late to the party here. Correctly interpreting the statistics requires more than just [Statistics](https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html), as there is additional information that specifies things like sort order, truncation, logical types, etc. It is very likely the existing logic in DF is incorrect, which is fine, but we shouldn't commit to an API here that prevents us from doing this correctly. Additionally, the API also needs to handle the [Page Index](https://docs.rs/parquet/latest/parquet/file/page_index/index/struct.PageIndex.html), which exposes slightly different information from what is encoded in the file metadata.

I don't mean to discourage you, but this is one of the most arcane and subtle areas of parquet, and I wonder if it might be worth starting out with something a little simpler as a first contribution? I'd recommend any of the issues marked "good first issue". As it stands, this ticket needs extensive research and design work from someone with a good deal of knowledge about parquet before even getting started on what will likely be pretty complex code. There are still ongoing discussions on parquet-format about correctly interpreting statistics; the standard under-specifies a number of key things :sweat_smile:.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
