JigaoLuo commented on issue #16374: URL: https://github.com/apache/datafusion/issues/16374#issuecomment-2993567391
@alamb @zhuqi-lucas Thank you for this issue and the PR. This could significantly aid query processing on Parquet. I was **never** aware of `key_value_metadata` before and am grateful for the insight: today is the first time I noticed that it exists in both [ColumnMetaData](https://github.com/apache/parquet-format/blob/87f2c8bf77eefb4c43d0ebaeea1778bd28ac3609/src/main/thrift/parquet.thrift#L900) and [FileMetaData](https://github.com/apache/parquet-format/blob/87f2c8bf77eefb4c43d0ebaeea1778bd28ac3609/src/main/thrift/parquet.thrift#L1267).

@alamb's argument also reminded me of a paper from the German DB Conference, linked here for reference: https://dl.gi.de/server/api/core/bitstreams/9c8435ee-d478-4b0e-9e3f-94f39a9e7090/content

<details>
<summary>At the end of its Section 2.3:</summary>

> The only statistics available in Parquet files are the cardinality of the contained dataset and each page’s minimum and maximum values. Unfortunately, the minimum and maximum values are optional fields, so Parquet writers are not forced to use them. ... These minimum and maximum values, as well as the cardinality of the datasets, are the only sources available for performing cardinality estimates. Therefore, we get imprecise results since we do not know how the data is distributed within the given boundaries. As a consequence, we get erroneous cardinality estimates and suboptimal query plans.
>
> ... This shows how crucial a good cardinality estimate is for a Parquet scan to be an acceptable alternative to database relations. The Parquet scan cannot get close to the execution times of database relations as long as the query optimizer cannot choose the same query plans for the Parquet files

</details>

In my experience, there’s a **widespread underappreciation for the configurability of Parquet files**. Many practitioners default to blaming Parquet’s performance or feature limitations, such as the lack of HLL sketches.
This often leads to unfair comparisons with proprietary formats, which are fine-tuned and cherry-picked.
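
To make the quoted point about min/max statistics concrete, here is a minimal sketch of how a range estimate derived only from row count, minimum, and maximum can go wrong on skewed data. The data, the function name `estimate_rows_uniform`, and the uniform-distribution assumption are all hypothetical illustrations; they are not taken from the paper or from DataFusion's optimizer.

```python
def estimate_rows_uniform(num_rows, col_min, col_max, lo, hi):
    """Estimate rows with lo <= value <= hi, assuming values are spread
    uniformly over [col_min, col_max] (an assumed fallback when only
    min/max statistics are available)."""
    if col_max == col_min:
        return float(num_rows) if lo <= col_min <= hi else 0.0
    overlap = min(hi, col_max) - max(lo, col_min)
    if overlap < 0:
        return 0.0
    return num_rows * overlap / (col_max - col_min)


# Hypothetical skewed column: 990 small values plus 10 large outliers.
values = [i % 10 for i in range(990)] + [1000] * 10

num_rows = len(values)
col_min, col_max = min(values), max(values)

# Predicate: 0 <= value <= 9
estimated = estimate_rows_uniform(num_rows, col_min, col_max, 0, 9)
actual = sum(1 for v in values if 0 <= v <= 9)

print(f"estimated={estimated:.0f}, actual={actual}")
```

In this toy case the estimate is off by roughly two orders of magnitude, which is the kind of gap that extra statistics stored alongside the file could help close.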