JigaoLuo commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-2993567391

   @alamb @zhuqi-lucas Thank you for this issue and the PR. This could 
significantly aid query processing on Parquet. 
   
   I was **never** aware of `key_value_metadata` before and am grateful for 
the insight: today is the first time I noticed that it is present in both 
[ColumnMetaData](https://github.com/apache/parquet-format/blob/87f2c8bf77eefb4c43d0ebaeea1778bd28ac3609/src/main/thrift/parquet.thrift#L900)
 and 
[FileMetaData](https://github.com/apache/parquet-format/blob/87f2c8bf77eefb4c43d0ebaeea1778bd28ac3609/src/main/thrift/parquet.thrift#L1267).
 @alamb's argument also reminded me of a paper from the German DB 
Conference, linked here for reference: 
https://dl.gi.de/server/api/core/bitstreams/9c8435ee-d478-4b0e-9e3f-94f39a9e7090/content
   
   <details>
   
   <summary>Quoted from the end of Section 2.3 of the paper:</summary>
   
   > The only statistics available in Parquet files are the cardinality of the 
contained dataset and
   each page’s minimum and maximum values. Unfortunately, the minimum and 
maximum values are optional fields, so Parquet writers are not forced to use 
them. ...  These minimum and maximum values, as well as the cardinality of the 
datasets, are the only sources available for performing cardinality estimates. 
Therefore, we get imprecise results since we do not know how the data is 
distributed within the given boundaries. As a consequence, we get erroneous 
cardinality estimates and suboptimal query plans.
   
   > ...  This shows how crucial a good cardinality estimate is for a Parquet 
scan to be
   an acceptable alternative to database relations. The Parquet scan cannot get 
close to the
   execution times of database relations as long as the query optimizer cannot 
choose the same query plans for the Parquet files
   
   </details>
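
To make the quoted point concrete, here is a minimal sketch of my own (not taken from the paper or from DataFusion). It assumes an optimizer that knows only the row count and the min/max of a column and therefore assumes a uniform distribution between them. The data, the `uniform_estimate` helper, and the predicate bounds are all hypothetical.

```python
import random


def uniform_estimate(row_count, min_val, max_val, lo, hi):
    """Estimate rows with lo <= x <= hi, assuming values are spread
    uniformly over [min_val, max_val] (all we can assume from min/max)."""
    if max_val == min_val:
        # Degenerate range: every row has the same value.
        return float(row_count) if lo <= min_val <= hi else 0.0
    overlap = max(0.0, min(hi, max_val) - max(lo, min_val))
    return row_count * overlap / (max_val - min_val)


random.seed(0)
# Hypothetical skewed column: 99% of the values lie in [0, 10],
# 1% are outliers spread up to 1000.
values = [random.uniform(0, 10) for _ in range(9900)]
values += [random.uniform(10, 1000) for _ in range(100)]

# The only statistics assumed to be available to the optimizer.
row_count = len(values)
min_val, max_val = min(values), max(values)

# Predicate: 0 <= x <= 10
lo, hi = 0.0, 10.0
estimated = uniform_estimate(row_count, min_val, max_val, lo, hi)
actual = sum(1 for v in values if lo <= v <= hi)

print(f"estimated rows: {estimated:.0f}, actual rows: {actual}")
```

With data skewed like this, the min/max-based estimate falls far below the true row count, which is the kind of error that can push a join-order or plan choice in the wrong direction.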
   
   In my experience, there’s a **widespread underappreciation of the 
configurability of Parquet files**. Many practitioners default to blaming 
Parquet itself for poor performance or for missing features, such as the lack 
of HLL (HyperLogLog) sketches. This often leads to unfair comparisons with 
proprietary formats, which are fine-tuned and cherry-picked.
   
   



