alamb opened a new issue, #10934:
URL: https://github.com/apache/datafusion/issues/10934

   ### Is your feature request related to a problem or challenge?
   
   As we work to make extracting statistics from parquet data pages more 
correct and performant in https://github.com/apache/datafusion/issues/10922, 
one thing that would be good is to have benchmark coverage.
   
   
   
   ### Describe the solution you'd like
   
   
   
   Add a benchmark for extracting page statistics 
   
   ### Describe alternatives you've considered
   
   Add a benchmark 
([source](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/benches/parquet_statistic.rs#L152))
 for extracting data page statistics
   
   The benchmarks in that file are run via
   
   ```shell
   cargo bench --bench parquet_statistic
   ```
   
   
   In order to create a reasonable number of data page statistics, it would be 
good to configure the parquet writer to limit the size of data pages:
   
   
https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/benches/parquet_statistic.rs#L75
   
   
   And use 
https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.data_page_row_count_limit
 to set the limit to 1 and then send the data in row by row, as we did in 
the test:
   
   
https://github.com/apache/datafusion/blob/d175163ef6442056d8210de9b0e28e264c39ca2c/datafusion/core/tests/parquet/arrow_statistics.rs#L105-L130
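   A minimal sketch of that writer setup, assuming the `arrow` and `parquet` 
crates. The helper names `write_one_row_per_page` and `example_batch` are 
hypothetical and are not taken from the benchmark or the test. Note that the 
link above points at the getter on `WriterProperties`; the limit itself is set 
through the builder's `set_data_page_row_count_limit`.

   ```rust
   use std::sync::Arc;

   use arrow::array::{ArrayRef, Int64Array};
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::ArrowWriter;
   use parquet::errors::Result;
   use parquet::file::properties::{EnabledStatistics, WriterProperties};

   /// Hypothetical helper: writes `batch` to an in-memory parquet file,
   /// aiming for one row per data page.
   fn write_one_row_per_page(batch: &RecordBatch) -> Result<Vec<u8>> {
       let props = WriterProperties::builder()
           // Ask the writer to close a data page after a single row
           .set_data_page_row_count_limit(1)
           // Request page-level statistics explicitly
           .set_statistics_enabled(EnabledStatistics::Page)
           .build();

       let mut buffer = Vec::new();
       let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), Some(props))?;

       // Send the rows in one at a time so the encoder gets a chance to
       // apply the page row count limit between writes
       for i in 0..batch.num_rows() {
           writer.write(&batch.slice(i, 1))?;
       }
       writer.close()?;
       Ok(buffer)
   }

   /// Hypothetical input: a single Int64 column named "id" with 1000 rows.
   fn example_batch() -> RecordBatch {
       let col: ArrayRef = Arc::new(Int64Array::from_iter_values(0..1_000));
       RecordBatch::try_from_iter([("id", col)]).unwrap()
   }
   ```

   Calling `write_one_row_per_page(&example_batch())` should yield a file with 
many small data pages, which is the shape of input the page statistics 
benchmark needs. Writing row by row is slow, so this setup belongs outside the 
timed portion of the benchmark.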
   
   ### Additional context
   
   The need for a benchmark also came up in 
https://github.com/apache/datafusion/pull/10932


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org
