alamb opened a new issue, #10934: URL: https://github.com/apache/datafusion/issues/10934
### Is your feature request related to a problem or challenge?

As we work to make extracting statistics from parquet data pages more correct and performant in https://github.com/apache/datafusion/issues/10922, one thing that would be good is to have benchmark coverage.

### Describe the solution you'd like

Add a benchmark for extracting page statistics.

### Describe alternatives you've considered

Add a benchmark ([source](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/benches/parquet_statistic.rs#L152)) for extracting data page statistics.

The benchmarks are run via

```shell
cargo bench --bench parquet_statistic
```

In order to create a reasonable number of data page statistics, it would be good to configure the parquet writer to limit the size of data pages: https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/benches/parquet_statistic.rs#L75

Use https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.data_page_row_count_limit to set the limit to 1, and then send the data in row by row, as we did in the test: https://github.com/apache/datafusion/blob/d175163ef6442056d8210de9b0e28e264c39ca2c/datafusion/core/tests/parquet/arrow_statistics.rs#L105-L130

### Additional context

The need for a benchmark also came up in https://github.com/apache/datafusion/pull/10932