samuelcolvin opened a new issue, #6454: URL: https://github.com/apache/arrow-rs/issues/6454
**Describe the bug**

I noticed this while investigating https://github.com/apache/datafusion/issues/7845#issuecomment-2370455772.

The suggestion from @jayzhan211 and @alamb was that `datafusion.execution.parquet.pushdown_filters true` should improve performance of queries like this, but it seems to make them slower.

I think the reason is that data is being decompressed twice (or data is being decompressed that shouldn't be). Here's a screenshot from samply running on [this code](https://github.com/samuelcolvin/batson-perf):

<img width="1596" alt="image" src="https://github.com/user-attachments/assets/b3268dd8-8264-4cd4-972c-0ed3f20a3a4c">

(You can view this flamegraph properly [here](https://share.firefox.dev/3zrdUpN))

You can see that there are two blocks of decompression work. The second one is associated with `parquet::column::reader::GenericColumnReader::skip_records` and happens after the first block of decompression and the query itself have completed.

In particular, you can see that there's a `read_new_page()` call in `parquet::column::reader::GenericColumnReader::skip_records` (line 335) that's taking a lot of time:

<img width="798" alt="image" src="https://github.com/user-attachments/assets/abfce516-1eae-4ac3-a240-1a0686a37fe4">

My question is: could this second run of decompression be avoided?

**To Reproduce**

Clone https://github.com/samuelcolvin/batson-perf, comment out one of the modes, compile with profiling enabled (`cargo build --profile profiling`), then run with samply (`samply record ./target/profiling/batson-perf`).

**Expected behavior**

I would expect `datafusion.execution.parquet.pushdown_filters true` to be faster. I think the reason it isn't is that the data is being decompressed twice.

**Additional context**

https://github.com/apache/datafusion/issues/7845#issuecomment-2370455772

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
