[I] `parquet::column::reader::GenericColumnReader::skip_records` still decompresses most data [arrow-rs]

via GitHub Wed, 25 Sep 2024 02:18:35 -0700


samuelcolvin opened a new issue, #6454:
URL: https://github.com/apache/arrow-rs/issues/6454


   **Describe the bug**
   
   I noticed this while investigating 
https://github.com/apache/datafusion/issues/7845#issuecomment-2370455772.
   
   The suggestion from @jayzhan211 and @alamb was that 
`datafusion.execution.parquet.pushdown_filters true` should improve performance 
of queries like this, but it seems to make them slower.
   
   I think the reason is that data is being decompressed twice (or data is 
being decompressed that shouldn't be). Here's a screenshot from samply running 
on [this code](https://github.com/samuelcolvin/batson-perf):
   
   <img width="1596" alt="image" 
src="https://github.com/user-attachments/assets/b3268dd8-8264-4cd4-972c-0ed3f20a3a4c">
   
   (You can view this flamegraph properly 
[here](https://share.firefox.dev/3zrdUpN))
   
   You can see that there are two blocks of decompression work. The second one 
is associated with `parquet::column::reader::GenericColumnReader::skip_records` 
and happens after both the first decompression chunk and the query execution 
have completed.
   
   In particular, you can see that there's a `read_new_page()` call in 
`parquet::column::reader::GenericColumnReader::skip_records` (line 335) 
that's taking a lot of time:
   
   <img width="798" alt="image" 
src="https://github.com/user-attachments/assets/abfce516-1eae-4ac3-a240-1a0686a37fe4">
   
   My question is: could this second run of decompression be avoided?
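
   As a rough illustration of why a skip can still cost decompression, here is 
a toy model. It is only a sketch under assumptions: `Page`, `Stats` and this 
`skip_records` are hypothetical names, not the arrow-rs types or 
implementation. It assumes a page can be dropped from its header alone only 
when the whole page falls inside the skipped range, and that a skip ending 
partway through a page forces that page to be decompressed.

   ```rust
   // Toy model only: `Page`, `Stats`, and `skip_records` are hypothetical,
   // not the arrow-rs types or implementation.

   /// A page whose header gives the row count without decompressing the body.
   struct Page {
       num_rows: usize,
   }

   /// Counters for how much work a skip performed.
   #[derive(Debug, Default, PartialEq)]
   struct Stats {
       pages_skipped: usize,      // dropped using the header only
       pages_decompressed: usize, // body had to be decompressed
   }

   /// Skip `to_skip` records, starting at the first page.
   fn skip_records(pages: &[Page], mut to_skip: usize) -> Stats {
       let mut stats = Stats::default();
       for page in pages {
           if to_skip == 0 {
               break;
           }
           if page.num_rows <= to_skip {
               // The whole page is skipped: no decompression required.
               stats.pages_skipped += 1;
               to_skip -= page.num_rows;
           } else {
               // The skip ends inside this page, so it must be decompressed
               // to locate the first record that is kept.
               stats.pages_decompressed += 1;
               to_skip = 0;
           }
       }
       stats
   }

   fn main() {
       let pages: Vec<Page> = (0..4).map(|_| Page { num_rows: 100 }).collect();
       // Two whole pages are skipped; the third is only partially skipped.
       println!("{:?}", skip_records(&pages, 250));
   }
   ```

   Under this model, a filter that selects a few rows from most pages would 
produce many short skips, each ending inside a page, so most pages would 
still be decompressed even though few rows are read.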
   
   **To Reproduce**
   
   Clone https://github.com/samuelcolvin/batson-perf, comment out one of the 
modes, compile with profiling enabled using `cargo build --profile profiling`, 
then run under samply with `samply record ./target/profiling/batson-perf`.
   
   **Expected behavior**
   
   I would expect `datafusion.execution.parquet.pushdown_filters true` to be 
faster. I think the reason it isn't is that the data is being decompressed twice.
   
   **Additional context**
   
   https://github.com/apache/datafusion/issues/7845#issuecomment-2370455772


