[PR] Prefetch next Parquet row group data while decoding current one [datafusion]

via GitHub Sat, 04 Apr 2026 14:32:30 -0700


Dandandan opened a new pull request, #21373:
URL: https://github.com/apache/datafusion/pull/21373


   ## Summary
   - Uses `ParquetPushDecoder::try_next_reader()` to obtain a synchronous `ParquetRecordBatchReader` for the current row group, then spawns IO for the next row group via `tokio::spawn` while iterating batches
   - This overlaps IO with compute, avoiding idle time between row groups that was visible in profiling traces
   - The arrow-rs `try_next_reader()` API was designed for exactly this pipelining pattern
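
The overlap described in the summary can be illustrated with a minimal, standard-library-only sketch. This is not the code from the PR: it uses `std::thread::spawn` as a stand-in for `tokio::spawn`, and `fetch_row_group`, `decode_row_group`, and `decode_with_prefetch` are hypothetical names that simulate IO and decoding rather than calling the real arrow-rs or DataFusion APIs.

```rust
use std::thread;

/// Hypothetical stand-in for IO: returns the raw bytes of one row group.
/// In the PR this would be an asynchronous read, not an in-memory Vec.
fn fetch_row_group(index: usize) -> Vec<u8> {
    vec![index as u8; 4]
}

/// Hypothetical stand-in for the CPU-bound decode of one row group.
fn decode_row_group(bytes: &[u8]) -> Vec<u64> {
    bytes.iter().map(|b| u64::from(*b) + 1).collect()
}

/// Decodes row groups in order while prefetching the next one in the
/// background, so the fetch for group i + 1 overlaps the decode of group i.
fn decode_with_prefetch(num_row_groups: usize) -> Vec<Vec<u64>> {
    let mut decoded = Vec::with_capacity(num_row_groups);
    if num_row_groups == 0 {
        return decoded;
    }

    // The first row group has nothing to overlap with, so fetch it directly.
    let mut current = fetch_row_group(0);

    for i in 0..num_row_groups {
        // Start fetching the next row group before decoding the current one.
        let next = if i + 1 < num_row_groups {
            Some(thread::spawn(move || fetch_row_group(i + 1)))
        } else {
            None
        };

        // Decode the current row group while the prefetch runs.
        decoded.push(decode_row_group(&current));

        // Collect the prefetched data; ideally it is already available.
        if let Some(handle) = next {
            current = handle.join().expect("prefetch thread panicked");
        }
    }
    decoded
}
```

Calling `decode_with_prefetch(3)` in this sketch decodes three simulated row groups in order, with each background fetch started before the preceding group is decoded. Only one prefetch is in flight at a time, which bounds the extra memory to a single row group's data.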
   
   ## Test plan
   - [x] `cargo test -p datafusion-datasource-parquet` (97 passed)
   - [x] `cargo test -p datafusion --test parquet_integration` (198 passed)
   - [x] `cargo test -p datafusion --test core_integration -- parquet` (28 passed)
   - [ ] Benchmark with dfbench to measure improvement
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


