Dandandan opened a new pull request, #21682:
URL: https://github.com/apache/datafusion/pull/21682
## Which issue does this PR close?
Stacks on top of #21351.
## Rationale for this change
PR #21351 enables dynamic work scheduling in FileStream but keeps the same
single-outstanding-I/O-per-partition property as main. This PR implements the
follow-on item @alamb listed:
> 2. Trying to issue multiple IOs by the same partition (aka to interleave
IO and CPU work more)
It lets each partition prefetch upcoming files while the active reader
decodes the current file, so planner I/O is no longer serialized within a
partition.
## What changes are included in this PR?
1. New `FileStreamState::Prefetch` variant and `PrefetchState` that drives
multiple `PendingMorselPlanner` I/Os concurrently and issues planner I/O for
upcoming files while the active reader is blocked.
2. Prefetching is bounded at `MAX_PREFETCH_MORSELS = 20` in-flight
morsel-producing work items (pending I/O + ready planners + ready morsels +
active reader) to cap buffering.
3. Enabled by default via `FileStreamBuilder`; the legacy single-I/O
`ScanState` path is preserved and can be selected with
`FileStreamBuilder::with_prefetch(false)`.
4. Two new snapshot tests:
- `morsel_prefetch_overlaps_io_across_files` — verifies file2's planner
I/O is issued while file1's I/O is still pending.
- `morsel_no_prefetch_keeps_files_sequential` — verifies
`with_prefetch(false)` preserves the legacy single-I/O behavior.
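
To make the bound in item 2 concrete, here is a minimal sketch of the accounting it describes. Only `MAX_PREFETCH_MORSELS = 20` and the four counted categories come from this PR; the `PrefetchCounts` struct, its field names, and `can_start_prefetch` are hypothetical and are not the actual `PrefetchState` layout.

```rust
/// Cap on in-flight morsel-producing work items, as stated in this PR.
pub const MAX_PREFETCH_MORSELS: usize = 20;

/// Hypothetical counters for illustration; the real `PrefetchState`
/// may track this differently.
#[derive(Debug, Default, Clone, Copy)]
pub struct PrefetchCounts {
    /// Planner I/Os that have been issued but have not completed.
    pub pending_io: usize,
    /// Planners whose I/O completed and that are ready to plan.
    pub ready_planners: usize,
    /// Morsels that are planned and waiting to be read.
    pub ready_morsels: usize,
    /// Whether a reader is currently decoding a morsel.
    pub reader_active: bool,
}

impl PrefetchCounts {
    /// Total work items counted against the cap:
    /// pending I/O + ready planners + ready morsels + active reader.
    pub fn in_flight(&self) -> usize {
        self.pending_io
            + self.ready_planners
            + self.ready_morsels
            + usize::from(self.reader_active)
    }

    /// Assumed rule: new prefetch work may start only while the total
    /// is strictly below the cap.
    pub fn can_start_prefetch(&self) -> bool {
        self.in_flight() < MAX_PREFETCH_MORSELS
    }
}
```

Under this reading, a partition with 10 pending I/Os, 5 ready planners, 4 ready morsels, and an active reader sits exactly at the cap and would not open another file until one of those items drains.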
The reader takes priority over prefetching (step order: poll pending I/O →
poll reader → plan → promote morsel → morselize next file), so opening new
files does not add user-visible latency, and all existing snapshot tests
pass unchanged.
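
The step order can be read as a fixed-priority selection. The sketch below is one plausible interpretation, not the code in this PR: it picks the first step in the listed order that has work available. The `Step` enum, the `Snapshot` flags, and `next_step` are all hypothetical names, and the actual state machine may attempt several steps per poll rather than exactly one.

```rust
/// Steps in the order listed in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    PollPendingIo,
    PollReader,
    Plan,
    PromoteMorsel,
    MorselizeNextFile,
    Idle,
}

/// Hypothetical summary of what work is currently available.
#[derive(Debug, Default, Clone, Copy)]
pub struct Snapshot {
    /// At least one planner I/O is outstanding.
    pub has_pending_io: bool,
    /// A reader is active and can be polled for batches.
    pub has_active_reader: bool,
    /// A planner has finished its I/O and can produce morsels.
    pub has_ready_planner: bool,
    /// A planned morsel is waiting and no reader is active.
    pub can_promote_morsel: bool,
    /// Unopened files remain and the prefetch cap has not been reached.
    pub can_open_next_file: bool,
}

/// Returns the first step, in the documented order, that has work.
/// Polling the reader is checked before any step that starts new work
/// (plan, promote, morselize), which is the "reader first" property.
pub fn next_step(s: &Snapshot) -> Step {
    if s.has_pending_io {
        Step::PollPendingIo
    } else if s.has_active_reader {
        Step::PollReader
    } else if s.has_ready_planner {
        Step::Plan
    } else if s.can_promote_morsel {
        Step::PromoteMorsel
    } else if s.can_open_next_file {
        Step::MorselizeNextFile
    } else {
        Step::Idle
    }
}
```

The design point this illustrates is that opening another file is the lowest-priority action, so it only happens when nothing closer to producing output is runnable.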
## Are these changes tested?
Yes — 27 file_stream tests pass, including the two new prefetch-specific
tests. Full `datafusion-datasource` and `datafusion` crate test suites pass
locally. Clippy is clean on the affected crates.
## Are there any user-facing changes?
Yes — prefetching is on by default, so multi-file scans may now have
multiple planner I/Os in flight per partition. Users can opt out via
`FileStreamBuilder::with_prefetch(false)`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]