Dandandan opened a new pull request, #21766:
URL: https://github.com/apache/datafusion/pull/21766
## Which issue does this PR close?
Follow-up to #21351 (Dynamic work scheduling in FileStream), which closed
#20529 and explicitly deferred *"splitting files into smaller units (e.g.
across row groups)"* as future work. This PR implements that.
- Closes #.
## Rationale for this change
With #21351, sibling FileStreams already steal **whole files** from a
`SharedWorkSource` queue. But a single large parquet file still bottlenecks on
one worker — the other N−1 sibling partitions sit idle even though each row
group is independently readable. This shows up on single-file queries
(ClickBench-style) and on the long-tail large-file case in multi-file scans.
This PR adds row-group granularity: the worker that pops a file donates its
other row groups back to the shared queue so idle siblings steal them.
## What changes are included in this PR?
**Donation path** (`datafusion/datasource-parquet/src/opener.rs`):
- New `ParquetOpenState::SplitAndDonate` state between `LoadMetadata`
and `PrepareFilters`. After metadata load, the donor keeps the first eligible
row group; each remaining one is pushed to the front of the shared queue as a
`PartitionedFile` clone whose `range` is a one-byte `FileRange` at that
row group's starting offset.
- The existing `prune_by_range` path matches that offset and scopes the
stealer to exactly that row group, with no new extension types, no metadata
carried through `PartitionedFile.extensions`, and no access-plan donation.
- If the caller pre-narrowed the scan with a `file_range` that still spans
multiple row groups (byte-range file partitioning), splitting stays **inside**
that range: donated ranges remain subsets of the caller's.
- Guards:
  - Caller-supplied `ParquetAccessPlan` in `extensions` → respected
    as-is, no donation.
  - Single row group in scope (whole file, or caller range isolating one RG)
    → no donation.
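
The split rule described above can be sketched in isolation. This is a minimal illustration, not the code in `opener.rs`: `ByteRange` and `split_and_donate` are hypothetical names, and the selection rule (a range selects a row group when the row group's starting offset falls inside the half-open range) is an assumption about how `prune_by_range` matches offsets.

```rust
/// Hypothetical stand-in for a byte range over a file: half-open `[start, end)`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ByteRange {
    pub start: u64,
    pub end: u64,
}

impl ByteRange {
    /// Assumed selection rule: a range selects a row group when the row
    /// group's starting offset lies inside the range.
    pub fn selects(&self, row_group_start: u64) -> bool {
        self.start <= row_group_start && row_group_start < self.end
    }
}

/// Returns the starting offset of the row group the donor keeps, plus the
/// one-byte ranges it would donate. `None` means "do not split" because
/// fewer than two row groups are in scope.
pub fn split_and_donate(
    row_group_starts: &[u64],
    caller_range: Option<ByteRange>,
) -> Option<(u64, Vec<ByteRange>)> {
    // Only row groups inside the caller's range (if any) are eligible.
    let eligible: Vec<u64> = row_group_starts
        .iter()
        .copied()
        .filter(|&off| match caller_range {
            Some(range) => range.selects(off),
            None => true,
        })
        .collect();
    if eligible.len() < 2 {
        return None;
    }
    // A one-byte range at a row group's start selects exactly that row group
    // and is always a subset of any caller range that selected it.
    let donated = eligible[1..]
        .iter()
        .map(|&off| ByteRange { start: off, end: off + 1 })
        .collect();
    Some((eligible[0], donated))
}
```

With row groups starting at offsets 4, 1000, 2000 and 3000 and no caller range, the sketch keeps offset 4 and donates three one-byte ranges; with a caller range of `[500, 2500)` it keeps 1000 and donates only the range at 2000.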
**Shared queue plumbing**:
- `SharedWorkSource` is now `pub`; gains `push_front(items)`,
`pop_front()`, and `Default`.
- `FileSource::create_morselizer` takes an extra
`Option<SharedWorkSource>` parameter so format-specific morselizers can
participate in donation. Non-parquet sources ignore it.
- `row_group_start_offset` helper is extracted into
`row_group_filter.rs` and reused by both `prune_by_range` and the new
donation path.
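
To make the queue semantics concrete, here is a minimal sketch of a shared front-loaded queue. `WorkQueue` is a hypothetical stand-in, not the real `SharedWorkSource`; the `Arc<Mutex<VecDeque<_>>>` representation and the rule that a pushed batch pops in the order given are assumptions made for illustration.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the shared queue, generic over the work item.
pub struct WorkQueue<T> {
    inner: Arc<Mutex<VecDeque<T>>>,
}

// Manual impls so that `T` itself needs neither `Clone` nor `Default`.
impl<T> Clone for WorkQueue<T> {
    fn clone(&self) -> Self {
        Self { inner: Arc::clone(&self.inner) }
    }
}

impl<T> Default for WorkQueue<T> {
    fn default() -> Self {
        Self { inner: Arc::new(Mutex::new(VecDeque::new())) }
    }
}

impl<T> WorkQueue<T> {
    /// Puts `items` ahead of everything already queued. Assumption: the
    /// batch pops in the order given, so it is inserted back to front.
    pub fn push_front(&self, items: Vec<T>) {
        let mut queue = self.inner.lock().unwrap();
        for item in items.into_iter().rev() {
            queue.push_front(item);
        }
    }

    /// Takes the next unit of work, or `None` when the queue is empty.
    pub fn pop_front(&self) -> Option<T> {
        self.inner.lock().unwrap().pop_front()
    }
}
```

Every clone shares the same underlying deque, so a batch donated through one handle is visible to siblings holding other handles, and it is served before whole files that were queued earlier.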
**Trade-offs** (v1):
- Stealers re-read the parquet footer for their chunk. Object stores
typically cache the range so this is cheap; carrying loaded metadata across
siblings is left for a follow-up.
- If a sibling drains the shared queue *before* the donor has donated, that
sibling terminates (it observes an empty queue at `scan_state.rs`'s
`ScanAndReturn::Done`). Accepted for v1; fixing requires splitter-handles /
queue wakeup and can be added separately.
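
The early-termination trade-off can be replayed with a toy, single-threaded model. This is only an illustration under stated assumptions: `simulate` is a hypothetical function, there is one donor and one sibling, the file has four row groups, and surviving workers take turns popping work. None of this mirrors the actual scheduling in `scan_state.rs`.

```rust
use std::collections::VecDeque;

/// Replays one interleaving and returns the number of row groups read by
/// (donor, sibling). The donor has already popped the only file, so the
/// shared queue starts out empty.
pub fn simulate(sibling_polls_before_donation: bool) -> (usize, usize) {
    const ROW_GROUPS: usize = 4;
    let mut queue: VecDeque<usize> = VecDeque::new();

    // A sibling that polls now sees an empty queue and terminates for good.
    let mut sibling_alive = true;
    if sibling_polls_before_donation {
        sibling_alive = queue.pop_front().is_some();
    }

    // The donor keeps row group 0 and donates the rest, in file order.
    for rg in (1..ROW_GROUPS).rev() {
        queue.push_front(rg);
    }
    let mut donor_reads = 1usize;
    let mut sibling_reads = 0usize;

    // Assumed toy policy: workers that are still alive alternate turns.
    let mut sibling_turn = true;
    while queue.pop_front().is_some() {
        if sibling_alive && sibling_turn {
            sibling_reads += 1;
        } else {
            donor_reads += 1;
        }
        sibling_turn = !sibling_turn;
    }
    (donor_reads, sibling_reads)
}
```

In this model `simulate(false)` splits the work evenly as `(2, 2)`, while `simulate(true)` returns `(4, 0)`: the donated row groups are still read, but only by the donor.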
## Are these changes tested?
Yes. Five new unit tests in `datafusion/datasource-parquet/src/opener.rs`:
- `row_group_split_donates_remaining_row_groups`: donor reads RG 0; three
donated chunks each read exactly their row group, in file order.
- `row_group_split_skips_single_row_group_file`: no donation when the
file has one row group.
- `row_group_split_respects_caller_access_plan`: `ParquetAccessPlan` in
extensions suppresses donation; caller plan executes as specified.
- `row_group_split_within_caller_file_range`: caller byte range covering
all RGs is split; donated ranges stay inside the caller range.
- `row_group_split_skips_when_caller_range_covers_single_row_group`:
narrow caller range isolating one RG suppresses donation.
All existing `datafusion-datasource` and `datafusion-datasource-parquet`
tests continue to pass. `cargo clippy --all-targets --all-features -- -D
warnings` is clean on both crates.
## Are there any user-facing changes?
Performance only — faster single-file and tail-file scans under sibling work
stealing. No semantic or API changes visible to SQL users. `SharedWorkSource`
becomes `pub` (it was `pub(crate)`); `FileSource::create_morselizer`
gains one parameter, which default implementations ignore.
---
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]