Dandandan opened a new pull request, #21731:
URL: https://github.com/apache/datafusion/pull/21731
## Purpose
Test branch combining:
- **#21351** (base) — Dynamic work scheduling in FileStream (inter-file work
stealing across sibling partitions)
- **#21580** (merged on top) — Reorder row groups by statistics during sort
pushdown (intra-file RG reorder for TopK)
The goal is to measure whether the two optimizations compound on TopK-style
queries (e.g. `ORDER BY col LIMIT N` on multi-file / multi-RG parquet), since
they operate at different granularities:
- #21351 balances **files** across partitions at runtime via a shared work
queue.
- #21580 reorders **row groups within a file** so TopK sees the best values
first and tightens the dynamic filter threshold earlier. The tighter filter is
shared across partitions, so it may also amplify #21351's work-stealing gains.
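
To make the expected interaction concrete, here is a toy model of the two granularities. It is only a sketch and not the code from either PR. It assumes `ORDER BY col ASC LIMIT 1`, models a file as a list of row-group minimums, and treats the dynamic filter as a single shared `i64` threshold. The names `FileQueue`, `scan_partition`, and `run` are hypothetical.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

/// Hypothetical row-group statistic: the minimum of the sort column.
type RowGroupMin = i64;
/// A "file" is modeled as just a list of row-group minimums.
type File = Vec<RowGroupMin>;
/// Shared queue that sibling partitions pull files from (inter-file balancing).
type FileQueue = Arc<Mutex<VecDeque<File>>>;

/// Drains files from the shared queue and returns how many row groups had to
/// be read. `threshold` stands in for the TopK dynamic filter (LIMIT 1 only).
fn scan_partition(queue: &FileQueue, threshold: &Mutex<i64>, reorder: bool) -> usize {
    let mut row_groups_read = 0;
    loop {
        // Hold the queue lock only long enough to pop one file.
        let next = queue.lock().unwrap().pop_front();
        let Some(mut file) = next else { break };
        if reorder {
            // Intra-file reorder: visit the row group with the smallest min first.
            file.sort();
        }
        for rg_min in file {
            let mut best = threshold.lock().unwrap();
            if rg_min >= *best {
                // Pruned: this row group cannot improve the current best value.
                continue;
            }
            row_groups_read += 1;
            // Toy update: the row group contains its minimum, so the best
            // value seen so far becomes at most `rg_min`.
            *best = rg_min;
        }
    }
    row_groups_read
}

/// Runs one partition over two hypothetical files and reports row groups read.
fn run(reorder: bool) -> usize {
    let files = vec![vec![50, 10, 30], vec![40, 20, 60]];
    let queue: FileQueue = Arc::new(Mutex::new(VecDeque::from(files)));
    let threshold = Mutex::new(i64::MAX);
    scan_partition(&queue, &threshold, reorder)
}

fn main() {
    println!("row groups read without reorder: {}", run(false));
    println!("row groups read with reorder:    {}", run(true));
}
```

In this model a single partition drains the queue sequentially so the result is deterministic. With real sibling partitions, each would call `scan_partition` on the same queue and threshold from its own thread, and the number of row groups read would depend on scheduling.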
## Conflict resolution
Only one real conflict: `datafusion/datasource/src/source.rs`.
#21580 had been rebased on top of upstream #21576 (which removes the
explicit `as_any` method on `DataSource` in favor of an `Any` supertrait
bound). #21351's base predates #21576, so it still uses the explicit `as_any`
method.
Resolved in favor of #21351's style:
- Kept `DataSource: Send + Sync + Debug` with an `as_any(&self) -> &dyn Any`
method.
- Restored `as_any` impls on `FileScanConfig` and `MemorySourceConfig`.
- Added `use std::any::Any` imports in `file_scan_config/mod.rs` and
`memory.rs`.
- Rewrote `dyn DataSource::{is, downcast_ref}` helpers to call
`self.as_any()` (with a `T: 'static` bound).
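
For reviewers unfamiliar with the pattern, the resolved shape looks roughly like the stand-alone sketch below. It is not the actual `source.rs`: the trait is reduced to the one method that matters here, and the struct bodies (including the `file_count` field) are placeholders.

```rust
use std::any::Any;
use std::fmt::Debug;

/// Reduced stand-in for the trait: explicit `as_any` method, no `Any` supertrait.
trait DataSource: Send + Sync + Debug {
    fn as_any(&self) -> &dyn Any;
}

/// Placeholder struct; `file_count` is a hypothetical field for illustration.
#[derive(Debug)]
struct FileScanConfig {
    file_count: usize,
}

/// Placeholder struct with no fields.
#[derive(Debug)]
struct MemorySourceConfig;

impl DataSource for FileScanConfig {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl DataSource for MemorySourceConfig {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Helpers on the trait object, routed through `as_any()`.
impl dyn DataSource {
    /// Returns true if the concrete type behind the trait object is `T`.
    fn is<T: 'static>(&self) -> bool {
        self.as_any().is::<T>()
    }

    /// Returns a reference to the concrete type if it is `T`.
    fn downcast_ref<T: 'static>(&self) -> Option<&T> {
        self.as_any().downcast_ref::<T>()
    }
}

fn main() {
    let source: Box<dyn DataSource> = Box::new(FileScanConfig { file_count: 3 });
    if let Some(config) = source.downcast_ref::<FileScanConfig>() {
        println!("file scan over {} files", config.file_count);
    }
    println!("is memory source: {}", source.is::<MemorySourceConfig>());
}
```

The `T: 'static` bound is needed because `Any::is` and `Any::downcast_ref` are only defined for `'static` types.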
## Status
**Draft — not for merge.** This is an integration branch for benchmarking.
The two PRs should be reviewed and merged independently upstream.
- [x] `cargo check --workspace` passes
- [ ] `cargo test` not yet run
- [ ] Benchmarks not yet run
## Follow-ups
- Run ClickBench (especially Q23–Q26) and the sort-pushdown benchmarks
(#21582) on this combined branch vs. each PR individually to quantify the
compounding effect.