SubhamSinghal opened a new pull request, #22885:
URL: https://github.com/apache/datafusion/pull/22885
## Which issue does this PR close?
- Related to https://github.com/apache/datafusion/issues/6899 (DENSE_RANK
to follow in a separate PR).
## Rationale for this change
PR #21479 introduced `WindowTopN` for `ROW_NUMBER` only; `RANK` and
`DENSE_RANK` were explicitly out of scope. This PR extends the rule to `RANK`,
replacing the full sort under `Filter(rk≤K) → Window(RANK) → Sort` with a
per-partition heap-of-K plus a boundary-tie buffer.
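To make the heap-plus-tie-buffer idea concrete, here is a minimal single-partition sketch. It is not the `PartitionRankHeap` from this PR: the names `RankTopK` and `rank_top_k` are hypothetical, keys are plain `i64` values ordered ascending, and rows are identified by their index rather than by batch references.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical illustration: keeps every row whose RANK() over an
/// ascending `i64` key is <= k, without sorting the whole input.
struct RankTopK {
    k: usize,
    /// Max-heap holding the k smallest (key, row id) pairs seen so far.
    heap: BinaryHeap<(i64, usize)>,
    /// Extra rows whose key equals the heap's current maximum (the boundary).
    ties: Vec<usize>,
}

impl RankTopK {
    fn new(k: usize) -> Self {
        assert!(k > 0, "k must be positive");
        Self { k, heap: BinaryHeap::with_capacity(k), ties: Vec::new() }
    }

    fn insert(&mut self, key: i64, row: usize) {
        if self.heap.len() < self.k {
            // Heap not full yet: every row so far has rank <= k.
            self.heap.push((key, row));
            return;
        }
        let boundary = self.heap.peek().expect("heap is full").0;
        match key.cmp(&boundary) {
            // Strictly worse than k rows already kept: rank > k, drop it.
            Ordering::Greater => {}
            // Ties with the boundary share its rank, so buffer the row.
            Ordering::Equal => self.ties.push(row),
            Ordering::Less => {
                let (evicted_key, evicted_row) = self.heap.pop().expect("heap is full");
                self.heap.push((key, row));
                let new_boundary = self.heap.peek().expect("heap is non-empty").0;
                if evicted_key == new_boundary {
                    // The evicted row still ties with the boundary: keep it.
                    self.ties.push(evicted_row);
                } else {
                    // The boundary moved down: the evicted row and all old
                    // ties now rank beyond k.
                    self.ties.clear();
                }
            }
        }
    }

    /// Returns the surviving row ids in ascending order.
    fn into_rows(self) -> Vec<usize> {
        let mut rows: Vec<usize> = self.heap.into_iter().map(|(_, r)| r).collect();
        rows.extend(self.ties);
        rows.sort_unstable();
        rows
    }
}

/// Convenience wrapper: row ids of `keys` whose rank is <= k.
fn rank_top_k(keys: &[i64], k: usize) -> Vec<usize> {
    let mut state = RankTopK::new(k);
    for (row, &key) in keys.iter().enumerate() {
        state.insert(key, row);
    }
    state.into_rows()
}
```

For keys `[10, 20, 20, 30]` and `k = 2`, the ranks are 1, 2, 2, 4, so rows 0, 1 and 2 survive even though the heap itself only holds two entries. The tie buffer is what distinguishes this from the `ROW_NUMBER` case, where exactly K rows per partition are kept.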
## What changes are included in this PR?
- `physical-plan/src/topk/mod.rs` — new `pub(crate)` `TopKAddOutcome` /
`EvictedRow` / `TopK::peek_bytes` /
`TopK::insert_row_with_outcome`. `TopKHeap::add` now returns
`Option<EvictedRow>` (capture-before-unuse with same-batch discriminator).
- `physical-plan/src/sorts/partitioned_topk.rs` — new `WindowFnKind`
field, `PartitionState { RowNumber(TopK),
Rank(PartitionRankHeap) }` dispatch, `PartitionRankHeap` + `TieEntry`
(batched ties via `Vec<u32>` of indices per source batch).
- `physical-optimizer/src/window_topn.rs` — `is_row_number` →
`supported_window_fn(expr) -> Option<WindowFnKind>`; empty-`order_by` guard for
RANK.
- SLT + integration tests covering: basic, strict (`<`), flipped
(`>=`/`>`), boundary ties, ties spanning multiple `ORDER BY` values,
empty-`ORDER BY` (rule must NOT fire), mixed window functions, ASC/DESC ×
NULLS FIRST/LAST, QUALIFY, dense_rank-skip.
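The basic, strict, and flipped predicate shapes listed in the tests all reduce to a single limit K on the rank column. The sketch below shows one way that normalization could look. The `CmpOp` and `Operand` types and the `rank_limit` function are hypothetical stand-ins, not DataFusion's expression types or the actual code in `window_topn.rs`.

```rust
/// Hypothetical comparison operators appearing in the filter predicate.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CmpOp {
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// Hypothetical operand: either the rank output column or an integer literal.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operand {
    RankColumn,
    Literal(u64),
}

/// Returns the inclusive rank limit K implied by the predicate, or `None`
/// when the predicate is not an upper bound on the rank column.
fn rank_limit(left: Operand, op: CmpOp, right: Operand) -> Option<u64> {
    match (left, op, right) {
        // rk <= K
        (Operand::RankColumn, CmpOp::LtEq, Operand::Literal(k)) => Some(k),
        // rk < K is the same as rk <= K - 1 for integer ranks
        (Operand::RankColumn, CmpOp::Lt, Operand::Literal(k)) => k.checked_sub(1),
        // K >= rk (flipped form of rk <= K)
        (Operand::Literal(k), CmpOp::GtEq, Operand::RankColumn) => Some(k),
        // K > rk (flipped form of rk < K)
        (Operand::Literal(k), CmpOp::Gt, Operand::RankColumn) => k.checked_sub(1),
        // Lower bounds such as rk >= K cannot be served by a top-K heap.
        _ => None,
    }
}
```

With this shape, `rk < 3` and `2 >= rk` both yield a limit of 2, while `rk >= 3` yields `None`, so a rule built on it would leave that plan untouched.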
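Results from the h2o `window` benchmark on the 10M-row `large` table, RANK top-2,
averaged over 3 iterations. The rule is toggled via
`DATAFUSION_OPTIMIZER_ENABLE_WINDOW_TOPN`. It is faster up to 1K partitions and
slower at 10K partitions and above.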
| Partitions | Variant | OFF (rule disabled) | ON (rule enabled) | Δ |
|---:|---|---:|---:|---|
| 100 | RANK low ties | 326 ms | **121 ms** | **2.70× faster** ✓ |
| 1K | RANK low ties | 266 ms | **118 ms** | **2.26× faster** ✓ |
| 1K | RANK heavy ties | 285 ms | **132 ms** | **2.16× faster** ✓ |
| 10K | RANK low ties | 250 ms | 355 ms | 1.42× slower |
| 10K | RANK heavy ties | 305 ms | 494 ms | 1.62× slower |
| 100K | RANK low ties | 250 ms | 2,498 ms | 9.98× slower |
## Are these changes tested?
Yes:
- `cargo test -p datafusion-physical-plan --lib` — 1455 passed
- `cargo test -p datafusion-physical-optimizer --lib` — 27 passed
- `cargo test -p datafusion --test core_integration
physical_optimizer::window_topn::` — 13 passed (7 ROW_NUMBER + 6 RANK)
- `cargo test --test sqllogictests -- window_topn` — passed
## Are there any user-facing changes?
The existing `optimizer.enable_window_topn` config flag (default `false`)
now also covers `RANK` queries. No public API additions.