zhuqi-lucas opened a new issue, #22405:
URL: https://github.com/apache/datafusion/issues/22405
## Is your feature request related to a problem or challenge?
DataFusion's `skip_partial_aggregation` already runtime-samples: after
`skip_partial_aggregation_probe_rows_threshold` rows (default 100k), it checks
the measured group/row ratio and switches into skip mode if the ratio exceeds
`skip_partial_aggregation_probe_ratio_threshold` (default **0.8**).
The heuristic is too coarse for queries where partial aggregation is
**net-negative but doesn't hit the 0.8 bar**. Concrete example — ClickBench Q18:
```sql
SELECT "UserID",
extract(minute FROM to_timestamp_seconds("EventTime")) AS m,
"SearchPhrase",
COUNT(*)
FROM hits
GROUP BY "UserID", m, "SearchPhrase"
ORDER BY COUNT(*) DESC
LIMIT 10;
```
On the standard 100M-row `hits.parquet` (12 partitions, M-series MacBook,
release build, hot cache, median of 5 runs):
| Config | Q18 elapsed |
|---|---|
| Default (`ratio_threshold = 0.8`) | **2.72 s** |
| `ratio_threshold = 0.6`, `probe_rows = 5000` | **1.57 s** (**1.73×
faster**) |
`EXPLAIN ANALYZE` shows why — the measured ratio is **0.565** (≈ 50.88 M
groups / 90.12 M rows), well below the 0.8 cut-off, so partial aggregation
keeps running. It costs **17 s of compute** (summed across 12 partitions) and
reduces input by only ~40 %, then the final aggregate still has to chew through
60 M rows. With skipping, partial agg drops to **1.27 s** and the final stage
actually runs faster too (better cache locality without the partial-state
lookup overhead).
A single fixed threshold cannot capture this — the right decision depends on
the **absolute time saved by partial aggregation** vs the **absolute time spent
doing partial aggregation**, which varies with input rate, group state size,
hash function cost, and downstream work. A query with ratio 0.4 on small group
state may still benefit from partial; another with ratio 0.6 on heavy
variable-length keys may not.
## Describe the solution you'd like
Replace the fixed ratio check with a **cost-aware adaptive decision**:
1. **Measure both sides** during the probe window:
- `partial_agg_time_per_row` — wall time spent on hash-probe + insert per
input row
- `passthrough_time_per_row` — extrapolated time of just forwarding the
batch
- `output_rows_per_input_row` — actual reduction factor
2. **Estimate net benefit** = `passthrough_time × output_rows` (cost of
larger downstream) vs `partial_agg_time × input_rows` (cost of doing partial).
If partial is net-negative beyond a margin, switch to skip.
3. **Re-probe periodically** — every N batches re-evaluate, so a query whose
distribution shifts mid-stream can switch back.
4. **Bound the overhead** — the probe itself should add ≤ 1 % overhead in
the worst case.
This subsumes the current `ratio_threshold` (which can stay as a fallback /
safety check).
### Related code paths
- `datafusion/physical-plan/src/aggregates/row_hash.rs:441` —
`skip_aggregation_probe` field
- `datafusion/physical-plan/src/aggregates/row_hash.rs:635` — current
probe-and-switch logic
- `datafusion/common/src/config.rs` — `skip_partial_aggregation_probe_*`
options
- Tests for current behavior: `test_skip_aggregation_after_first_batch`,
`test_skip_aggregation_after_threshold` in `aggregates/mod.rs`
### Why this matters
- ClickBench Q18 (and likely Q35, other count-by-many-keys queries) hit this
regularly
- Removes a magic knob that users have to discover and tune per-query
- Aligns with the cost-based query optimization direction the optimizer is
moving toward elsewhere
## Describe alternatives you've considered
- **Lower the default `ratio_threshold` to 0.6** — simple one-line change
but still magic; would need full ClickBench validation to ensure no regression
elsewhere. Could be a stepping-stone before the adaptive version.
- **Per-aggregate-shape heuristic** (different threshold for different group
key types / aggregate functions) — more rules to maintain, doesn't generalize.
- **Pure pass-through (always skip)** — would regress low-cardinality cases
where partial is genuinely useful (Q35-shape, where ClientIP cardinality is
moderate and partial agg compacts effectively).
## Additional context
- Related issue: #13449 (Improve performance of ClickBench Q18, Q35)
- Related issue: #13729 (Improve Aggregate with Limit — orthogonal but in
the same area)
- Umbrella: #18489 ([EPIC] Make DataFusion top of ClickBench Parquet
leaderboard)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]