kosiew opened a new pull request, #20668:
URL: https://github.com/apache/datafusion/pull/20668
## Which issue does this PR close?
* Part of #20118.
---
## Rationale for this change
Logical `UNNEST` can sometimes be proven unnecessary when it does not
contribute any required output columns and when removing it cannot change query
results.
However, `UNNEST` is **not** always safe to remove:
* List/array unnest can change row cardinality (e.g., empty lists / nulls)
even if the unnested column is not referenced.
* Some operators/expressions (e.g., `count(*)`, certain windows) are
**multiplicity-sensitive**, meaning they can observe row-count changes.
* Volatile expressions (non-deterministic / time-dependent) must not have
their evaluation frequency changed by cardinality rewrites.
This PR adds strict, logical-level gating so `UNNEST` is eliminated **only**
when these semantic hazards are ruled out.
---
## What changes are included in this PR?
* **Track multiplicity/volatility context through projection pruning**
* Extend `RequiredIndices` with:
* `multiplicity_sensitive`: whether ancestor operators can observe
row-multiplicity changes.
* `has_volatile_ancestor`: whether any ancestor expression is volatile.
* Propagate volatility presence from each plan node
(`plan.expressions().any(Expr::is_volatile)`) into child requirements.
* Mark children as multiplicity-sensitive/insensitive depending on the
operator context (e.g., aggregate/window/join paths).
* **Logical pruning of `LogicalPlan::Unnest` when safe**
* Add `can_eliminate_unnest` + helpers:
* Require **not multiplicity-sensitive** and **no volatile ancestors**.
* Disallow elimination when list/array unnest is present (conservative
due to empty-list/null row-dropping semantics).
* Ensure all required output indices are passthrough mappings to input
columns via `dependency_indices` and qualified-field equality.
* If eligible, replace `Unnest` with its input and compute child
requirements using passthrough dependencies.
* Otherwise, keep existing behavior and make the child requirement
explicitly multiplicity-sensitive.
* **Tests**
* Unit tests:
* Eliminate struct unnest when only group keys are required.
* Keep list unnest even when only group keys are required.
* Keep unnest when aggregates depend on multiplicity (`count(1)` /
`count(*)`).
* Keep unnest when `preserve_nulls` is disabled.
* SQLLogicTest coverage (`optimizer_unnest_prune.slt`):
* `EXPLAIN` asserts `Unnest` is removed only for safe struct case.
* `EXPLAIN` asserts `Unnest` remains for list case and for
multiplicity-sensitive count.
* Correctness checks validate empty-list/null behavior and
multiplicity-sensitive counts.
---
## Are these changes tested?
Yes.
* Added **unit tests** in the optimizer module covering both positive (safe
elimination) and negative (must keep) scenarios.
* Added **SQLLogicTests** validating:
* Explain-plan rewrites (`Unnest` absent only in the safe struct case).
* Result correctness for empty-list/null semantics and
multiplicity-sensitive aggregates.
---
## Are there any user-facing changes?
Yes (query-plan / performance behavior).
* In eligible queries, `EXPLAIN` will no longer show `Unnest` and execution
may be faster due to avoiding unnecessary expansion.
* No SQL syntax or API changes.
* Semantics are preserved by construction via strict safety checks;
list/array cases remain conservative.
---
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content
has been manually reviewed and tested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]