[PR] OptimizeProjections: prune struct-only UNNEST when outputs are unused and ancestors are multiplicity-insensitive [datafusion]

via GitHub Tue, 03 Mar 2026 04:30:29 -0800


kosiew opened a new pull request, #20668:
URL: https://github.com/apache/datafusion/pull/20668


   ## Which issue does this PR close?
   
   * Part of #20118.
   
   ---
   
   ## Rationale for this change
   
   Logical `UNNEST` can sometimes be proven unnecessary when it does not 
contribute any required output columns and when removing it cannot change query 
results.
   
   However, `UNNEST` is **not** always safe to remove:
   
   * List/array unnest can change row cardinality (e.g., empty lists / nulls) 
even if the unnested column is not referenced.
   * Some operators/expressions (e.g., `count(*)`, certain windows) are 
**multiplicity-sensitive**, meaning they can observe row-count changes.
   * Volatile expressions (non-deterministic / time-dependent) must not have 
their evaluation frequency changed by cardinality rewrites.
   
   This PR adds strict, logical-level gating so `UNNEST` is eliminated **only** 
when these semantic hazards are ruled out.
   
   ---
   
   ## What changes are included in this PR?
   
   * **Track multiplicity/volatility context through projection pruning**
   
     * Extend `RequiredIndices` with:
   
       * `multiplicity_sensitive`: whether ancestor operators can observe 
row-multiplicity changes.
       * `has_volatile_ancestor`: whether any ancestor expression is volatile.
     * Propagate volatility presence from each plan node 
(`plan.expressions().any(Expr::is_volatile)`) into child requirements.
     * Mark children as multiplicity-sensitive/insensitive depending on the 
operator context (e.g., aggregate/window/join paths).
   
   * **Logical pruning of `LogicalPlan::Unnest` when safe**
   
     * Add `can_eliminate_unnest` + helpers:
   
       * Require **not multiplicity-sensitive** and **no volatile ancestors**.
       * Disallow elimination when list/array unnest is present (conservative 
due to empty-list/null row-dropping semantics).
       * Ensure all required output indices are passthrough mappings to input 
columns via `dependency_indices` and qualified-field equality.
     * If eligible, replace `Unnest` with its input and compute child 
requirements using passthrough dependencies.
     * Otherwise, keep existing behavior and make the child requirement 
explicitly multiplicity-sensitive.
   
   * **Tests**
   
     * Unit tests:
   
       * Eliminate struct unnest when only group keys are required.
       * Keep list unnest even when only group keys are required.
       * Keep unnest when aggregates depend on multiplicity (`count(1)` / 
`count(*)`).
       * Keep unnest when `preserve_nulls` is disabled.
     * SQLLogicTest coverage (`optimizer_unnest_prune.slt`):
   
       * `EXPLAIN` asserts `Unnest` is removed only for safe struct case.
       * `EXPLAIN` asserts `Unnest` remains for list case and for 
multiplicity-sensitive count.
       * Correctness checks validate empty-list/null behavior and 
multiplicity-sensitive counts.
   
   ---
   
   ## Are these changes tested?
   
   Yes.
   
   * Added **unit tests** in the optimizer module covering both positive (safe 
elimination) and negative (must keep) scenarios.
   * Added **SQLLogicTests** validating:
   
     * Explain-plan rewrites (`Unnest` absent only in the safe struct case).
     * Result correctness for empty-list/null semantics and 
multiplicity-sensitive aggregates.
   
   ---
   
   ## Are there any user-facing changes?
   
   Yes (query-plan / performance behavior).
   
   * In eligible queries, `EXPLAIN` will no longer show `Unnest` and execution 
may be faster due to avoiding unnecessary expansion.
   * No SQL syntax or API changes.
   * Semantics are preserved by construction via strict safety checks; 
list/array cases remain conservative.
   
   ---
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] OptimizeProjections: prune struct-only UNNEST when outputs are unused and ancestors are multiplicity-insensitive [datafusion]

Reply via email to