Dandandan opened a new pull request, #22389:
URL: https://github.com/apache/datafusion/pull/22389

   ## Summary
   
   Long chains of `ProjectionExec`s (and their `LogicalPlan::Projection` 
counterparts) — e.g. queries shaped like:
   
   ```sql
   WITH s0 AS (SELECT ... FROM t LEFT JOIN ...),
        s1 AS (SELECT *, <CASE-ladder over s0.col> AS d1 FROM s0),
        s2 AS (SELECT *, <CASE-ladder over s0.col> AS d2 FROM s1),
        ...
        sN AS (SELECT *, <CASE-ladder over s0.col> AS dN FROM s_{N-1})
   SELECT * FROM sN
   ```
   
   were being collapsed one level at a time. This forced **O(N) intermediate 
`ProjectionExec` constructions** (each recomputing equivalence properties 
through its projection mapping) on the physical side, and **O(N) re-runs of the 
`OptimizeProjections` rule** by the outer optimizer fixpoint loop on the 
logical side.
   
   ## Fix
   
   **Physical** (`datafusion/physical-plan/src/projection.rs`):
   - Replaced the pairwise recursion inside 
`ProjectionExec::try_swapping_with_projection` with 
`try_collapse_projection_chain`, which walks the entire run of consecutive 
`ProjectionExec`s iteratively and builds **one** final `ProjectionExec` instead 
of N-1 intermediates.
   - The `is_projection_removable` / merge-is-beneficial guards from the old 
`try_unifying_projections` are preserved at each step.
   - Leaf pushdown into a non-`Projection` input (e.g. `DataSourceExec`) is 
preserved by handing the unified projection back to 
`remove_unnecessary_projections` once at the end.
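
The iterative collapse described above can be illustrated with a small toy model. This is a minimal sketch, not the PR's code: `Expr`, `Projection`, `substitute`, `collapse_chain`, `eval`, and `apply` are hypothetical names invented for illustration, and the sketch leaves out the `is_projection_removable` / merge-is-beneficial guards and the equivalence-property bookkeeping that the real `ProjectionExec` carries.

```rust
// Toy expression type. Hypothetical; DataFusion's physical expressions are far richer.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Column(usize), // index into the output of the layer below
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

// One projection layer: output column i is exprs[i] evaluated on the layer below.
type Projection = Vec<Expr>;

// Rewrite `expr`, which refers to the outputs of `inner`, so that it refers
// to the inputs of `inner` instead.
fn substitute(expr: &Expr, inner: &[Expr]) -> Expr {
    match expr {
        Expr::Column(i) => inner[*i].clone(),
        Expr::Literal(v) => Expr::Literal(*v),
        Expr::Add(l, r) => Expr::Add(
            Box::new(substitute(l, inner)),
            Box::new(substitute(r, inner)),
        ),
    }
}

// Walk the whole chain (outermost projection first) in one loop and return a
// single merged projection. Only expression lists are rebuilt per step; no
// intermediate operator is constructed.
fn collapse_chain(chain: &[Projection]) -> Option<Projection> {
    let (outer, rest) = chain.split_first()?;
    let mut merged = outer.clone();
    for inner in rest {
        merged = merged.iter().map(|e| substitute(e, inner)).collect();
    }
    Some(merged)
}

// Evaluate an expression against one input row (used to check equivalence).
fn eval(expr: &Expr, row: &[i64]) -> i64 {
    match expr {
        Expr::Column(i) => row[*i],
        Expr::Literal(v) => *v,
        Expr::Add(l, r) => eval(l, row) + eval(r, row),
    }
}

// Apply one projection layer to a row.
fn apply(projection: &[Expr], row: &[i64]) -> Vec<i64> {
    projection.iter().map(|e| eval(e, row)).collect()
}
```

In this model, applying the layers one by one (innermost first) and applying the single collapsed projection give the same row, which is the property the real rewrite has to preserve.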
   
   **Logical** (`datafusion/optimizer/src/optimize_projections/mod.rs`):
   - Wrapped `merge_consecutive_projections` in an internal loop so an N-deep 
chain of `LogicalPlan::Projection` collapses in a single rule application.
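
As a rough illustration of the "internal loop" idea, the sketch below repeats a pairwise merge until nothing is left to merge, so a single call flattens the whole chain. It is a toy model under stated assumptions: `Plan`, `Expr`, `merge_top_pair`, `merge_all`, and `projection_depth` are hypothetical names, and the real `merge_consecutive_projections` applies additional checks that are omitted here.

```rust
// Toy expression and plan types, hypothetical stand-ins for the logical plan.
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Column(usize),
    Neg(Box<Expr>),
}

#[derive(Clone, Debug, PartialEq)]
enum Plan {
    Scan,
    Projection { exprs: Vec<Expr>, input: Box<Plan> },
}

// Rewrite `expr` over the outputs of `inner` into an expression over its inputs.
fn substitute(expr: &Expr, inner: &[Expr]) -> Expr {
    match expr {
        Expr::Column(i) => inner[*i].clone(),
        Expr::Neg(e) => Expr::Neg(Box::new(substitute(e, inner))),
    }
}

// One pairwise step: merge the top projection with a projection directly
// below it. Returns Err with the unchanged plan when there is nothing to merge.
fn merge_top_pair(plan: Plan) -> Result<Plan, Plan> {
    match plan {
        Plan::Projection { exprs, input } => match *input {
            Plan::Projection { exprs: inner, input: grandchild } => Ok(Plan::Projection {
                exprs: exprs.iter().map(|e| substitute(e, &inner)).collect(),
                input: grandchild,
            }),
            other => Err(Plan::Projection { exprs, input: Box::new(other) }),
        },
        other => Err(other),
    }
}

// The internal loop: keep merging inside one call instead of relying on an
// outer fixpoint driver to call the rule again. Also reports the merge count.
fn merge_all(mut plan: Plan) -> (Plan, usize) {
    let mut merges = 0;
    loop {
        match merge_top_pair(plan) {
            Ok(merged) => {
                plan = merged;
                merges += 1;
            }
            Err(unchanged) => return (unchanged, merges),
        }
    }
}

// Count how many projections are stacked at the top of the plan.
fn projection_depth(plan: &Plan) -> usize {
    match plan {
        Plan::Projection { input, .. } => 1 + projection_depth(input),
        Plan::Scan => 0,
    }
}
```

With four stacked projections over a scan, one `merge_all` call performs three merges and leaves a single projection, whereas a one-merge-per-pass rule would need the outer driver to invoke it repeatedly.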
   
   ## Benchmark
   
   New `physical_plan_chained_case_projection_hotspot` bench in 
`sql_planner_extended.rs`. Models the shape above: LEFT JOIN + 30-wide OR 
filter + 80 chained CTE projections, each adding one column defined by a 
depth-23 nested `CASE` ladder over the same input column.
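
For readers who want to reproduce the query shape outside the bench, a generator along the following lines could be used. This is a sketch only and is not the code in `sql_planner_extended.rs`: the table names `t` and `u`, the columns `c0` and `id`, and the functions `case_ladder` and `chained_cte_query` are all assumptions made for illustration, and the 30-wide OR filter is left out.

```rust
// Build a nested CASE ladder of `depth` levels over `col`.
// The exact ladder used by the bench is not shown in this message; this
// shape is an assumption.
fn case_ladder(col: &str, depth: usize) -> String {
    let mut expr = String::from("0");
    for level in (1..=depth).rev() {
        expr = format!(
            "CASE WHEN {} = {} THEN {} ELSE {} END",
            col, level, level, expr
        );
    }
    expr
}

// Build `steps` chained CTEs, each adding one column defined by the same
// ladder over the same base column, on top of a LEFT JOIN base CTE.
// Table and column names are hypothetical.
fn chained_cte_query(steps: usize, depth: usize) -> String {
    let ladder = case_ladder("c0", depth);
    let mut sql =
        String::from("WITH s0 AS (SELECT t.c0 FROM t LEFT JOIN u ON t.id = u.id)");
    for i in 1..=steps {
        sql.push_str(&format!(
            ",\n     s{} AS (SELECT *, {} AS d{} FROM s{})",
            i,
            ladder,
            i,
            i - 1
        ));
    }
    sql.push_str(&format!("\nSELECT * FROM s{}", steps));
    sql
}
```

Calling `chained_cte_query(80, 23)` would produce a query with the same overall structure as the hotspot configuration, minus the OR filter.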
   
   | Config | Before | After | Δ |
   |---|---:|---:|---|
   | N=80, depth=23, OR=30 (hotspot) | 623 ms | **364 ms** | **−41.5%** (CI [−41.83%, −41.33%], p<0.05) |
   | N=80, depth=45 (parser ceiling) | multi-second (the original blowup) | **~1.5 s** | n/a |
   
   A small sweep group (`physical_plan_chained_case_projection_sweep`) covers 
steps ∈ {10, 20, 40} × depth ∈ {5, 10, 15} for tracking regressions in this 
shape over time.
   
   ## Verification
   
   - `cargo test -p datafusion-physical-plan --lib projection` — 24/24 passing
   - `cargo test -p datafusion-optimizer --lib optimize_projections` — 54/54 
passing
   - `cargo test -p datafusion --test core_integration 
physical_optimizer::projection_pushdown` — 23/23 passing (no snapshots changed)
   
   ## Test plan
   
   - [x] Existing projection unit tests
   - [x] Existing physical optimizer integration tests
   - [ ] CI runs the new bench
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

