peter-toth opened a new pull request, #55659:
URL: https://github.com/apache/spark/pull/55659

   ### What changes were proposed in this pull request?
   
   Four bug fixes and two small cleanups in `PlanMerger`:
   
   Bug fixes in `PlanMerger`:
   
   1. Tagged `(Filter, Filter)` reuse preserves `mergedChild`'s appended 
columns:
   When the reuse check finds an existing `propagatedFilter` alias, the branch 
now rebuilds the Filter over `mergedChild` (via 
`cp.withNewChildren(Seq(mergedChild))`) instead of returning `cp` unchanged. If 
the recursion extended `cp.child`'s output with new columns (e.g. a computed `d 
= a + b` from a user Project below the Filter), returning `cp` would drop those 
columns while `npMapping` still pointed into them, leaving the enclosing 
`Aggregate` with unresolved references.
   
   2. `(np: Filter, cp)` create-new does not re-append `cpFilter`:
   `cpFilter`, when set, was produced by a deeper `(np, cp: Filter)` (or 
`(Join, Join)` pass-through) and is already part of `mergedChild`'s output. 
Appending it a second time via `++ cpFilter.toSeq` duplicated the attribute in 
the outer Project's projectList.
   
   3. `(np, cp: Filter)` create-new does not re-append `npFilter`:
   Symmetric to 2. on the np side.
   
   4. `(np, cp: Filter)` with a `MERGED_FILTER_TAG`-tagged `cp` drops the 
tagged Filter:
   `cp`'s condition is `OR(pf_0, pf_1, ...)` and `cp`'s aggregate expressions 
already carry individual `FILTER (WHERE pf_i)` clauses. Synthesising a new 
`propagatedFilter_X = OR(pf_0, pf_1, ...)` would just add `FILTER AND(OR(...), 
pf_i)` wrapping upstream (simplifying to `FILTER pf_i`) plus a redundant alias 
in the Project. The branch now drops `cp`'s Filter and returns `cpFilter = 
None` so `cp`'s aggregates are left untouched.
   
   Cleanups in `PlanMerger.merge()`:
   
   - Unify the local variable name to `newMergedPlan` across all three branches 
(was `newMergedPlan` in one and `newMergePlan` in the other two) -- matches the 
`MergedPlan` case class name.
   - Replace `cache(i).merged` with `mp.merged`; `mp` and `cache(i)` are the 
same object inside the `collectFirst` pattern.
   
   ### Why are the changes needed?
   
   Fix 1. is a correctness bug. Fixes 2.-4. are plan-shape bugs that produce 
duplicated attributes or redundant `OR`-of-propagated-filter aliases in the 
merged plan. The cleanups are minor readability improvements.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. All changes are internal to the optimizer; they produce cleaner merged 
plans for queries that `MergeSubplans` already handled.
   
   ### How was this patch tested?
   
   Four new tests in `MergeSubplansSuite`, one per fix:
   
   - `(np: Filter, cp)` create-new must not re-append cpFilter into the Project 
-- exercises 2. via a `Join` with a `Filter` on the right child, routing a 
cpFilter up through `(Join, Join)` so that `mergedChild.output` already 
contains the attribute the branch used to re-append.
   - `(np, cp: Filter)` create-new must not re-append npFilter into the Project 
-- exercises 3., mirror shape on the np side.
   - tagged `(Filter, Filter)` reuse must keep mergedChild's appended columns 
-- exercises 1. with three subqueries (sq1/sq2 create the tagged structure; 
sq3's Filter sits above a user Project introducing `d = a + b`, so the 
`(Filter, Filter)` tagged recursion extends `mergedChild` with `d`).
   - `(np, cp: Filter)` drops a tagged cp Filter without synthesising a 
redundant alias -- exercises 4. with three subqueries (sq1/sq2 create the 
tagged structure; sq3 has no filter).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Opus 4.7
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to