yaooqinn opened a new pull request, #55708:
URL: https://github.com/apache/spark/pull/55708

   ### What changes were proposed in this pull request?
   
   Extend `NormalizeCTEIds` so that it normalizes `CTERelationRef.cteId` for 
"orphan" references whose matching `CTERelationDef` is not present anywhere in 
the input plan (i.e., the enclosing `WithCTE` lives outside the sub-tree being 
normalized).
   
   The change is intentionally minimal:
   - `apply` first walks the plan to collect the ids of every reachable 
`CTERelationDef`. References whose `cteId` is in that set are paired with their 
def and remain handled by the existing `WithCTE` branch (no behaviour change). 
References whose `cteId` is NOT in that set are treated as orphan and 
pre-assigned a deterministic id from the same shared counter.
   - A new `case ref: CTERelationRef` in `applyInternal` rewrites those orphan 
refs to the assigned id; refs paired with a def are rewritten as before inside 
`canonicalizeCTE` and never re-rewritten.
   
   ### Why are the changes needed?
   
   `CacheManager.{cacheQuery, lookupCachedData}` run `QueryExecution.normalize` 
(whose first rule is `NormalizeCTEIds`) so that two parses of the same SQL 
canonicalize identically and cached entries can be reused. Today 
`NormalizeCTEIds` only rewrites cteIds it discovers inside a `WithCTE` node. 
When a caller probes `CacheManager` with a sub-tree that contains a 
`CTERelationRef` but the corresponding `CTERelationDef` is not in the sub-tree, 
the per-parse cteId leaks into the canonical form and `sameResult` returns 
false on a structurally-identical re-parse, breaking cache lookups for that 
sub-tree. This patch closes that gap so `NormalizeCTEIds` produces a 
parse-stable canonical form for both wrapped and orphan CTE references.
   
   This fix is scoped to the `NormalizeCTEIds` contract; it does not by itself 
remove all sources of canonical instability across parses (e.g. operand 
ordering of commutative expressions or canonicalization of subquery 
broadcast/dynamic-pruning wrappers), which are tracked separately.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. `NormalizeCTEIds` is an internal plan-normalization rule. Plans without 
orphan `CTERelationRef` produce byte-identical output to before; plans with 
orphan refs only see their `cteId` rewritten to a deterministic value.
   
   ### How was this patch tested?
   
   New `NormalizeCTEIdsSuite` with two unit tests:
   - existing behaviour: `WithCTE`-wrapped plans canonicalize identically 
across parses (regression guard for the unchanged path).
   - new behaviour: orphan `CTERelationRef` sub-trees with structurally 
identical bodies but different per-parse `cteId` values canonicalize 
identically and are considered `sameResult` after `NormalizeCTEIds`.
   
   New `CacheManagerSuite` test that parses a `WITH inner_cte AS ..., outer_cte 
AS (... (SELECT m FROM inner_cte))` SQL twice, extracts `outer_cte`'s body (an 
orphan `CTERelationRef` sub-tree) and verifies `QueryExecution.normalize` 
produces the same canonical form / `sameResult` across parses.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: GitHub Copilot CLI, model: Claude Opus 4.7 1M context 
(claude-opus-4.7-1m-internal)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to