yaooqinn opened a new pull request, #55708:
URL: https://github.com/apache/spark/pull/55708
### What changes were proposed in this pull request?
Extend `NormalizeCTEIds` so that it normalizes `CTERelationRef.cteId` for
"orphan" references whose matching `CTERelationDef` is not present anywhere in
the input plan (i.e., the enclosing `WithCTE` lives outside the sub-tree being
normalized).
The change is intentionally minimal:
- `apply` first walks the plan to collect the ids of every reachable
`CTERelationDef`. References whose `cteId` is in that set are paired with their
def and remain handled by the existing `WithCTE` branch (no behaviour change).
References whose `cteId` is NOT in that set are treated as orphan and
pre-assigned a deterministic id from the same shared counter.
- A new `case ref: CTERelationRef` in `applyInternal` rewrites those orphan
refs to the assigned id; refs paired with a def are rewritten as before inside
`canonicalizeCTE` and never re-rewritten.
### Why are the changes needed?
`CacheManager.{cacheQuery, lookupCachedData}` run `QueryExecution.normalize`
(whose first rule is `NormalizeCTEIds`) so that two parses of the same SQL
canonicalize identically and cached entries can be reused. Today
`NormalizeCTEIds` only rewrites cteIds it discovers inside a `WithCTE` node.
When a caller probes `CacheManager` with a sub-tree that contains a
`CTERelationRef` but the corresponding `CTERelationDef` is not in the sub-tree,
the per-parse cteId leaks into the canonical form and `sameResult` returns
false on a structurally-identical re-parse, breaking cache lookups for that
sub-tree. This patch closes that gap so `NormalizeCTEIds` produces a
parse-stable canonical form for both wrapped and orphan CTE references.
This fix is scoped to the `NormalizeCTEIds` contract; it does not by itself
remove all sources of canonical instability across parses (e.g. operand
ordering of commutative expressions or canonicalization of subquery
broadcast/dynamic-pruning wrappers), which are tracked separately.
### Does this PR introduce _any_ user-facing change?
No. `NormalizeCTEIds` is an internal plan-normalization rule. Plans without
orphan `CTERelationRef` produce byte-identical output to before; plans with
orphan refs only see their `cteId` rewritten to a deterministic value.
### How was this patch tested?
New `NormalizeCTEIdsSuite` with two unit tests:
- existing behaviour: `WithCTE`-wrapped plans canonicalize identically
across parses (regression guard for the unchanged path).
- new behaviour: orphan `CTERelationRef` sub-trees with structurally
identical bodies but different per-parse `cteId` values canonicalize
identically and are considered `sameResult` after `NormalizeCTEIds`.
New `CacheManagerSuite` test that parses a `WITH inner_cte AS ..., outer_cte
AS (... (SELECT m FROM inner_cte))` SQL twice, extracts `outer_cte`'s body (an
orphan `CTERelationRef` sub-tree) and verifies `QueryExecution.normalize`
produces the same canonical form / `sameResult` across parses.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot CLI, model: Claude Opus 4.7 1M context
(claude-opus-4.7-1m-internal)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]