cloud-fan opened a new pull request, #56411: URL: https://github.com/apache/spark/pull/56411
### What changes were proposed in this pull request? Make the `KeyedPartitioning` invariant enforcement added in #55901 (SPARK-56877) cheap for deeply nested `PartitioningCollection`s: - `checkKeyedPartitioningInvariant()` no longer walks the entire partitioning tree with `TreeNode.foreach` on every construction. Each collection now exposes a cached representative (`firstKeyedPartitioning`); since every nested collection already enforced the invariant on its own construction, comparing one representative per direct member validates the whole subtree by induction. The check is now O(partitionings.size) instead of O(subtree size). - `PartitioningCollection.fromPartitionings` no longer rebuilds nested collections that are already consistent. Using the same representative, a nested collection with no `KeyedPartitioning` in its subtree, or whose canonical `partitionKeys` reference already matches, is returned as-is in O(1). It only recurses and rebuilds when interning is actually needed. The invariant itself is unchanged and still enforced in the constructor. ### Why are the changes needed? SPARK-56877 caused a planning-time regression for plans that chain many shuffle joins on the same key. For an inner join, `ShuffledJoin.outputPartitioning` wraps the children's partitionings in a `PartitioningCollection`, so a chain of N same-key joins nests collections N levels deep, and `outputPartitioning` is a `def` recomputed on every access. With the constructor running a full-subtree walk and `fromPartitionings` rebuilding every nested level (re-triggering the walk at each level), a single `outputPartitioning` evaluation at depth N went from O(N) to O(N^3). On a benchmark query chaining 125 shuffle hash joins, `EnsureRequirements`-phase planning time grew ~9x (312 ms to 2779 ms), regressing end-to-end query time by ~33%. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests in `DistributionSuite`: - `fromPartitionings` returns already-consistent nested collections reference-equal (guards against reintroducing the rebuild), - `fromPartitionings` still interns structurally-equal-but-reference-distinct `partitionKeys` across nesting, - the constructor still rejects `partitionKeys` reference mismatches and expression arity mismatches through nesting. Existing suites covering the invariant and interning behavior pass: `DistributionSuite`, `EnsureRequirementsSuite`, `ProjectedOrderingAndPartitioningSuite`, `GroupPartitionsExecSuite`. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Fable 5) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
