gengliangwang opened a new pull request, #55889:
URL: https://github.com/apache/spark/pull/55889
### What changes were proposed in this pull request?
Gate `CurrentOrigin.withOrigin(origin) { rule.applyOrElse(...) }` in four
`TreeNode` rule-driven transform methods behind `rule.isDefinedAt(...)`, so the
ThreadLocal wrap only runs when the rule actually fires:
- `TreeNode.transformDownWithPruning`
- `TreeNode.transformUpWithPruning`
- `TreeNode.transformUpWithBeforeAndAfterRuleOnChildren`
- `TreeNode.multiTransformDownWithPruning` (also drops a side-effecting
default closure that became unnecessary)
### Why are the changes needed?
`CurrentOrigin.withOrigin` is only observable when the rule constructs new
nodes — they pick up `CurrentOrigin.get` in their `override val origin` field.
On nodes the rule doesn't match, the wrap is pure overhead: two `ThreadLocal`
writes plus a `try`/`finally` per node visit.
JFR profiling (60s sample, 1.1M iterations of `transformDown` over a
1024-leaf balanced `Add` tree with a non-matching rule) shows:
- 66% of CPU samples in `ThreadLocalMap.set` (line 486)
- 13% in `ThreadLocalMap.getEntryAfterMiss`
- 9% more in `ThreadLocalMap.set` (line 493)
Total: **~88% of transform CPU spent inside `CurrentOrigin.withOrigin` for
nodes the rule never matched.**
Microbenchmark (best time per N iterations, JDK 17, Xeon 8175M @ 2.50GHz,
baseline = `upstream/master`):
| case | baseline | optimized |
speedup |
|----------------------------------------------|----------:|----------:|--------:|
| `transformDown` deep chain(5000) no-op | 12 ms | 6 ms |
2.0x |
| `transformDown` deep chain(5000) rewrite leaf| 20109 ms | 15850 ms |
1.27x |
| `transformDown` wide(100) no-op | 5 ms | 3 ms |
1.7x |
| `transformDown` balanced(1024) no-op | 7 ms | 2 ms |
3.5x |
| `transformDown` balanced(4096) no-op | 34 ms | 15 ms |
2.3x |
| `transformUp` deep chain(1000) no-op | 3 ms | 1 ms |
3.0x |
| `transformUp` deep chain(5000) no-op | 15 ms | 6 ms |
2.5x |
| `transformUp` balanced(1024) no-op | 8 ms | 3 ms |
2.7x |
| `transformUp` balanced(4096) no-op | 39 ms | 25 ms |
1.6x |
Rewrite-heavy cases are unchanged because `withOrigin` still runs when the
rule fires. Real Spark workloads (analyzer / optimizer batches running many
rules across many nodes, where each rule matches a small subset) are dominated
by the no-match case, so these savings should compound.
### Does this PR introduce _any_ user-facing change?
No. The change is internal to `TreeNode` rule machinery and preserves all
observable semantics:
- `CurrentOrigin` is still set before any rule body that constructs new
nodes runs.
- `markRuleAsIneffective` / `isRuleIneffective` bookkeeping unchanged.
- `copyTagsFrom` ordering on rule-replacement unchanged.
- `fastEquals` short-circuit unchanged.
- Result identity (`this eq result` on a no-op transform) preserved.
### How was this patch tested?
- `build/sbt 'catalyst/testOnly *TreeNodeSuite'` — 36 / 36 pass.
- `build/sbt 'catalyst/test'` — 9272 / 9272 pass across 352 suites (~7 min).
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic Claude Opus 4.7)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]