gengliangwang opened a new pull request, #56252:
URL: https://github.com/apache/spark/pull/56252

   ### What changes were proposed in this pull request?
   
   `UnionExec` whole-stage codegen fusion (SPARK-56482) kept per-emission 
codegen state in mutable instance fields on the plan node: 
`currentEmittingChild` (set in `doProduce`, read in `doConsume` to pick a 
child's projection) and `numOutputRowsTerm` (the once-per-stage metric term). 
This PR moves both fields to `ThreadLocal`, isolating the state to the single 
thread that runs a given `doCodeGen` pass.
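
A minimal sketch of the pattern described above, not the actual `UnionExec` code: `FusedNodeSketch` and its string output are hypothetical stand-ins, and only the names `currentEmittingChild`, `doProduce`, and `doConsume` come from the description. It assumes the state can be cleared when the pass ends.

```scala
// Hypothetical stand-in for a plan node; this is NOT the real UnionExec.
final class FusedNodeSketch(numChildren: Int) {

  // Per-thread index of the child currently being emitted.
  // -1 means "outside the doProduce emission window".
  private val currentEmittingChild = new ThreadLocal[Int] {
    override def initialValue(): Int = -1
  }

  // Emits each child in turn; doConsume runs inline on the same thread.
  def doProduce(): Seq[String] =
    try {
      (0 until numChildren).map { i =>
        currentEmittingChild.set(i)
        doConsume()
      }
    } finally {
      // Clear this thread's state so the next get() sees -1 again.
      currentEmittingChild.remove()
    }

  // Reads the calling thread's own copy, so another thread's doProduce
  // cannot reset it mid-pass.
  def doConsume(): String = {
    val child = currentEmittingChild.get()
    require(child >= 0, "doConsume invoked outside doProduce emission window")
    s"project_child_$child"
  }
}
```

Because each thread sees its own copy of the index, a reset performed by one pass is invisible to a pass running on another thread.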
   
   ### Why are the changes needed?
   
   A single `UnionExec` instance can have its whole-stage codegen driven by 
more than one thread at the same time: a reused exchange/subquery stage is 
generated concurrently with the main plan, and async subquery / 
dynamic-partition-pruning execution can overlap a driver-side `doCodeGen`. With 
the shared mutable field, a racing `doProduce` resets `currentEmittingChild` to 
`-1` while another thread is still inside `doConsume`, tripping:
   
   ```
   java.lang.IllegalArgumentException: requirement failed:
     UnionExec.doConsume invoked outside doProduce emission window
   ```
   
   This surfaced as a flaky `LogicalPlanTagInSparkPlanSuite.q2` failure (q2 
contains a `UNION`, and union fusion is enabled by default). Each `doCodeGen` 
pass is itself single-threaded (`produce` -> `doConsume` run inline on one 
thread), so keeping the state in a `ThreadLocal` gives each pass its own copy 
and removes the cross-thread race, while preserving the existing per-stage 
semantics (the metric term is still computed once per pass).
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. It removes an intermittent internal code-generation failure; the 
generated code and query results are unchanged.
   
   ### How was this patch tested?
   
   Added a `UnionCodegenSuite` test, "SPARK-57196: concurrent codegen of a 
shared UnionExec stage is thread-safe", that drives `doCodeGen()` on one shared 
fused `UnionExec` stage from 8 threads. It reproduces the "outside doProduce 
emission window" failure on the unpatched code and passes with this fix. Also 
verified the full `UnionCodegenSuite` (43 tests), its ANSI/AQE variants, and 
`LogicalPlanTagInSparkPlanSuite` q2 all pass.
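
The shape of such a concurrency test might look roughly like the sketch below. It does not use Spark: `ConcurrentPassSketch`, `pass`, and `runConcurrently` are hypothetical names, and `pass` only imitates a single-threaded codegen pass that sets and reads thread-local state. The thread count of 8 in the usage note mirrors the test described above.

```scala
import java.util.concurrent.{Callable, CountDownLatch, Executors, TimeUnit}
import scala.util.Try

// Hypothetical harness; the real test drives doCodeGen() on a UnionExec stage.
object ConcurrentPassSketch {

  // Per-thread state standing in for the node's emission state.
  private val emitting = new ThreadLocal[Int] {
    override def initialValue(): Int = -1
  }

  // One single-threaded "pass": set the state, read it back, then clear it.
  def pass(numChildren: Int): Seq[Int] =
    try {
      (0 until numChildren).map { i =>
        emitting.set(i)
        Thread.`yield`() // encourage interleaving with other threads
        val seen = emitting.get()
        require(seen == i, s"expected child $i but saw $seen")
        seen
      }
    } finally emitting.remove()

  // Runs `body` on `threads` threads released together and collects
  // each thread's outcome as a Try.
  def runConcurrently[T](threads: Int)(body: => T): Seq[Try[T]] = {
    val pool = Executors.newFixedThreadPool(threads)
    val start = new CountDownLatch(1)
    try {
      val futures = (1 to threads).map { _ =>
        pool.submit(new Callable[Try[T]] {
          override def call(): Try[T] = { start.await(); Try(body) }
        })
      }
      start.countDown() // release all workers at once
      futures.map(_.get(30, TimeUnit.SECONDS))
    } finally pool.shutdownNow()
  }
}
```

Calling `ConcurrentPassSketch.runConcurrently(8)(ConcurrentPassSketch.pass(4))` and checking that every result is a success is the analogue of the assertion in the real test. With a plain shared `var` in place of the `ThreadLocal`, the `require` could fail intermittently.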
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Opus 4.8)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

