HyukjinKwon opened a new pull request, #56716:
URL: https://github.com/apache/spark/pull/56716

   > **[DO-NOT-MERGE]** Draft used to stabilize a flaky CI test and validate the fix
   > on a fork. The last commit adds temporary CI scaffolding (a focused workflow that
   > re-runs only `SparkSessionE2ESuite` several times and skips the full Scala matrix
   > for the fork branch) and **must be dropped before any real merge**.
   
   ### What changes were proposed in this pull request?
   
   Make the two `SparkSessionE2ESuite` "interrupt all" tests robust against two
   sources of flakiness:
   
   1. **Class-fetch race.** Run each long-running typed `map` query through a single
      call site and warm it up once (`sleep=0`) before any interrupt. The first
      execution of a typed `map` ships the closure and its `TypeTag` artifact classes,
      and the executor fetches them on demand. When an `interruptAll()` lands during
      that first-time remote class fetch, the query fails with
      `RemoteClassLoaderError` (`...SparkSessionE2ESuite$$typecreatorNN$1.class`)
      instead of `OPERATION_CANCELED`, and the assertion fails. The warm-up loads
      those classes on the executor, so the interrupted run no longer races a class
      fetch.
   
   2. **Leaked interruptor / cascade.** Wrap the foreground-interrupt test body in
      `try/finally { finished = true }`. Previously, if an assertion failed, the
      background `interruptor` Future kept calling `interruptAll()` for up to 20s and
      canceled the operations of *subsequent* tests in the suite, turning one failure
      into a cascade of `OPERATION_CANCELED` failures across the whole suite.
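
The `try/finally` guard in item 2 can be sketched without Spark. This is a minimal illustration, not the suite's actual code: `interruptAll()` here is a hypothetical stand-in that only counts calls, `withBackgroundInterruptor` is an invented helper name, and waiting for the Future inside `finally` is an extra assumption that the PR description does not mention.

```scala
import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object InterruptorSketch {
  // Hypothetical stand-in for the session's interruptAll(); it only counts calls.
  val interruptCalls = new AtomicInteger(0)
  def interruptAll(): Unit = interruptCalls.incrementAndGet()

  // Runs `body` while a background Future keeps calling interruptAll().
  // The finally block stops the Future even when `body` throws.
  def withBackgroundInterruptor(body: => Unit): Unit = {
    val finished = new AtomicBoolean(false)
    val interruptor = Future {
      while (!finished.get()) {
        interruptAll()
        Thread.sleep(10)
      }
    }
    try {
      body
    } finally {
      finished.set(true)
      // Assumption: also wait for the loop to exit so nothing leaks past this call.
      Await.ready(interruptor, 5.seconds)
    }
  }
}
```

If `body` throws, for example because an assertion fails, the flag is still set and the loop stops, so later tests do not see stray interrupts.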
   
   ### Why are the changes needed?
   
   `SparkSessionE2ESuite` intermittently fails on master in the push-triggered
   Build (SBT) and Maven (Scala 2.13, JDK 21/25) jobs: a single
   `RemoteClassLoaderError` in `interrupt all - foreground queries, background
   interrupt` cascaded into ~7 failures in the suite. The failure is confirmed flaky:
   the same module group passes on other runs of the same commit.
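
The reason a warm-up removes the `RemoteClassLoaderError` window can be shown with a small simulation. Everything in it is hypothetical: `runQuery`, the `loaded` cache, and the `fetches` counter only model "the first run of a call site fetches classes, later runs do not". They are not Spark APIs and do not reproduce the real class loader.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.concurrent.TrieMap

object WarmUpSketch {
  // Hypothetical executor-side cache: call sites whose classes are already loaded.
  private val loaded = TrieMap.empty[String, Boolean]
  // Counts simulated remote class fetches.
  val fetches = new AtomicInteger(0)

  // Simulates one run of a typed map query issued from `callSite`.
  // Returns true when this run had to fetch classes, which is the window
  // in which an interrupt would surface as a class-loading error.
  def runQuery(callSite: String, sleepMs: Long): Boolean = {
    val firstRun = loaded.putIfAbsent(callSite, true).isEmpty
    if (firstRun) fetches.incrementAndGet()
    Thread.sleep(sleepMs) // models the long-running body of the query
    firstRun
  }
}
```

Calling `WarmUpSketch.runQuery("q", 0)` once before the interrupted run plays the role of the `sleep=0` warm-up: the later long-running call from the same call site finds its classes already loaded, so an interrupt can only hit the query itself.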
   
   ### Does this PR introduce any user-facing change?
   
   No, test-only.
   
   ### How was this patch tested?
   
   Re-ran `SparkSessionE2ESuite` repeatedly via a focused fork workflow (8 iterations),
   plus the full connect module once. (The CI scaffolding commit is marked
   `[DO-NOT-MERGE]` and will be removed.)
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes, drafted with Claude Code.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

