HyukjinKwon opened a new pull request, #56716:
URL: https://github.com/apache/spark/pull/56716
> **[DO-NOT-MERGE]**: draft used to stabilize a flaky CI test and validate the fix
> on a fork. The last commit adds temporary CI scaffolding (a focused workflow that
> re-runs only `SparkSessionE2ESuite` several times, and skips the full Scala matrix
> for the fork branch) and **must be dropped before any real merge**.
### What changes were proposed in this pull request?
Make the two `SparkSessionE2ESuite` "interrupt all" tests robust against two
flakiness sources:
1. **Class-fetch race.** Run each long-running typed `map` query through a single call site and
   warm it up once (`sleep=0`) before any interrupt. The first execution of a typed `map` ships
   the closure and its `TypeTag` artifact classes, and the executor fetches them on demand. When
   an `interruptAll()` lands during that first-time remote class fetch, it surfaces as
   `RemoteClassLoaderError` (`...SparkSessionE2ESuite$$typecreatorNN$1.class`) instead of
   `OPERATION_CANCELED`, failing the assertion. Warming up loads those classes on the executor so
   the interrupted run no longer races a class fetch.
2. **Leaked interruptor / cascade.** Wrap the foreground-interrupt test body in
   `try/finally { finished = true }`. Previously, if an assertion failed, the background
   `interruptor` Future kept calling `interruptAll()` for up to 20s and canceled the operations of
   *subsequent* tests in the suite, turning one failure into a cascade of `OPERATION_CANCELED`
   failures across the whole suite.
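
The two fixes above can be illustrated with a minimal, Spark-free sketch. It is not the code from this PR: `InterruptTestSketch`, `runQuery`, `runWithInterruptor`, and the `interruptCalls` counter are hypothetical stand-ins (the counter takes the place of `interruptAll()`, and `runQuery` takes the place of the typed `map` query), and the 2s polling deadline is shortened from the 20s mentioned above so the example finishes quickly.

```scala
import java.util.concurrent.atomic.{AtomicBoolean, AtomicInteger}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object InterruptTestSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-in for interruptAll(): only counts invocations.
  val interruptCalls = new AtomicInteger(0)

  // Single call site for the "query". In the real suite this would be the
  // typed map over a Dataset; here it only sleeps (assumption for the demo).
  def runQuery(sleepMs: Long): Long = {
    Thread.sleep(sleepMs)
    sleepMs
  }

  // Runs `body` while a background interruptor polls.
  // Returns true if the interruptor stopped because `finished` was set,
  // false if it kept polling until its deadline (the "leaked" case).
  def runWithInterruptor(body: () => Unit): Boolean = {
    // Warm-up through the same call site, before any interrupt can fire.
    runQuery(0)

    val finished = new AtomicBoolean(false)
    val interruptor = Future {
      val deadline = System.nanoTime() + 2.seconds.toNanos
      while (!finished.get() && System.nanoTime() < deadline) {
        interruptCalls.incrementAndGet()
        Thread.sleep(10)
      }
      finished.get()
    }

    try {
      // The finally clause releases the interruptor even when body throws.
      try body()
      finally finished.set(true)
    } catch {
      // Swallowed only so the demo can inspect the interruptor's result;
      // a real test would let the failure propagate.
      case _: RuntimeException => ()
    }
    Await.result(interruptor, 5.seconds)
  }
}
```

With the `finally` in place, a failing body still stops the interruptor promptly. Without it, the loop in this sketch would keep firing until its deadline, which is the behavior that leaked into later tests.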
### Why are the changes needed?
`SparkSessionE2ESuite` intermittently fails in the master push `Build (SBT)` and Maven
(Scala 2.13, JDK 21/25) jobs: a single `RemoteClassLoaderError` in
`interrupt all - foreground queries, background interrupt` cascaded into ~7 failures in the
suite. The failure is confirmed flaky: the same module group passes on other runs of the same
commit.
### Does this PR introduce any user-facing change?
No, test-only.
### How was this patch tested?
Re-ran `SparkSessionE2ESuite` repeatedly via a focused fork workflow (8 iterations) plus the full
connect module once. (The CI scaffolding commit is `[DO-NOT-MERGE]` and will be removed.)
### Was this patch authored or co-authored using generative AI tooling?
Yes, drafted with Claude Code.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]