juliuszsompolski opened a new pull request, #44095:
URL: https://github.com/apache/spark/pull/44095
### What changes were proposed in this pull request?
A race condition sending interrupt (or releaseSession) just after execute
could cause:
```
[info] org.apache.spark.SparkException: java.lang.IllegalStateException:
[info] operationId: fa653330-587a-41c7-a4a9-7987f070dcba with status
Started
[info] is not within statuses List(Finished, Failed, Canceled) for
event Closed
```
and
```
[info] org.apache.spark.SparkException: java.lang.IllegalStateException:
[info] operationId: fa653330-587a-41c7-a4a9-7987f070dcba with status
Closed
[info] is not within statuses List(Started, Analyzed,
ReadyForExecution, Finished, Failed) for event Canceled
```
when the interrupt arrives before the thread in ExecuteThreadRunner is
started.
This would cause in ExecuteHolder close:
```
runner.interrupt() <- just sets interrupted = true
runner.join() <- thread didn't start yet, so join() returns
immediately, doesn't wait for thread to be actually interrupted
...
eventsManager.postClosed() <- causes the first failure, because thread
wasn't running and didn't move to Canceled
```
Afterwards, assuming we allowed the transition, the thread will get started,
and then want to immediately move to Canceled, notice the `interrupted` state.
Then it would hit the 2nd error, not allowing Canceled after Closed.
While we could consider allowing the first transition (Started -> Closed),
we don't want any events to be coming after Closed, so that listeners can clean
their state after Closed.
Fix is to handle interrupts coming before the thread started, and then
prevent the thread from even starting if it was interruped.
### Why are the changes needed?
This was detected after grpc 1.56 to 1.59 upgrade and causes some tests in
SparkConnectServiceE2ESuite and ReattachableExecuteSuite to be flaky.
With the grpc upgrade, execute is eagerly sent to the server, and in some
test we cleanup and release the session without waiting for the execution to
start. This has triggered this flakiness.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Test added. The test depends on timing, so may not fail reliably but only
from time to time.
### Was this patch authored or co-authored using generative AI tooling?
Github Copilot was assisting in some boilerplate auto-completion.
Generated-by: Github Copilot
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]