anishshri-db opened a new pull request, #42504:
URL: https://github.com/apache/spark/pull/42504
### What changes were proposed in this pull request?
Fix race for pending interrupt issued before taskThread is initialized
### Why are the changes needed?
There is a race for tasks that are interrupted through stage cancellation:
such a task may already have been added to the TaskSet but not yet have its
taskThread initialized.
To handle stage cancellation, we try to kill the ongoing task attempts:
```
logInfo("Cancelling stage " + stageId)
// Kill all running tasks for the stage.
killAllTaskAttempts(stageId, interruptThread, reason = "Stage cancelled: " + reason)
// Cancel all attempts for the stage.
```
However, there is a chance that taskThread is not initialized yet, in which
case we only set reasonIfKilled and never interrupt the thread:
```
def kill(interruptThread: Boolean, reason: String): Unit = {
  require(reason != null)
  _reasonIfKilled = reason
  if (context != null) {
    context.markInterrupted(reason)
  }
  if (interruptThread && taskThread != null) {
    taskThread.interrupt() // <--- never hit
  }
}
```
Then within the task execution thread itself, we try to call kill again
since the reasonIfKilled is set. However, this time we pass interruptThread as
false explicitly since we don't know the status of the previous call.
```
taskThread = Thread.currentThread()
if (_reasonIfKilled != null) {
  // <-- only the context will be set
  kill(interruptThread = false, _reasonIfKilled)
}
```
The TaskReaper has also finished its one and only attempt at interrupting the
task (the taskRunner.kill call below), since it does not retry in this case.
As a result, the task is never interrupted, and it can block on I/O or wait
calls that do not finish within the reaper timeout, leading to the JVM being
killed.
```
taskRunner.kill(interruptThread = interruptThread, reason = reason)
```
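The sequence described above can be reproduced in a small standalone model.
This is only a sketch: RacySketchTask and RaceSketch are hypothetical names,
the TaskContext handling is left out, and the kill is issued strictly before
the worker thread starts so that the outcome is deterministic.

```scala
// Simplified, hypothetical model of the kill path; not Spark's actual Task class.
class RacySketchTask {
  @volatile private var reasonIfKilled: String = null
  @volatile private var taskThread: Thread = null

  def kill(interruptThread: Boolean, reason: String): Unit = {
    require(reason != null)
    reasonIfKilled = reason
    // taskThread is still null if the task has not started running.
    if (interruptThread && taskThread != null) {
      taskThread.interrupt()
    }
  }

  // Runs on the task thread; returns whether that thread was interrupted.
  def run(): Boolean = {
    taskThread = Thread.currentThread()
    if (reasonIfKilled != null) {
      // Mirrors the second kill call: the interrupt is explicitly disabled.
      kill(interruptThread = false, reasonIfKilled)
    }
    Thread.interrupted() // reads and clears the interrupt flag
  }
}

object RaceSketch {
  // Issues the kill before the task thread starts, as in the scenario above.
  def interruptedWhenKilledBeforeStart(): Boolean = {
    val task = new RacySketchTask
    task.kill(interruptThread = true, reason = "Stage cancelled")
    var interrupted = false
    val worker = new Thread(() => { interrupted = task.run() })
    worker.start()
    worker.join()
    interrupted
  }
}
```

In this model, RaceSketch.interruptedWhenKilledBeforeStart() returns false:
the requested interrupt is dropped by the first call and suppressed by the
second.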
The change fixes this by explicitly interrupting the thread when a previously
requested interrupt was not serviced.
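One way to service a missed interrupt is to remember that it was requested and
replay it once taskThread is known. The sketch below illustrates that idea
only; it is not the code in this PR, and FixedSketchTask, FixSketch and the
interruptRequested field are hypothetical names.

```scala
// Hypothetical sketch of servicing a pending interrupt; not the PR's actual diff.
class FixedSketchTask {
  @volatile private var reasonIfKilled: String = null
  @volatile private var taskThread: Thread = null
  // Assumed field: records that some caller asked for a thread interrupt.
  @volatile private var interruptRequested: Boolean = false

  def kill(interruptThread: Boolean, reason: String): Unit = {
    require(reason != null)
    reasonIfKilled = reason
    if (interruptThread) {
      interruptRequested = true // recorded before taskThread is read
    }
    val thread = taskThread
    if (interruptThread && thread != null) {
      thread.interrupt()
    }
  }

  // Runs on the task thread; returns whether that thread was interrupted.
  def run(): Boolean = {
    taskThread = Thread.currentThread()
    if (reasonIfKilled != null) {
      // Replay the earlier request instead of hard-coding false.
      kill(interruptThread = interruptRequested, reasonIfKilled)
    }
    Thread.interrupted() // reads and clears the interrupt flag
  }
}

object FixSketch {
  // Issues the kill before the task thread starts, then runs the task.
  def interruptedWhenKilledBeforeStart(interruptThread: Boolean): Boolean = {
    val task = new FixedSketchTask
    task.kill(interruptThread, reason = "Stage cancelled")
    var interrupted = false
    val worker = new Thread(() => { interrupted = task.run() })
    worker.start()
    worker.join()
    interrupted
  }
}
```

Because all three fields are volatile, either kill observes a non-null
taskThread or run observes interruptRequested, so the request is not lost. In
the worst case the thread is interrupted twice, which is harmless here. A kill
issued with interruptThread = false is never upgraded to an interrupt.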
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing unit tests