peter-toth opened a new pull request, #52899:
URL: https://github.com/apache/spark/pull/52899
### What changes were proposed in this pull request?
When `ExecutorPodsLifecycleManager` processes the sequence of snapshots in
`onNewSnapshots()`, it maintains the `execIdsRemovedInThisRound` set of executor
ids so that it does not try to delete the same executor pod multiple times.
But this logic seems to have a flaw, because it depends on the return value of
`onFinalNonDeletedState()`, which depends on `removeExecutorFromSpark()`, which
in turn depends on whether the executor has already been added to
`removedExecutorsCache`.
Consider the following scenario:
1. `onNewSnapshots()` runs and an executor is selected for deletion due to a
`PodFailed` or `PodSucceeded` event.
Because `removedExecutorsCache` doesn't contain the executor,
`onFinalNonDeletedState()` returns true and the executor is added to
`execIdsRemovedInThisRound`.
Before the executor is added to `execIdsRemovedInThisRound`,
`onFinalNonDeletedState()` calls `removeExecutorFromK8s()` to delete the
Kubernetes pod.
Because the executor is now in `execIdsRemovedInThisRound`, the pod deletion is
attempted only once, regardless of how many snapshots we process in
`onNewSnapshots()`.
2. Suppose the pod deletion failed, and `onNewSnapshots()` runs again
`spark.kubernetes.executor.eventProcessingInterval` later (1s by default).
Because the executor is already in `removedExecutorsCache`,
`onFinalNonDeletedState()` now returns false and the executor is never added
to `execIdsRemovedInThisRound`, which results in trying to delete the pod as
many times as the number of snapshots we process in `onNewSnapshots()`.
In our case, the initial pod deletion failed because the Kubernetes API was
being flooded, so we issued more and more deletes...
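
The scenario above can be reproduced with a small stand-alone model. This is only a sketch under stated assumptions, not the actual `ExecutorPodsLifecycleManager` code: `LifecycleModel`, `LifecycleDemo` and `deleteCalls` are hypothetical names, the cache is a plain mutable set, each snapshot is reduced to the set of executor ids in a final state, and the pod delete is assumed to fail every time.

```scala
import scala.collection.mutable

// Simplified, hypothetical model of the bookkeeping described above.
// The Kubernetes client, the scheduler backend and the real cache are
// replaced by plain collections and a counter.
class LifecycleModel {
  // Stands in for `removedExecutorsCache`; it survives across rounds.
  private val removedExecutorsCache = mutable.Set.empty[Long]

  // Number of pod delete requests that would be sent to the Kubernetes API.
  var deleteCalls: Int = 0

  // Returns true only the first time this executor is removed from Spark.
  private def removeExecutorFromSpark(execId: Long): Boolean =
    removedExecutorsCache.add(execId)

  // Stands in for the pod delete request; assumed to fail, so the pod stays.
  private def removeExecutorFromK8s(execId: Long): Unit =
    deleteCalls += 1

  // Issues the delete and reports whether the executor was newly removed.
  private def onFinalNonDeletedState(execId: Long): Boolean = {
    val newlyRemoved = removeExecutorFromSpark(execId)
    removeExecutorFromK8s(execId)
    newlyRemoved
  }

  // One call models one round; the per-round set is recreated every time.
  def onNewSnapshots(snapshots: Seq[Set[Long]]): Unit = {
    val execIdsRemovedInThisRound = mutable.Set.empty[Long]
    for (snapshot <- snapshots; execId <- snapshot) {
      if (!execIdsRemovedInThisRound.contains(execId) &&
          onFinalNonDeletedState(execId)) {
        execIdsRemovedInThisRound += execId
      }
    }
  }
}

object LifecycleDemo {
  def main(args: Array[String]): Unit = {
    val model = new LifecycleModel
    // Five snapshots that all report executor 1 in a final state.
    val snapshots = Seq.fill(5)(Set(1L))

    model.onNewSnapshots(snapshots)
    println(model.deleteCalls) // 1: the per-round set suppresses the repeats

    model.onNewSnapshots(snapshots)
    println(model.deleteCalls) // 6: the second round issued one delete per snapshot
  }
}
```

In this model the first round issues a single delete, while every later round issues one delete per snapshot, because the executor is already in the cache and is therefore never recorded in the per-round set.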
### Why are the changes needed?
Fix the above scenario.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
TODO.
### Was this patch authored or co-authored using generative AI tooling?
No.