alibiyeslambek commented on code in PR #42297:
URL: https://github.com/apache/spark/pull/42297#discussion_r1388358193
##########
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala:
##########
@@ -190,6 +190,13 @@ class ExecutorPodsAllocator(
newlyCreatedExecutors.filterKeys(schedulerKnownExecs.contains(_)).mapValues(_._1)
newlyCreatedExecutors --= schedulerKnownNewlyCreatedExecs.keySet
+ // If executor was created and removed in a short period, then it is
possible that the creation
Review Comment:
Hey @dongjoon-hyun, sorry for the late reply. Here's an example of a failure
scenario:
- Driver requests an executor
- exec-1 gets created and registers with driver
- exec-1 is moved from newlyCreatedExecutors to
schedulerKnownNewlyCreatedExecs
- exec-1 gets deleted very quickly (~1-30 sec) after registration
- ExecutorPodsWatchSnapshotSource fails to catch the creation of the pod
(e.g. websocket connection was reset, k8s-apiserver was down, etc.)
- ExecutorPodsPollingSnapshotSource fails to catch the creation because it
runs every 30 secs, but the executor was removed before the next poll
- exec-1 is never removed from schedulerKnownNewlyCreatedExecs
- ExecutorPodsAllocator never requests a new executor because its slot is
still occupied by exec-1, since schedulerKnownNewlyCreatedExecs is never cleared.
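
To make the stuck slot concrete, here is a toy model of the bookkeeping in the
steps above. It is a minimal sketch, not the real ExecutorPodsAllocator: the
class AllocatorSketch, its methods requestExecutor and onSnapshot, and the
slot-counting rule are assumptions made up for illustration. Only the two set
names are taken from the code under review.

```scala
import scala.collection.mutable

// Toy model only (hypothetical); the real allocator tracks far more state.
final class AllocatorSketch(targetExecutors: Int) {
  // Requested from Kubernetes, not yet seen in any pod snapshot.
  val newlyCreatedExecutors = mutable.Set.empty[Long]
  // Registered with the driver, but still not seen in any pod snapshot.
  val schedulerKnownNewlyCreatedExecs = mutable.Set.empty[Long]

  def requestExecutor(id: Long): Unit = newlyCreatedExecutors += id

  // One allocation round; returns how many new executors would be requested.
  def onSnapshot(podsInSnapshot: Set[Long], schedulerKnownExecs: Set[Long]): Int = {
    // In this model an entry leaves either set only when a snapshot shows the pod.
    newlyCreatedExecutors --= podsInSnapshot
    schedulerKnownNewlyCreatedExecs --= podsInSnapshot

    // Move executors that registered with the driver but are unseen in snapshots.
    val registered = newlyCreatedExecutors.filter(schedulerKnownExecs.contains).toSet
    schedulerKnownNewlyCreatedExecs ++= registered
    newlyCreatedExecutors --= registered

    // Assumed rule: every tracked executor occupies one slot.
    val occupied = podsInSnapshot.size +
      newlyCreatedExecutors.size +
      schedulerKnownNewlyCreatedExecs.size
    math.max(0, targetExecutors - occupied)
  }
}

object StuckSlotDemo {
  def main(args: Array[String]): Unit = {
    val allocator = new AllocatorSketch(targetExecutors = 1)
    allocator.requestExecutor(1L)

    // exec-1 registered with the driver, but no snapshot contains its pod.
    println(allocator.onSnapshot(Set.empty, Set(1L)))  // 0: slot is taken

    // exec-1 is gone; neither the watch nor the poll ever reported it.
    println(allocator.onSnapshot(Set.empty, Set.empty)) // 0: slot still looks taken
    println(allocator.schedulerKnownNewlyCreatedExecs.contains(1L)) // true
  }
}
```

In this model the second round still returns 0 even though no executor exists,
because nothing ever removes exec-1 from schedulerKnownNewlyCreatedExecs.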
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]