alibiyeslambek commented on code in PR #42297:
URL: https://github.com/apache/spark/pull/42297#discussion_r1388358193


##########
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala:
##########
@@ -190,6 +190,13 @@ class ExecutorPodsAllocator(
       newlyCreatedExecutors.filterKeys(schedulerKnownExecs.contains(_)).mapValues(_._1)
     newlyCreatedExecutors --= schedulerKnownNewlyCreatedExecs.keySet
 
+    // If executor was created and removed in a short period, then it is possible that the creation

Review Comment:
   Hey @dongjoon-hyun, sorry for the late reply. Here's an example of a failure scenario:
   
   - The driver requests an executor.
   - exec-1 is created and registers with the driver.
   - exec-1 is moved from newlyCreatedExecutors to schedulerKnownNewlyCreatedExecs.
   - exec-1 is deleted very quickly (~1-30 sec) after registration.
   - ExecutorPodsWatchSnapshotSource fails to catch the creation of the pod (e.g. the websocket connection was reset, k8s-apiserver was down, etc.).
   - ExecutorPodsPollingSnapshotSource fails to catch the creation because it runs every 30 secs, and the executor was removed well before the next poll.
   - exec-1 is never removed from schedulerKnownNewlyCreatedExecs.
   - ExecutorPodsAllocator never requests a new executor, because the slot is still counted as occupied by exec-1: its entry in schedulerKnownNewlyCreatedExecs is never cleared.
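
To make the scenario above concrete, here is a minimal sketch of the bookkeeping. It is a simplified model and not the real ExecutorPodsAllocator: only the names newlyCreatedExecutors, schedulerKnownNewlyCreatedExecs and schedulerKnownExecs come from the code under review. The AllocatorBookkeeping class, the onNewSnapshots signature, the target parameter and the way occupied slots are counted are assumptions made for illustration.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the allocator's bookkeeping.
class AllocatorBookkeeping {
  // Executors requested from Kubernetes but not yet seen in any pod snapshot
  // (executor id -> creation time).
  val newlyCreatedExecutors = mutable.LinkedHashMap.empty[Long, Long]
  // Executors that registered with the driver before any snapshot showed their pod.
  val schedulerKnownNewlyCreatedExecs = mutable.LinkedHashMap.empty[Long, Long]

  def requestExecutor(id: Long, now: Long): Unit =
    newlyCreatedExecutors.update(id, now)

  // One allocation round. Returns how many new executors would be requested.
  def onNewSnapshots(
      snapshotPodIds: Set[Long],
      schedulerKnownExecs: Set[Long],
      target: Int): Int = {
    // Pods observed in a snapshot are no longer tracked as "newly created".
    newlyCreatedExecutors --= snapshotPodIds
    schedulerKnownNewlyCreatedExecs --= snapshotPodIds

    // Registered but not yet observed executors move to the scheduler-known map.
    val known = newlyCreatedExecutors.filter { case (id, _) =>
      schedulerKnownExecs.contains(id)
    }
    schedulerKnownNewlyCreatedExecs ++= known
    newlyCreatedExecutors --= known.keySet

    // Assumption: every tracked entry counts as an occupied slot.
    val occupied = snapshotPodIds.size +
      newlyCreatedExecutors.size +
      schedulerKnownNewlyCreatedExecs.size
    math.max(0, target - occupied)
  }
}

object LeakedSlotDemo {
  def main(args: Array[String]): Unit = {
    val b = new AllocatorBookkeeping
    b.requestExecutor(id = 1L, now = 0L)
    // Round 1: exec-1 has registered, but no snapshot has shown its pod yet.
    val r1 = b.onNewSnapshots(Set.empty, schedulerKnownExecs = Set(1L), target = 1)
    // Round 2: exec-1 is already deleted; neither watch nor poll ever reported it.
    val r2 = b.onNewSnapshots(Set.empty, schedulerKnownExecs = Set.empty, target = 1)
    println(s"requested: $r1, $r2; leaked: ${b.schedulerKnownNewlyCreatedExecs.keys.mkString(",")}")
  }
}
```

In this model both rounds request zero executors and exec-1 stays in schedulerKnownNewlyCreatedExecs, because the only thing that removes an entry is a snapshot containing that pod, and no such snapshot ever arrives.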



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

