Alibi Yeslambek created SPARK-44609:
---------------------------------------

             Summary: ExecutorPodsAllocator doesn't create new executors if no 
pod snapshot captured pod creation
                 Key: SPARK-44609
                 URL: https://issues.apache.org/jira/browse/SPARK-44609
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 3.4.1
            Reporter: Alibi Yeslambek


There is the following race condition in ExecutorPodsAllocator when running a 
Spark application with static allocation on Kubernetes with numExecutors >= 1:
 * Driver requests an executor
 * exec-1 gets created and registers with driver
 * exec-1 is moved from {{newlyCreatedExecutors}} to 
{{schedulerKnownNewlyCreatedExecs}}
 * exec-1 gets deleted very quickly (~1-30 sec) after registration
 * {{ExecutorPodsWatchSnapshotSource}} fails to catch the creation of the pod 
(e.g. websocket connection was reset, k8s-apiserver was down, etc.)
 * {{ExecutorPodsPollingSnapshotSource}} also fails to catch the creation 
because it runs only every 30 secs, and the executor was removed well before 
the next poll
 * exec-1 is never removed from {{schedulerKnownNewlyCreatedExecs}}
 * {{ExecutorPodsAllocator}} will never request a new executor because its 
slot is still occupied by exec-1, since {{schedulerKnownNewlyCreatedExecs}} is 
never cleared.
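
The bookkeeping race above can be sketched with a small model. This is an 
illustrative simplification in plain Java, not Spark's actual code: the field 
names mirror the sets named in this report, but the class, method names, and 
allocation arithmetic are assumptions made for the sketch.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of the ExecutorPodsAllocator bookkeeping described above.
// Names mirror the report (newlyCreatedExecutors,
// schedulerKnownNewlyCreatedExecs); everything else is hypothetical.
public class AllocatorSketch {
    final Set<Long> newlyCreatedExecutors = new HashSet<>();
    final Set<Long> schedulerKnownNewlyCreatedExecs = new HashSet<>();
    final Set<Long> runningExecs = new HashSet<>();
    final int targetNumExecutors = 1; // static allocation target

    // Driver requests a new executor pod.
    void requestExecutor(long id) {
        newlyCreatedExecutors.add(id);
    }

    // Scheduler backend reports the executor registered with the driver:
    // the id moves from newlyCreatedExecutors to
    // schedulerKnownNewlyCreatedExecs.
    void onExecutorRegistered(long id) {
        if (newlyCreatedExecutors.remove(id)) {
            schedulerKnownNewlyCreatedExecs.add(id);
        }
        runningExecs.add(id);
    }

    // Scheduler backend reports the executor as lost. Note that
    // schedulerKnownNewlyCreatedExecs is NOT cleaned up here -- that only
    // happens via snapshots.
    void onExecutorLost(long id) {
        runningExecs.remove(id);
    }

    // Snapshot processing: an id leaves schedulerKnownNewlyCreatedExecs only
    // once some pod snapshot has captured its pod. If no snapshot ever saw
    // the pod (watch reset, apiserver down, pod gone between polls), the id
    // stays there forever.
    void onNewSnapshot(Set<Long> podIdsInSnapshot) {
        schedulerKnownNewlyCreatedExecs.removeAll(podIdsInSnapshot);
    }

    // Allocation decision: request pods only for the shortfall between the
    // target and everything the allocator believes exists.
    int podsToRequest() {
        int knownOrPending = newlyCreatedExecutors.size()
                + schedulerKnownNewlyCreatedExecs.size()
                + runningExecs.size();
        return Math.max(0, targetNumExecutors - knownOrPending);
    }

    // Replays the sequence from this report; returns how many replacement
    // pods the allocator would request at the end (1 is expected, 0 is the
    // buggy outcome).
    static int simulateBug() {
        AllocatorSketch alloc = new AllocatorSketch();
        alloc.requestExecutor(1L);            // driver requests exec-1
        alloc.onExecutorRegistered(1L);       // exec-1 registers with driver
        alloc.onExecutorLost(1L);             // exec-1 deleted ~1-30 sec later
        alloc.onNewSnapshot(new HashSet<>()); // no snapshot ever saw the pod
        // exec-1 still occupies a slot via schedulerKnownNewlyCreatedExecs,
        // so the allocator requests nothing despite having 0 live executors.
        return alloc.podsToRequest();
    }

    public static void main(String[] args) {
        System.out.println("pods requested after loss: " + simulateBug());
    }
}
```

Running the simulation prints 0 requested pods even though no executor is 
alive, which is the hang described above: the stale entry in 
{{schedulerKnownNewlyCreatedExecs}} permanently satisfies the allocation 
target.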



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
