Alibi Yeslambek created SPARK-44609:
---------------------------------------
Summary: ExecutorPodsAllocator doesn't create new executors if no
pod snapshot captured pod creation
Key: SPARK-44609
URL: https://issues.apache.org/jira/browse/SPARK-44609
Project: Spark
Issue Type: Bug
Components: Scheduler
Affects Versions: 3.4.1
Reporter: Alibi Yeslambek
The following race condition exists in ExecutorPodsAllocator when running a
Spark application with static allocation on Kubernetes with numExecutors >= 1:
* Driver requests an executor
* exec-1 gets created and registers with the driver
* exec-1 is moved from {{newlyCreatedExecutors}} to
{{schedulerKnownNewlyCreatedExecs}}
* exec-1 gets deleted very quickly (~1-30 sec) after registration
* {{ExecutorPodsWatchSnapshotSource}} fails to catch the creation of the pod
(e.g. websocket connection was reset, k8s-apiserver was down, etc.)
* {{ExecutorPodsPollingSnapshotSource}} fails to catch the creation because it
only runs every 30 secs, and the executor was removed much sooner than that
after creation
* exec-1 is never removed from {{schedulerKnownNewlyCreatedExecs}}
* {{ExecutorPodsAllocator}} never requests a replacement executor because its
slot is still occupied by exec-1, due to {{schedulerKnownNewlyCreatedExecs}}
never being cleared.
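The stuck-slot mechanics above can be sketched with a minimal model of the allocator's two bookkeeping collections. This is an illustrative sketch, not the actual Spark code (which is Scala, in {{ExecutorPodsAllocator}}); the field names mirror the report, while {{requestExecutor}}, {{onRegistered}}, {{onSnapshot}}, and {{occupiedSlots}} are hypothetical simplifications of the real methods:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of ExecutorPodsAllocator bookkeeping (not Spark code).
public class AllocatorSketch {
    // execId -> creation timestamp, for pods requested but not yet registered
    final Map<Long, Long> newlyCreatedExecutors = new HashMap<>();
    // execs that registered with the driver before any snapshot saw their pod
    final Set<Long> schedulerKnownNewlyCreatedExecs = new HashSet<>();

    void requestExecutor(long execId) {
        newlyCreatedExecutors.put(execId, System.currentTimeMillis());
    }

    void onRegistered(long execId) {
        // registration moves the exec from one collection to the other
        newlyCreatedExecutors.remove(execId);
        schedulerKnownNewlyCreatedExecs.add(execId);
    }

    void onSnapshot(Set<Long> podsInSnapshot) {
        // an exec leaves the "known" set only once some snapshot contains its
        // pod; a pod created AND deleted between snapshots never appears here
        schedulerKnownNewlyCreatedExecs.removeAll(podsInSnapshot);
    }

    // slots the allocator believes are already taken, so no new pod is requested
    int occupiedSlots() {
        return newlyCreatedExecutors.size() + schedulerKnownNewlyCreatedExecs.size();
    }
}
```

In this model, if exec-1 registers and its pod is then deleted before either snapshot source captures it, every subsequent {{onSnapshot}} call sees a pod set without exec-1, so the entry is never removed and {{occupiedSlots}} stays at 1 forever: the allocator never asks for a replacement.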
--
This message was sent by Atlassian Jira
(v8.20.10#820010)