[
https://issues.apache.org/jira/browse/SPARK-44609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-44609:
-----------------------------------
Labels: pull-request-available (was: )
> ExecutorPodsAllocator doesn't create new executors if no pod snapshot
> captured pod creation
> -------------------------------------------------------------------------------------------
>
> Key: SPARK-44609
> URL: https://issues.apache.org/jira/browse/SPARK-44609
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Scheduler
> Affects Versions: 3.4.1
> Reporter: Alibi Yeslambek
> Priority: Major
> Labels: pull-request-available
>
> There is a race condition in ExecutorPodsAllocator when running a
> Spark application with static allocation on Kubernetes with numExecutors >= 1:
> * Driver requests an executor
> * exec-1 gets created and registers with driver
> * exec-1 is moved from {{newlyCreatedExecutors}} to
> {{schedulerKnownNewlyCreatedExecs}}
> * exec-1 is deleted very quickly (~1-30 sec) after registration
> * {{ExecutorPodsWatchSnapshotSource}} fails to catch the creation of the pod
> (e.g. websocket connection was reset, k8s-apiserver was down, etc.)
> * {{ExecutorPodsPollingSnapshotSource}} fails to catch the creation because
> it runs every 30 secs, but the executor was removed much sooner after creation
> * exec-1 is never removed from {{schedulerKnownNewlyCreatedExecs}}
> * {{ExecutorPodsAllocator}} will never request a new executor because its
> slot is occupied by exec-1, due to {{schedulerKnownNewlyCreatedExecs}} never
> being cleared.
>
> A fix is proposed here: https://github.com/apache/spark/pull/42297
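
The sequence described in the issue can be reproduced with a small toy model. The sketch below is not Spark's code: the class `ToyAllocator`, its methods, and the snapshot representation (a set of executor IDs) are hypothetical simplifications. It assumes only what the description states, namely that an executor moves from `newlyCreatedExecutors` to `schedulerKnownNewlyCreatedExecs` on registration, and that an entry leaves those sets only once a pod snapshot contains the executor.

```python
class ToyAllocator:
    """Hypothetical, simplified stand-in for the allocator's bookkeeping."""

    def __init__(self, target_executors):
        self.target_executors = target_executors
        self.next_id = 1
        self.newly_created = set()                  # requested, not yet seen anywhere
        self.scheduler_known_newly_created = set()  # registered, not yet seen in a snapshot

    def on_registered(self, exec_id):
        # The executor registered with the driver before any snapshot showed its pod.
        if exec_id in self.newly_created:
            self.newly_created.remove(exec_id)
            self.scheduler_known_newly_created.add(exec_id)

    def on_snapshot(self, pods_in_snapshot):
        # Assumption: only a snapshot that contains the pod clears these entries.
        self.newly_created -= pods_in_snapshot
        self.scheduler_known_newly_created -= pods_in_snapshot
        occupied = (len(pods_in_snapshot)
                    + len(self.newly_created)
                    + len(self.scheduler_known_newly_created))
        requested = []
        for _ in range(self.target_executors - occupied):
            exec_id = f"exec-{self.next_id}"
            self.next_id += 1
            self.newly_created.add(exec_id)
            requested.append(exec_id)
        return requested


# Race from the issue: no snapshot ever contains exec-1.
stuck = ToyAllocator(target_executors=1)
stuck_first = stuck.on_snapshot(set())    # requests exec-1
stuck.on_registered("exec-1")
# exec-1 is deleted before the watch or the poll observes it,
# so every later snapshot is empty.
stuck_later = [stuck.on_snapshot(set()) for _ in range(3)]

# Healthy run: one snapshot observes exec-1 before it disappears.
healthy = ToyAllocator(target_executors=1)
healthy_first = healthy.on_snapshot(set())
healthy.on_registered("exec-1")
healthy_seen = healthy.on_snapshot({"exec-1"})  # clears the bookkeeping entry
healthy_after = healthy.on_snapshot(set())      # pod is gone, slot is free again
```

In the first scenario `stuck_later` contains only empty lists and `exec-1` stays in `scheduler_known_newly_created`, so the slot is never refilled. In the second scenario the allocator requests `exec-2` as soon as the pod disappears from the snapshots.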