[ https://issues.apache.org/jira/browse/SPARK-52589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18010226#comment-18010226 ]
Dongjoon Hyun commented on SPARK-52589:
---------------------------------------

Thank you for reporting, [~rcheng].


Spark 3.5.5 on Kubernetes with Dynamic Allocation Hangs (minExecutors=0) Due to Stale Ghost Executors
-----------------------------------------------------------------------------------------------------

                 Key: SPARK-52589
                 URL: https://issues.apache.org/jira/browse/SPARK-52589
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes, Spark Core
    Affects Versions: 3.5.5
            Reporter: Richard Cheng
            Priority: Major

_TLDR: After upgrading from Spark 3.5.1 to 3.5.5, we observe that jobs using dynamic allocation on Kubernetes with {{spark.dynamicAllocation.minExecutors=0}} intermittently hang: once all executors are decommissioned, no new executors are allocated even when there is pending work. Reverting to 3.5.1 resolves the issue. This appears to be a severe regression._

*Repro Steps*
# Use Spark 3.5.5 on Kubernetes with dynamic allocation enabled and {{spark.dynamicAllocation.minExecutors=0}} (an illustrative configuration sketch is included at the end of this description).
# Submit a job that causes all executors to be decommissioned (e.g., an idle period).
# Submit more work to the same job/application.
# Observe that no new executors are allocated, and the job occasionally hangs with pending work.
# After reverting to 3.5.1, the same job works as expected.

Following the bump from 3.5.1 to 3.5.5, we started observing that when running Spark on Kubernetes with dynamic allocation and {{spark.dynamicAllocation.minExecutors=0}}, executor removal/decommissioning can sometimes leave the system in a state where no new executors are allocated, even though there is pending work. Our hypothesis is that there is a race or missed event between the Kubernetes pod watcher, the internal executor state in the driver, and {{ExecutorPodsAllocator}}, resulting in a "ghost executor" that prevents further scaling.

*Summary of our Investigation (via logging in [ExecutorPodsAllocator|https://github.com/apache/spark/blob/v3.5.5/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala]):*
# We never see the [log line from {{ExecutorPodsAllocator}}|https://github.com/apache/spark/blob/v3.5.5/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L397C9-L399C69] confirming that a request for another executor was sent to Kubernetes. We added logs to the conditional guarding that code block and found that {{podsToAllocateWithRpId.size == 0}}, which is why we never request an additional executor.
# Next, we checked why the {{podsToAllocateWithRpId}} map is empty: [it only adds an element if {{podCountForRpId < targetNum}}|https://github.com/apache/spark/blob/v3.5.5/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L378]. We added logs and confirmed that even though {{podCountForRpId}} should equal 0 once the only previous executor is decommissioned (and thus be less than {{targetNum}} = 1), {{podCountForRpId}} is still equal to 1.
# We investigated why {{podCountForRpId}} == 1 and found that [{{schedulerKnownNewlyCreatedExecsForRpId}}|https://github.com/apache/spark/blob/v3.5.5/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L326] still contains an entry for the previously created (and now decommissioned) executor. Logging confirmed that it is *not* because {{schedulerBackend.getExecutorIds()}} still contains an entry for a deleted executor post-decommission; the scheduler definitely removed the executor properly. Instead, the culprit appears to be [here|https://github.com/apache/spark/blob/v3.5.5/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L176]:
{code:java}
val k8sKnownExecIds = snapshots.flatMap(_.executorPods.keys).distinct
newlyCreatedExecutors --= k8sKnownExecIds
schedulerKnownNewlyCreatedExecs --= k8sKnownExecIds{code}
Our logging shows that {{k8sKnownExecIds}} is empty on every {{onNewSnapshots()}} cycle throughout the entire Spark job. The intended behavior is for the original executor to appear in a snapshot and then be removed from {{schedulerKnownNewlyCreatedExecs}}; that never happens in these hanging jobs, and there is no timeout logic to clear executors from {{schedulerKnownNewlyCreatedExecs}}. The pod allocator is permanently waiting for a newly created, scheduler-known executor to appear in a snapshot (a toy model of this bookkeeping is sketched at the end of this description).

We can confirm that reverting to 3.5.1 stops these intermittent hanging jobs. We've looked through the commits between the two versions, and we are stumped because we don't see obviously related changes in the dynamic allocation or pod snapshot logic. We really need to solve these hangs ASAP; any guidance would be appreciated!
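For completeness, here is a minimal sketch of the kind of driver-side configuration used for the repro above. Only {{spark.dynamicAllocation.enabled}}, {{spark.dynamicAllocation.minExecutors=0}}, and running on Kubernetes come from this report; the master URL, container image, and the remaining values are illustrative placeholders rather than our exact settings.
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of a dynamic-allocation-on-Kubernetes setup for the repro above.
// The master URL, image, and timeout/limit values are placeholders.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-hang-repro")
  .master("k8s://https://kubernetes.default.svc")                    // placeholder API server URL
  .config("spark.kubernetes.container.image", "example/spark:3.5.5") // placeholder image
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true") // no external shuffle service on K8s
  .config("spark.dynamicAllocation.minExecutors", "0")               // the setting this report is about
  .config("spark.dynamicAllocation.maxExecutors", "10")              // illustrative
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")      // lets all executors go away when idle
  .config("spark.decommission.enabled", "true")                      // executors are decommissioned, per step 2
  .getOrCreate()
{code}
With a short {{spark.dynamicAllocation.executorIdleTimeout}}, step 2 of the repro (all executors decommissioned) happens quickly after the first batch of work finishes, after which step 3 is simply submitting more work on the same session.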
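To make the suspected failure mode concrete, below is a small self-contained sketch of the bookkeeping as we understand it from the linked {{ExecutorPodsAllocator}} code. It is a simplified model, not Spark's actual classes: the snapshot contents, the running/pending counts, and {{schedulerKnownNewlyCreatedExecs}} are reduced to plain Scala collections. It only illustrates why an always-empty {{k8sKnownExecIds}} leaves the ghost entry in place and keeps {{podCountForRpId}} at the target, so no new pod request is ever made.
{code:scala}
import scala.collection.mutable

// Toy model of the bookkeeping described above. This is NOT Spark's code,
// only a simplified illustration of why the stale entry never clears.
object GhostExecutorSketch extends App {
  // Executor 1 was created and is known to the scheduler, but in the hanging
  // jobs it never shows up in any pod snapshot before it is decommissioned.
  val schedulerKnownNewlyCreatedExecs = mutable.Set(1L)

  // What our logs show: every onNewSnapshots() cycle the snapshots contain
  // no executor pods, so the derived id list is always empty.
  val snapshots: Seq[Map[Long, String]] = Seq(Map.empty)
  val k8sKnownExecIds = snapshots.flatMap(_.keys).distinct

  // The only cleanup path we found is this subtraction, which is a no-op
  // when k8sKnownExecIds is empty; there is no timeout-based fallback.
  schedulerKnownNewlyCreatedExecs --= k8sKnownExecIds

  // The allocator counts the stale entry toward the pods it believes exist
  // for the resource profile, so the "request more pods" branch never runs.
  val runningOrPendingPods = 0  // all real executors are gone
  val podCountForRpId = runningOrPendingPods + schedulerKnownNewlyCreatedExecs.size
  val targetNum = 1             // dynamic allocation wants one executor back

  println(s"podCountForRpId=$podCountForRpId, targetNum=$targetNum")
  println(s"request new pods? ${podCountForRpId < targetNum}")  // false -> the job hangs
}
{code}
If our reading is right, a single missed snapshot for a scheduler-known newly created executor is enough to pin {{podCountForRpId}} at or above {{targetNum}} indefinitely, which matches the permanent hang we observe.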