[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-24135:
------------------------------------
Assignee: (was: Apache Spark)
> [K8s] Executors that fail to start up because of init-container errors are
> not retried and limit the executor pool size
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.3.0
> Reporter: Matt Cheah
> Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states.
> When executors fail in these ways, they are removed from the pending
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod
> enters the {{Init:Error}} state. This state comes up when the executor fails
> to launch because one of its init-containers fails. Spark itself doesn't
> attach any init-containers to the executors. However, custom web hooks can
> run on the cluster and attach init-containers to the executor pods.
> Additionally, pod presets can specify init containers to run on these pods.
> Therefore Spark should handle the {{Init:Error}} case regardless of whether
> Spark itself is aware of init-containers.
> This class of error is particularly bad because when we hit this state, the
> failed executor will never start, but it's still seen as pending by the
> executor allocator. The executor allocator won't request more rounds of
> executors because its current batch hasn't been resolved to either running or
> failed. We therefore end up stuck with only the executors that started
> successfully before the faulty one failed to start, potentially creating an
> artificial resource bottleneck.
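
The stall described above can be illustrated with a small, self-contained
model. This is only a sketch under stated assumptions, not Spark's
implementation: ToyAllocator, simulate, and the batch and target sizes are
hypothetical names and values chosen for illustration. The pod status is
modelled as a plain dict whose keys mirror the Kubernetes Pod status fields
(initContainerStatuses, state.terminated.exitCode), and a failed
init-container is assumed to surface as a terminated init-container with a
non-zero exit code.

```python
from typing import List, Set


def init_container_failed(pod_status: dict) -> bool:
    """Return True if any init-container terminated with a non-zero exit code."""
    for cs in pod_status.get("initContainerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        if terminated is not None and terminated.get("exitCode", 0) != 0:
            return True
    return False


class ToyAllocator:
    """Hypothetical stand-in for the executor allocator (not Spark's class)."""

    def __init__(self, target: int, batch_size: int, handle_init_errors: bool):
        self.target = target
        self.batch_size = batch_size
        self.handle_init_errors = handle_init_errors
        self.pending: Set[str] = set()
        self.running: Set[str] = set()
        self.next_id = 0

    def request_round(self) -> List[str]:
        """Request a new batch only once the previous batch is fully resolved."""
        if self.pending:
            return []
        needed = max(self.target - len(self.running), 0)
        batch = []
        for _ in range(min(self.batch_size, needed)):
            batch.append(f"exec-{self.next_id}")
            self.next_id += 1
        self.pending.update(batch)
        return batch

    def on_pod_update(self, name: str, pod_status: dict) -> None:
        """Resolve a pending executor as running or, optionally, as failed."""
        if name not in self.pending:
            return
        if pod_status.get("phase") == "Running":
            self.pending.discard(name)
            self.running.add(name)
        elif self.handle_init_errors and init_container_failed(pod_status):
            # Treat the executor as failed so a later round can replace it.
            self.pending.discard(name)


# Illustrative statuses; the init-container name is made up.
RUNNING = {"phase": "Running"}
INIT_ERROR = {
    "initContainerStatuses": [
        {"name": "injected-init", "state": {"terminated": {"exitCode": 1}}}
    ]
}


def simulate(handle_init_errors: bool, rounds: int = 5) -> int:
    """Run a few allocation rounds in which only the first executor fails."""
    alloc = ToyAllocator(target=4, batch_size=2,
                         handle_init_errors=handle_init_errors)
    for _ in range(rounds):
        for name in alloc.request_round():
            status = INIT_ERROR if name == "exec-0" else RUNNING
            alloc.on_pod_update(name, status)
    return len(alloc.running)


if __name__ == "__main__":
    print("without Init:Error handling:", simulate(False))
    print("with Init:Error handling:", simulate(True))
```

In this model, ignoring the init-container failure leaves exec-0 pending
forever, so no further rounds are requested and the pool stays at the single
executor that started. Resolving the failed pod as failed lets later rounds
proceed until the target of four executors is reached.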