[
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461024#comment-16461024
]
Matt Cheah commented on SPARK-24135:
------------------------------------
I think we should not count these towards job failures, and that we should
always retry asking for executors whenever these kinds of errors do occur. The
reason for this is because Spark is acting much like a Kubernetes controller.
Kubernetes controllers handle retries both in the main logic and in
initialization errors. When we get an initialization error, we should retry
indefinitely. We're using the low level Pod API which is in and of itself not
common, because most users of Kubernetes will instead use the higher order
primitives in jobs, stateful sets, deployments, etc. Many clusters with their
initializers are configured under the assumption that the Kubernetes
controllers will handle retrying the execution of tasks. Therefore it is
idiomatic to Kubernetes to make Spark consistent with these higher order
primitives; it brings Spark into closer consistency with everything else
running in the cluster. We could have a fixed number of these retries, but that
too would be inconsistent with the behavior of the other controllers.
In the case that there is a bug in the initialization code itself, and that bug
is completely deterministic or an external component is down - then the Spark
job is never going to make progress anyways, because every executor pod that is
launched will be unable to start running at all. At that point it would be
obvious to the user that their job isn't working well and they can start
discussions with the cluster admin to troubleshoot the issue.
If we're of the opinion that Spark behaves much like a custom controller, an
argument can even be made that we should always retry running executors every
time they fail at all - regardless of the class of error that comes up. This is
once again equivalent behavior to replication controllers and deployments.
A minor side note - all of the above assumes static allocation, where the
application always wants a specific number of executors to be running at all
times if cluster resources allow. In dynamic allocation, you can afford to lose
some executors and one would still be satisfying the contract of "at least X
executors are up".
> [K8s] Executors that fail to start up because of init-container errors are
> not retried and limit the executor pool size
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.3.0
> Reporter: Matt Cheah
> Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states.
> When executors fail in these ways, they are removed from the pending
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod
> enters the {{Init:Error}} state. This state comes up when the executor fails
> to launch because one of its init-containers fails. Spark itself doesn't
> attach any init-containers to the executors. However, custom web hooks can
> run on the cluster and attach init-containers to the executor pods.
> Additionally, pod presets can specify init containers to run on these pods.
> Therefore Spark should be handling the {{Init:Error}} cases regardless if
> Spark itself is aware of init-containers or not.
> This class of error is particularly bad because when we hit this state, the
> failed executor will never start, but it's still seen as pending by the
> executor allocator. The executor allocator won't request more rounds of
> executors because its current batch hasn't been resolved to either running or
> failed. Therefore we end up with being stuck with the number of executors
> that successfully started before the faulty one failed to start, potentially
> creating a fake resource bottleneck.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]