[ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461024#comment-16461024
 ] 

Matt Cheah commented on SPARK-24135:
------------------------------------

I think we should not count these towards job failures, and that we should 
always retry requesting executors whenever these kinds of errors occur. The 
reason is that Spark is acting much like a Kubernetes controller. Kubernetes 
controllers retry on failures both in their main logic and during 
initialization, so when we get an initialization error, we should retry 
indefinitely. We're using the low-level Pod API, which is itself uncommon: 
most Kubernetes users instead rely on higher-order primitives such as jobs, 
stateful sets, and deployments. Many clusters configure their initializers 
under the assumption that the Kubernetes controllers will handle retrying the 
execution of tasks. It is therefore idiomatic to make Spark behave like these 
higher-order primitives; doing so brings Spark into closer consistency with 
everything else running in the cluster. We could cap the number of these 
retries, but that too would be inconsistent with the behavior of the other 
controllers.
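
As a rough illustration of the controller-style behavior argued for here, the 
following is a minimal sketch in Python (chosen for brevity; it is not Spark 
code). The names `reconcile` and `launch_pod` are hypothetical, and 
`max_rounds` exists only so the example terminates; a real controller would 
keep looping.

```python
from typing import Callable, Dict


def reconcile(desired: int, launch_pod: Callable[[], str], max_rounds: int) -> Dict[str, int]:
    """Keep requesting executors until `desired` of them are running.

    launch_pod() is a hypothetical callback returning the outcome of one
    pod launch: "Running" or "Init:Error". Init failures are tallied for
    visibility but are never treated as job failures.
    """
    running = 0
    init_errors = 0
    for _ in range(max_rounds):
        missing = desired - running
        if missing == 0:
            break
        for _ in range(missing):
            if launch_pod() == "Running":
                running += 1
            else:
                # Retried in the next round; does not count against the job.
                init_errors += 1
    return {"running": running, "init_errors": init_errors}


# Two launches hit an init-container error before the pool fills up.
outcomes = iter(["Init:Error", "Running", "Init:Error", "Running", "Running"])
result = reconcile(3, lambda: next(outcomes), max_rounds=10)
print(result)  # {'running': 3, 'init_errors': 2}
```

The point of the sketch is only that the failure tally and the retry decision 
are decoupled: the loop compares desired and actual state, as a controller 
would, rather than giving up after some number of errors.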

If there is a deterministic bug in the initialization code itself, or an 
external component is down, then the Spark job is never going to make progress 
anyway, because every executor pod that is launched will be unable to start. 
At that point it would be obvious to the user that their job isn't working, 
and they can work with the cluster admin to troubleshoot the issue.

If we're of the opinion that Spark behaves much like a custom controller, an 
argument can even be made that we should always retry running executors every 
time they fail, regardless of the class of error that comes up. This again 
matches the behavior of replication controllers and deployments.

A minor side note: all of the above assumes static allocation, where the 
application always wants a specific number of executors to be running at all 
times if cluster resources allow. In dynamic allocation, you can afford to lose 
some executors and still satisfy the contract of "at least X executors are 
up".

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24135
>                 URL: https://issues.apache.org/jira/browse/SPARK-24135
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Matt Cheah
>            Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should handle the {{Init:Error}} cases regardless of 
> whether Spark itself is aware of init-containers.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up stuck with the number of executors 
> that successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.
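
The stall described in the issue can be reproduced with a toy model. This is 
a sketch in Python under simplifying assumptions, not the actual 
KubernetesClusterSchedulerBackend logic: the `simulate` function, its 
parameters, and the batch-resolution rule are hypothetical stand-ins. The 
state names {{ERROR}}, {{DELETED}} and {{Init:Error}} come from the issue; 
"RUNNING" is an assumed label for a pod that started successfully.

```python
import itertools
from typing import Iterator, Set

# States the driver already treats as failures, per the issue description.
FAILED_STATES_ORIGINAL = {"ERROR", "DELETED"}
# Proposed behavior: also treat init-container failures as failures.
FAILED_STATES_WITH_INIT = FAILED_STATES_ORIGINAL | {"Init:Error"}


def simulate(target: int, batch_size: int, outcomes: Iterator[str],
             failed_states: Set[str], rounds: int) -> int:
    """Return how many executors are running after `rounds` allocator ticks.

    `outcomes` yields the state each newly requested pod ends up in.
    """
    running = 0
    pending = []  # pods the allocator still considers unresolved
    for _ in range(rounds):
        still_pending = []
        for state in pending:
            if state == "RUNNING":
                running += 1
            elif state in failed_states:
                pass  # dropped from pending, so it can be re-requested
            else:
                still_pending.append(state)  # unrecognized state stays pending
        pending = still_pending
        # A new batch is requested only once the previous one is resolved.
        if not pending and running < target:
            count = min(batch_size, target - running)
            pending = [next(outcomes) for _ in range(count)]
    return running


def pod_outcomes() -> Iterator[str]:
    # The second pod ever requested fails in an init-container.
    return itertools.chain(["RUNNING", "Init:Error"], itertools.repeat("RUNNING"))


print(simulate(4, 2, pod_outcomes(), FAILED_STATES_ORIGINAL, rounds=6))   # 1
print(simulate(4, 2, pod_outcomes(), FAILED_STATES_WITH_INIT, rounds=6))  # 4
```

With the original failure set, the {{Init:Error}} pod never leaves the 
pending list, so no further batch is requested and the pool stays at one 
executor. Once that state is treated as a failure, the allocator moves on 
and reaches the target.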



