Parth Chandra created SPARK-55075:
-------------------------------------
Summary: Fail application after too many executor pod creation errors
Key: SPARK-55075
URL: https://issues.apache.org/jira/browse/SPARK-55075
Project: Spark
Issue Type: Improvement
Components: Kubernetes
Affects Versions: 4.1.1
Reporter: Parth Chandra
On K8s, pod creation can fail. These failures are frequently recoverable,
but sometimes they are not.
Spark currently treats all pod creation failures as recoverable and keeps
attempting to create new executors, which is not ideal if the failure cannot be
recovered from.
Additionally, in some cases the same error may be either recoverable or
unrecoverable. For instance, an ImagePullErr may be caused by a temporary error
in accessing the container registry (recoverable) or by a missing image
(unrecoverable).
One way to handle this is to retry pod creation with backoff once we learn
that pod creation has failed. In addition, if the number of failures exceeds a
threshold, fail the application so that the user knows that there is a problem
in the cluster.
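As a rough illustration of the proposal (not Spark's actual implementation or
API), the retry-with-backoff and failure-threshold idea could be sketched as
below. Every name here (create_pod_with_retry, backoff_delays, max_failures,
TooManyPodCreationFailures) is hypothetical, and the exponential schedule and
default values are assumptions chosen only for the example.

```python
import time
from typing import Callable, List


class TooManyPodCreationFailures(RuntimeError):
    """Hypothetical error raised once the failure threshold is reached."""


def backoff_delays(base_seconds: float, max_seconds: float, attempts: int) -> List[float]:
    """Exponential schedule: base, 2*base, 4*base, ..., capped at max_seconds."""
    return [min(base_seconds * (2 ** i), max_seconds) for i in range(attempts)]


def create_pod_with_retry(
    create_pod: Callable[[], None],
    max_failures: int = 5,
    base_seconds: float = 1.0,
    max_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call create_pod until it succeeds and return how many attempts failed.

    Raises TooManyPodCreationFailures after max_failures failed attempts,
    which stands in for "fail the application" in this sketch.
    """
    if max_failures < 1:
        raise ValueError("max_failures must be at least 1")

    delays = backoff_delays(base_seconds, max_seconds, max_failures)
    last_error = None
    for attempt in range(max_failures):
        try:
            create_pod()
            return attempt  # number of failures seen before this success
        except Exception as err:  # a real implementation would inspect the error
            last_error = err
            # Back off before the next attempt; no wait after the final failure.
            if attempt + 1 < max_failures:
                sleep(delays[attempt])

    raise TooManyPodCreationFailures(
        f"pod creation failed {max_failures} times"
    ) from last_error
```

With this shape, a transient registry error is absorbed by the retries, while a
missing image exhausts the threshold and surfaces as a single clear failure.
Passing sleep as a parameter is only a convenience so the schedule can be
checked without actually waiting.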