Parth Chandra created SPARK-55075:
-------------------------------------

             Summary: Fail application after too many executor pod creation errors
                 Key: SPARK-55075
                 URL: https://issues.apache.org/jira/browse/SPARK-55075
             Project: Spark
          Issue Type: Improvement
          Components: Kubernetes
    Affects Versions: 4.1.1
            Reporter: Parth Chandra


On K8s, pod creation can fail. These failures are frequently recoverable, but 
some are unrecoverable.
Spark currently treats every pod creation failure as recoverable and keeps 
attempting to create new executors, which is not ideal when the failure cannot 
be recovered from.

Additionally, in some cases the same error may be either recoverable or 
unrecoverable. For instance, an ErrImagePull may be caused by a temporary error 
in accessing the container registry (recoverable) or by a missing image 
(unrecoverable). 
One way to handle this is to retry pod creation with backoff once we learn that 
pod creation has failed. In addition, if the number of failures exceeds a 
threshold, fail the application so that the user knows that there is a problem 
in the cluster.
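
The proposal above could be sketched roughly as follows. This is only an 
illustration of "retry with backoff, then fail after a threshold", not Spark 
code: the names PodCreationFailureTracker, PodCreationError, 
TooManyPodCreationFailures and create_executor_pod, the default values, and 
the choice to count failures cumulatively (without resetting on success) are 
all assumptions made for the example.

```python
import time
from dataclasses import dataclass


class PodCreationError(Exception):
    """Hypothetical error raised when a pod could not be created."""


class TooManyPodCreationFailures(RuntimeError):
    """Hypothetical error that would cause the application to fail."""


@dataclass
class PodCreationFailureTracker:
    # All defaults are illustrative assumptions, not Spark settings.
    max_failures: int = 5
    base_delay_s: float = 1.0
    max_delay_s: float = 60.0
    failures: int = 0

    def record_failure(self) -> float:
        """Count one failure and return the delay before the next attempt.

        Raises TooManyPodCreationFailures once the count exceeds the threshold.
        """
        self.failures += 1
        if self.failures > self.max_failures:
            raise TooManyPodCreationFailures(
                f"{self.failures} pod creation failures exceed the "
                f"threshold of {self.max_failures}"
            )
        # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay_s.
        return min(self.base_delay_s * 2 ** (self.failures - 1), self.max_delay_s)


def create_executor_pod(create_pod, tracker, sleep=time.sleep):
    """Call create_pod until it succeeds or the tracker gives up."""
    while True:
        try:
            return create_pod()
        except PodCreationError:
            # Raises instead of returning once the threshold is exceeded.
            delay = tracker.record_failure()
            sleep(delay)
```

Because the same error can be either transient or permanent, the sketch does 
not try to classify errors. It treats every failure as retryable and relies on 
the threshold to surface the unrecoverable case to the user.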



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

