Yifei Huang created SPARK-24221:
-----------------------------------

             Summary: Retry spark app submission to k8 in KubernetesClientApplication
                 Key: SPARK-24221
                 URL: https://issues.apache.org/jira/browse/SPARK-24221
             Project: Spark
          Issue Type: Improvement
          Components: Kubernetes
    Affects Versions: 2.3.0
            Reporter: Yifei Huang


Following from https://issues.apache.org/jira/browse/SPARK-24135, drivers, like
executors, can suffer from init-container failures in Kubernetes. Currently, we
fail the entire application when the driver hits such a failure, so it's up to
the client to detect those errors and retry. However, since driver and executor
initialization share the same failure case, handling them separately means
repeating logic in two places. Would it be better to consolidate this retry
logic in `KubernetesClientApplication`?

We could still count executor pod initialization failures in
`KubernetesClusterSchedulerBackend` and decide what to do with the application
if there are too many failures there, but we'd be guaranteed a set number of
retries for each executor before giving up. Or would this be too confusing and
obscure the true number of retries? We could also make the number of driver and
executor retries separately configurable. It just seems that if we're tackling
init-container failure retries for executors, we should support drivers too,
since they suffer from the same problem.
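
To make the proposal concrete, here is a rough sketch of a shared retry helper
with separate driver and executor budgets. It is written in Python only for
brevity and is not Spark code: `InitContainerFailure`, `RetryLimits`,
`submit_with_retries`, and `make_flaky_pod` are all hypothetical names, and
Spark 2.3.0 has no such retry settings. It assumes an init-container failure
surfaces as an exception from whatever call creates the pod.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class InitContainerFailure(Exception):
    """Hypothetical error standing in for a pod whose init-container failed."""


@dataclass(frozen=True)
class RetryLimits:
    # Hypothetical settings: separate retry budgets for driver and executors.
    driver_max_retries: int = 3
    executor_max_retries: int = 3


def submit_with_retries(create_pod: Callable[[], T], max_retries: int) -> T:
    """Call create_pod until it succeeds or max_retries retries are used up."""
    attempt = 0
    while True:
        try:
            return create_pod()
        except InitContainerFailure:
            if attempt >= max_retries:
                raise  # budget exhausted: surface the failure to the caller
            attempt += 1


def make_flaky_pod(failures_before_success: int, name: str) -> Callable[[], str]:
    """Test double: fails a fixed number of times, then returns the pod name."""
    remaining = [failures_before_success]

    def create() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise InitContainerFailure(f"{name}: init-container failed")
        return name

    return create


limits = RetryLimits(driver_max_retries=2, executor_max_retries=1)
# The driver fails twice, then succeeds on the third attempt (two retries).
driver = submit_with_retries(make_flaky_pod(2, "driver"), limits.driver_max_retries)
```

With the same helper used for both pod types, the only difference between the
driver and executor paths is which budget is passed in. An executor that fails
twice under `executor_max_retries=1` would raise `InitContainerFailure` to the
caller, which could then count it toward an application-level failure limit.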



