Yifei Huang created SPARK-24221:
-----------------------------------
Summary: Retry spark app submission to k8 in KubernetesClientApplication
Key: SPARK-24221
URL: https://issues.apache.org/jira/browse/SPARK-24221
Project: Spark
Issue Type: Improvement
Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Yifei Huang
Following from https://issues.apache.org/jira/browse/SPARK-24135, drivers, like
executors, can suffer from init-container failures in Kubernetes. Currently, we
fail the entire application when that happens, so it's up to the client to
detect those errors and retry. However, since driver and executor
initialization share the same failure case, it seems like we're repeating
logic in two places. Would it be better to consolidate this retry logic in
`KubernetesClientApplication`?

We could still count executor pod initialization failures in
`KubernetesClusterSchedulerBackend` and decide what to do with the application
if there are too many failures there, but we'd be guaranteed a set number of
retries for each executor before giving up. Or would this be too confusing and
obscure the true number of retries? We could also make the number of driver
and executor retries separately configurable. If we're tackling init-container
failure retries for executors, we should also support drivers, since they
suffer from the same problem.
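
To make the idea concrete, here is a minimal sketch of shared retry handling with separate driver and executor limits. It is written in Python for brevity rather than Spark's Scala, and every name in it (`InitContainerFailure`, `RetryPolicy`, `submit_with_retries`, `ExecutorFailureTracker`) is hypothetical; none of them are existing Spark classes or config keys.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class InitContainerFailure(Exception):
    """Hypothetical error signalling that a pod's init-container failed."""


@dataclass
class RetryPolicy:
    # Hypothetical settings; they do not map to real Spark configuration keys.
    max_driver_retries: int = 3
    max_executor_retries: int = 3


def submit_with_retries(submit: Callable[[], T], max_retries: int) -> T:
    """Call submit(), retrying only when an init-container failure is raised."""
    attempt = 0
    while True:
        try:
            return submit()
        except InitContainerFailure:
            if attempt >= max_retries:
                raise  # retry budget exhausted: fail the application
            attempt += 1


@dataclass
class ExecutorFailureTracker:
    """Counts init failures per executor so each one gets its own budget."""

    policy: RetryPolicy
    failures: Dict[str, int] = field(default_factory=dict)

    def record_failure(self, executor_id: str) -> bool:
        """Record one failure; return True if the executor may be retried."""
        count = self.failures.get(executor_id, 0) + 1
        self.failures[executor_id] = count
        return count <= self.policy.max_executor_retries
```

In this sketch the driver submission would be wrapped as `submit_with_retries(submit_driver, policy.max_driver_retries)`, where `submit_driver` is a placeholder for whatever performs the actual submission, while the scheduler backend would call `record_failure` for each failed executor pod and give up on that executor once it returns False.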