[ https://issues.apache.org/jira/browse/SPARK-24221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-24221.
----------------------------------
    Resolution: Incomplete

> Retry spark app submission to k8 in KubernetesClientApplication
> ---------------------------------------------------------------
>
>                 Key: SPARK-24221
>                 URL: https://issues.apache.org/jira/browse/SPARK-24221
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 2.3.0
>            Reporter: Yifei Huang
>            Priority: Major
>              Labels: bulk-closed
>
> Following from https://issues.apache.org/jira/browse/SPARK-24135, drivers,
> in addition to executors, can suffer from init-container failures in
> Kubernetes. Currently, we fail the entire application in that case, so it's
> up to the client to detect those errors and retry. However, since driver
> and executor initialization share the same failure mode, it seems like
> we're repeating logic in two places. Would it be better to consolidate this
> retry logic in `KubernetesClientApplication`?
> We could still count executor pod initialization failures in
> `KubernetesClusterSchedulerBackend` and decide what to do with the
> application if there are too many failures there, but we'd be guaranteed a
> set number of retries for each executor before giving up. Or would this be
> too confusing and obfuscate the true number of retries? We could also
> configure the number of driver and executor retries separately. It just
> seems that if we're tackling init-container failure retries for executors,
> we should provide the same support for drivers, since they suffer from the
> same problem.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
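The consolidation the reporter proposes amounts to wrapping the driver submission in a bounded retry loop. A minimal sketch of that idea, assuming a hypothetical `withRetries` helper and a configurable attempt count (these names are illustrative only and not part of Spark's actual Kubernetes submission code):

```java
import java.util.function.Supplier;

public final class RetrySketch {
    /**
     * Runs the given submission attempt up to maxAttempts times,
     * rethrowing the last failure if every attempt fails.
     * In the proposal, both driver submission and executor pod
     * creation would funnel through one loop like this.
     */
    static <T> T withRetries(int maxAttempts, Supplier<T> attempt) {
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return attempt.get();
            } catch (RuntimeException e) {
                // e.g. an init-container failure surfaced by the k8s client
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        // Simulated driver submission: fails twice, then succeeds.
        int[] calls = {0};
        String result = withRetries(3, () -> {
            if (++calls[0] < 3) throw new RuntimeException("init-container failed");
            return "driver-pod-created";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The open design question in the ticket maps directly onto the `maxAttempts` parameter: a single shared value for drivers and executors keeps the logic in one place, while separate driver/executor settings avoid obfuscating how many retries each actually gets.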