Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21241#discussion_r186790719
--- Diff:
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala
---
@@ -320,50 +322,83 @@ private[spark] class
KubernetesClusterSchedulerBackend(
override def eventReceived(action: Action, pod: Pod): Unit = {
val podName = pod.getMetadata.getName
val podIP = pod.getStatus.getPodIP
-
+ val podPhase = pod.getStatus.getPhase
action match {
- case Action.MODIFIED if (pod.getStatus.getPhase == "Running"
+ case Action.MODIFIED if (podPhase == "Running"
&& pod.getMetadata.getDeletionTimestamp == null) =>
val clusterNodeName = pod.getSpec.getNodeName
logInfo(s"Executor pod $podName ready, launched at
$clusterNodeName as IP $podIP.")
executorPodsByIPs.put(podIP, pod)
- case Action.DELETED | Action.ERROR =>
+ case Action.MODIFIED if (podPhase == "Init:Error" || podPhase ==
"Init:CrashLoopBackoff")
+ && pod.getMetadata.getDeletionTimestamp == null =>
val executorId = getExecutorId(pod)
- logDebug(s"Executor pod $podName at IP $podIP was at $action.")
- if (podIP != null) {
- executorPodsByIPs.remove(podIP)
+ failedInitExecutors.add(executorId)
+ if (failedInitExecutors.size >= executorMaxInitErrors) {
+ val errorMessage = s"Aborting Spark application because
$executorMaxInitErrors" +
+ s" executors failed to start. The maximum number of allowed
startup failures is" +
+ s" $executorMaxInitErrors. Please contact your cluster
administrator or increase" +
+ s" your setting of
${KUBERNETES_EXECUTOR_MAX_INIT_ERRORS.key}."
+ logError(errorMessage)
+
KubernetesClusterSchedulerBackend.this.scheduler.sc.stopInNewThread()
--- End diff --
The problem is that executors that fail to start up won't create task
failures, meaning that they will not cause the job(s) to stop. Additionally,
executors failing to start up will affect every job on the process. So what I
think we want to do is when we fail with these initialization errors too many
times, is to shut down the Spark context and crash all of the jobs. The
alternative is that we stop requesting new executors entirely, but this state
isn't desirable because for example you could have your Spark application
running with 0 executors, meaning that your application can be stuck for
seemingly no reason.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]