Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

via GitHub Sun, 12 Nov 2023 23:02:55 -0800


yaooqinn commented on code in PR #43746:
URL: https://github.com/apache/spark/pull/43746#discussion_r1390683182



##########
core/src/main/scala/org/apache/spark/scheduler/cluster/SchedulerBackendUtils.scala:
##########
@@ -44,4 +44,12 @@ private[spark] object SchedulerBackendUtils {
       conf.get(EXECUTOR_INSTANCES).getOrElse(numExecutors)
     }
   }
+
+  def getMaxTargetExecutorNumber(conf: SparkConf): Int = {

Review Comment:
   Hmm, the code above is unreliable. For instance, a user may configure a 
non-streaming application with DYN_ALLOCATION_MAX_EXECUTORS and then run it on 
a copy of a system configuration that sets 
STREAMING_DYN_ALLOCATION_MAX_EXECUTORS. The result would surprise users.
   
   Anyway, I will follow the code above.
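
The scenario described in this comment can be illustrated with a minimal, self-contained sketch. It is not the code from the PR or from `SchedulerBackendUtils`: the object `MaxExecutorsSketch`, the function `resolveMaxExecutors`, the use of a plain `Map` in place of `SparkConf`, and the resolution order (streaming key first) are all assumptions made for illustration. Only the two configuration key names are taken from Spark itself.

```scala
// Illustrative sketch only; a plain Map stands in for SparkConf.
object MaxExecutorsSketch {
  // Key behind DYN_ALLOCATION_MAX_EXECUTORS.
  val DynMax = "spark.dynamicAllocation.maxExecutors"
  // Key behind STREAMING_DYN_ALLOCATION_MAX_EXECUTORS.
  val StreamingDynMax = "spark.streaming.dynamicAllocation.maxExecutors"

  // Assumed (hypothetical) resolution order: the streaming key wins
  // whenever it is present, otherwise fall back to the core key,
  // otherwise treat the limit as unbounded.
  def resolveMaxExecutors(conf: Map[String, String]): Int =
    conf.get(StreamingDynMax)
      .orElse(conf.get(DynMax))
      .map(_.toInt)
      .getOrElse(Int.MaxValue)

  def main(args: Array[String]): Unit = {
    // A copied system configuration that happens to set the streaming key.
    val systemDefaults = Map(StreamingDynMax -> "5")
    // The user of a non-streaming application sets only the core key.
    val userConf = systemDefaults + (DynMax -> "100")
    // The user expects 100, but the streaming value is picked up instead.
    println(resolveMaxExecutors(userConf)) // prints 5
  }
}
```

Under this assumed ordering, the user's explicit setting of 100 is silently overridden by a value inherited from the copied configuration, which is the kind of surprise the comment warns about.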
   



##########
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala:
##########
@@ -142,7 +142,8 @@ class ExecutorPodsAllocator(
     }
     snapshotsStore.addSubscriber(podAllocationDelay) { executorPodsSnapshot =>
       onNewSnapshots(applicationId, schedulerBackend, executorPodsSnapshot)
-      if (failureTracker.numFailedExecutors > maxNumExecutorFailures) {
+      if (getNumExecutorsFailed > maxNumExecutorFailures &&
+          schedulerBackend.insufficientResourcesRetained()) {
         logError(s"Max number of executor failures ($maxNumExecutorFailures) reached")

Review Comment:
   OK



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


