tgravescs commented on a change in pull request #30795:
URL: https://github.com/apache/spark/pull/30795#discussion_r545903441



##########
File path: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
##########
@@ -501,11 +501,16 @@ private[spark] class ExecutorAllocationManager(
    */
   private def addExecutors(maxNumExecutorsNeeded: Int, rpId: Int): Int = {
     val oldNumExecutorsTarget = numExecutorsTargetPerResourceProfileId(rpId)
+    // Increase the maxNumExecutors by adding the excluded executors so that 
manager can
+    // launch new executors to replace the excluded executors.
+    val exclude = executorMonitor.excludedExecutorCount
+    val maxOverheadExecutors = maxNumExecutors + exclude

Review comment:
       so I don't agree with this, at least not how its defined.  The user 
defined the maximum number of executors to use, this is getting more than that. 
I realize that some are excluded, but this also comes down to a resource 
utilization question as well. If I am in multi-tenant environment, I want to 
make sure 1 job doesn't take over the entire cluster. max is one way to do 
this.  I think we would either need to redefine this, which isn't great for 
backwards compatibility and could result in unexpected behavior or we add 
another config that is around the excluded nodes. this would either just be an 
allow to go over or a allow to go over by X. The downside to this is default 
would be 0 or false so you would have to configure if you do set max and want 
to use this feature. But I don't see a lot of jobs setting max unless they are 
trying to be nice in multi-tenant so it seems ok as long as its in release 
notes, etc.
   
     you will notice the other logic for unschedulableTaskSets does not 
increase this, just increases the number we ask for.

##########
File path: 
core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala
##########
@@ -516,6 +573,8 @@ private[spark] class ExecutorMonitor(
     var pendingRemoval: Boolean = false
     var decommissioning: Boolean = false
     var hasActiveShuffle: Boolean = false
+    // whether the executor is temporarily excluded by the `HealthTracker`

Review comment:
       we should expand this to state excluded for entire application (I 
realize HealthTracker implies this but would like to be more explicit), does 
not include excluded within the stage level.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to