tgravescs commented on a change in pull request #27313: [SPARK-29148][CORE] Add 
stage level scheduling dynamic allocation and scheduler backend changes
URL: https://github.com/apache/spark/pull/27313#discussion_r375924289
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala
 ##########
 @@ -423,21 +510,30 @@ private[spark] class ExecutorAllocationManager(
    */
   private def removeExecutors(executors: Seq[String]): Seq[String] = synchronized {
     val executorIdsToBeRemoved = new ArrayBuffer[String]
-
     logDebug(s"Request to remove executorIds: ${executors.mkString(", ")}")
-    val numExistingExecutors = executorMonitor.executorCount - executorMonitor.pendingRemovalCount
-
-    var newExecutorTotal = numExistingExecutors
+    val numExecutorsTotalPerRpId = mutable.Map[Int, Int]()
     executors.foreach { executorIdToBeRemoved =>
-      if (newExecutorTotal - 1 < minNumExecutors) {
-        logDebug(s"Not removing idle executor $executorIdToBeRemoved because there are only " +
-          s"$newExecutorTotal executor(s) left (minimum number of executor limit $minNumExecutors)")
-      } else if (newExecutorTotal - 1 < numExecutorsTarget) {
-        logDebug(s"Not removing idle executor $executorIdToBeRemoved because there are only " +
-          s"$newExecutorTotal executor(s) left (number of executor target $numExecutorsTarget)")
+      val rpId = getResourceProfileIdOfExecutor(executorIdToBeRemoved)
+      if (rpId == UNKNOWN_RESOURCE_PROFILE_ID) {
+        logWarning(s"Not removing executor $executorIdToBeRemoved because couldn't find " +
+          "ResourceProfile for it!")
 Review comment:
   If events arrive out of order in the ExecutorMonitor, we aren't sure which resource profile an executor belongs to; the same is true if the monitor simply doesn't know about the executor. In both of those cases it returns UNKNOWN. When I wrote this code I wasn't expecting that to happen normally, but now that I look more closely I think there is a possible race: a remove event could arrive at the same time we calculate the timed-out executors. The remove event would cause the executor monitor to drop the executor from its list, and you would then get UNKNOWN back. The timed-out-executor calculation, which calls into this removeExecutors, takes a snapshot of the executors, so it could return one that has since been removed. I think I'll modify this to pass the ResourceProfile in with the list of executors to remove; then we don't have to worry about that race. Thanks for bringing this up.
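   A rough, self-contained sketch of that idea (the names and signatures here are hypothetical, not the actual Spark internals): the caller resolves each executor's ResourceProfile id when it takes the snapshot and passes it along, so this method never has to look the profile up and can't observe UNKNOWN:

```scala
// Hypothetical sketch: the caller supplies the ResourceProfile id it captured
// when it snapshotted the executors, so no lookup races with a remove event.
object RemovalSketch {
  def removeExecutors(
      executorsAndRpIds: Seq[(String, Int)],   // (executorId, resourceProfileId) pairs
      currentCountPerRpId: Map[Int, Int],      // live executor count per profile id
      minPerRpId: Int => Int): Seq[String] = { // assumed per-profile minimum config
    val remaining = scala.collection.mutable.Map(currentCountPerRpId.toSeq: _*)
    val toRemove = scala.collection.mutable.ArrayBuffer[String]()
    executorsAndRpIds.foreach { case (execId, rpId) =>
      val left = remaining.getOrElse(rpId, 0)
      if (left - 1 < minPerRpId(rpId)) {
        // Keep it: removing would drop this profile below its minimum.
      } else {
        remaining(rpId) = left - 1
        toRemove += execId
      }
    }
    toRemove.toSeq
  }
}
```

   For example, with three executors in profile 0, one in profile 1, and a minimum of one per profile, `removeExecutors(Seq(("e1", 0), ("e2", 0), ("e3", 1)), Map(0 -> 3, 1 -> 1), _ => 1)` removes `e1` and `e2` but keeps `e3`.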
   
   So currently we are not cleaning up when all resource profiles have been GC'ed; the executors will just idle-timeout down to the configured minimum. Note, I think you saw that discussion, but we also want to make the initial/minimum/max configs settable per resource profile; I was going to do that in a follow-on as well.
   Many of the data structures also do not remove resource profiles, because it's not completely obvious when they are done being used. My thought was that the number of ResourceProfiles is expected to be small, so I tried to keep the data structures' memory usage small and figured we can improve this in follow-on JIRAs.
   
   I was thinking the main initial use case is the ETL -> ML type pipeline, where you would only have a couple of ResourceProfiles.
   
   If you think it's needed now, I can definitely start looking into this more; just let me know your thoughts. I filed https://issues.apache.org/jira/browse/SPARK-30749 and https://issues.apache.org/jira/browse/SPARK-30750 to specifically track doing that.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
