[GitHub] [spark] tgravescs commented on a change in pull request #25047: [WIP][SPARK-27371][CORE] Support GPU-aware resources scheduling in Standalone

GitBox Mon, 29 Jul 2019 08:29:20 -0700

tgravescs commented on a change in pull request #25047: 
[WIP][SPARK-27371][CORE] Support GPU-aware resources scheduling in Standalone
URL: https://github.com/apache/spark/pull/25047#discussion_r308294072


 ##########
 File path: core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala
 ##########
 @@ -110,6 +122,16 @@ private[deploy] object DeployMessages {
 
   case class ReconnectWorker(masterUrl: String) extends DeployMessage
 
+  /**
+   * Ask the worker to release the indicated resources in ALLOCATED_RESOURCES_JSON_FILE
+   * @param pid process id of the target worker
+   * @param toRelease the resources expected to release
+   */
+  case class ReleaseResources(
 
 Review comment:
   You are saying a worker is still alive but something caused it to disconnect from the Master.
   
   So there are 2 cases:
   1) Another worker exists on that node and we can send this message to it to try to free those resources. But what if this Worker is still alive and whatever it's running is still using those resources, like an ML algo using the GPU? Do we really want the other worker to say they're free if they're potentially being used?
   2) No other worker exists on that node, so this message isn't sent anyway.
   
   There are obviously many ways the worker could disconnect. If the process is still in a sane state, it will either reregister with the master or eventually fail to contact the master and release the resources at that point. I think in both of those cases the Executors should have been killed, or should have similarly failed due to too many heartbeat failures, correct? At least the master was told they were lost. So in those cases, if we just wait, the resources should be cleaned up without us fixing anything.
   
   The other case is that the worker isn't alive; that would be covered by our PID checks.
   
   The last case is that the worker and possibly its executors are in some really bad state, such that they are alive but aren't connected and heartbeat failures aren't causing them to exit. In this case I would say it's really hard to tell whether they are still using the resources, and I'm not sure we want to reclaim them automatically. It might be best to leave them allocated until someone can clean them up.
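   
   To make the cases concrete, here is a rough sketch of the policy I'm describing. Every name in it (`WorkerState`, `Action`, `ReleasePolicy`) is hypothetical and does not exist in Spark or in this PR; it only illustrates the reasoning and is not a proposed implementation.

```scala
// Hypothetical model of the disconnect cases discussed above.
// None of these types exist in Spark; they are for illustration only.
sealed trait WorkerState
// Sane process that re-registers with the master.
case object Reconnecting extends WorkerState
// Sane process that eventually fails to contact the master and releases its resources.
case object GaveUpOnMaster extends WorkerState
// The worker process no longer exists.
case object ProcessDead extends WorkerState
// Alive but disconnected, and heartbeat failures are not causing it to exit.
case object AliveButWedged extends WorkerState

sealed trait Action
// Do nothing: the worker cleans up (or keeps) its own resources.
case object WaitForSelfCleanup extends Action
// The existing PID checks detect the dead process and free the resources.
case object ReclaimViaPidCheck extends Action
// Too risky to reclaim automatically: leave allocated for manual cleanup.
case object LeaveAllocated extends Action

object ReleasePolicy {
  def decide(state: WorkerState): Action = state match {
    case Reconnecting | GaveUpOnMaster => WaitForSelfCleanup
    case ProcessDead                   => ReclaimViaPidCheck
    case AliveButWedged                => LeaveAllocated
  }
}
```

   Note that no branch ends up needing another worker to release resources on behalf of one that is still alive, which is the part I'm questioning.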
   
