tgravescs commented on a change in pull request #25047: 
[WIP][SPARK-27371][CORE] Support GPU-aware resources scheduling in Standalone
URL: https://github.com/apache/spark/pull/25047#discussion_r309354579
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala
 ##########
 @@ -110,6 +122,16 @@ private[deploy] object DeployMessages {
 
   case class ReconnectWorker(masterUrl: String) extends DeployMessage
 
+  /**
+   * Ask the worker to release the indicated resources in 
ALLOCATED_RESOURCES_JSON_FILE
+   * @param pid process id of the target worker
+   * @param toRelease the resources expected to be released
+   */
+  case class ReleaseResources(
 
 Review comment:
   Maybe I missed it, but I do see the Master calling `releaseResourcesIfPossible`, which checks from the Master side whether the worker is alive. However, the Master might think the worker is not alive after a network disconnect while, on the Worker side, it is actually still up. In the normal case the Worker will exit because of the heartbeat timeout and release its own resources (so it doesn't need this message); if it doesn't exit for some reason, then the Worker and its executors could still be alive and using the resources. In that edge case I would propose we just leave it alone and not release the resources. I would hope a cluster admin would notice this and go figure out what was going on.
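
   To make the proposal concrete, here is a hypothetical sketch of the Master-side check described above. The names `WorkerInfo`, `confirmedExited`, and the simplified `releaseResourcesIfPossible` are stand-ins for illustration, not the actual Spark implementation: resources are released only when the Master is certain the worker exited, and are left allocated in the network-partition edge case.

```scala
// Hypothetical sketch (NOT the actual Spark code) of the liveness rule
// proposed in this comment: only release a worker's resources when the
// Master has confirmed the worker process actually exited. If the Master
// merely lost contact (network partition), the Worker and its executors
// may still be running and using the GPUs, so we leave them allocated.
object WorkerState extends Enumeration {
  val ALIVE, DEAD = Value
}

case class WorkerInfo(
    id: String,
    state: WorkerState.Value,
    allocatedGpus: Set[String])

object MasterSketch {
  // Returns the GPU addresses that should remain marked as allocated
  // for this worker after the check.
  def releaseResourcesIfPossible(
      worker: WorkerInfo,
      confirmedExited: Boolean): Set[String] = {
    if (worker.state == WorkerState.DEAD && confirmedExited) {
      // Worker is known to be gone: safe to free everything.
      Set.empty
    } else {
      // Either still alive, or DEAD only from the Master's point of view
      // (e.g. a network disconnect): keep the resources allocated and
      // let a cluster admin investigate.
      worker.allocatedGpus
    }
  }
}
```

Under this rule, a worker that missed heartbeats but was never confirmed dead keeps its allocation, which matches the "just let it be" proposal above.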

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
