tgravescs commented on a change in pull request #25047:
[WIP][SPARK-27371][CORE] Support GPU-aware resources scheduling in Standalone
URL: https://github.com/apache/spark/pull/25047#discussion_r309354579
##########
File path: core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala
##########
@@ -110,6 +122,16 @@ private[deploy] object DeployMessages {
case class ReconnectWorker(masterUrl: String) extends DeployMessage
+ /**
+  * Ask the worker to release the indicated resources in ALLOCATED_RESOURCES_JSON_FILE
+  * @param pid process id of the target worker
+  * @param toRelease the resources expected to be released
+  */
+ case class ReleaseResources(
Review comment:
Maybe I missed it, but I do see the Master calling
`releaseResourcesIfPossible`, which checks from the Master's side whether the
worker is alive. However, the Master might think the worker is dead after a
network disconnect while, on the Worker's side, the process is still up. In
the normal case the Worker exits due to the heartbeat timeout and releases its
own resources (so it doesn't need this message); if it fails to exit for some
reason, the Worker and its executors could still be alive and using the
resources. In that very narrow edge case I'd propose we just leave it alone
and not release the resources. I would hope the cluster admin would notice
this and go figure out what was going on.
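The decision rule proposed above could be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: only `releaseResourcesIfPossible` is a name taken from the review, and the `WorkerState`/`WorkerInfo` shapes here are simplified stand-ins for Spark's real deploy-side types.

```scala
// Hypothetical sketch of the proposed policy: the Master reclaims a worker's
// resources only when the worker is confirmed DEAD. If the Master merely
// lost contact (the worker process may still be running and holding GPUs),
// it does NOT release, leaving the situation for the cluster admin.
object WorkerState extends Enumeration {
  val ALIVE, DEAD, UNKNOWN = Value
}

case class WorkerInfo(id: String, state: WorkerState.Value)

def releaseResourcesIfPossible(worker: WorkerInfo): Boolean =
  worker.state match {
    case WorkerState.DEAD => true  // confirmed gone: safe to reclaim resources
    case _                => false // possibly still up: leave resources alone
  }

// Confirmed-dead worker: resources are reclaimed.
println(releaseResourcesIfPossible(WorkerInfo("worker-1", WorkerState.DEAD)))
// Network-partition case: state is UNKNOWN, so nothing is released.
println(releaseResourcesIfPossible(WorkerInfo("worker-2", WorkerState.UNKNOWN)))
```

The point of the sketch is that the release path is gated on a positive confirmation of death, not on the mere absence of a heartbeat.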
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]