tgravescs commented on a change in pull request #25047:
[WIP][SPARK-27371][CORE] Support GPU-aware resources scheduling in Standalone
URL: https://github.com/apache/spark/pull/25047#discussion_r308294072
##########
File path: core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala
##########
@@ -110,6 +122,16 @@ private[deploy] object DeployMessages {
case class ReconnectWorker(masterUrl: String) extends DeployMessage
+ /**
+ * Ask the worker to release the indicated resources in ALLOCATED_RESOURCES_JSON_FILE
+ * @param pid process id of the target worker
+ * @param toRelease the resources expected to release
+ */
+ case class ReleaseResources(
Review comment:
You are saying a worker is still alive but something caused the worker to
disconnect from the Master.
So there are 2 cases:
1) Another worker exists and we can send this message to it to try to free
those resources. But what if this Worker is still alive and whatever it's
running is still using those resources, like an ML algo using the GPU? Do we
really want the other worker to say they're free if they're potentially being
used?
2) No other worker exists on that node, so this message isn't sent anyway.
There are obviously many ways the worker could disconnect. If the process
is still in a sane state then it will either reregister with the master or
eventually fail to contact the master and release the resources at that point. I
think in both of those cases the Executors should have been killed or similarly
failed due to too many heartbeat failures, correct? At least the master was
told they were lost. So in those cases, if we just wait, the resources
should be cleaned up without us fixing anything. The other case is that the
worker isn't alive; this would be covered by our PID checks. The last case is
that the worker and possibly executors are in some really bad state such that
they are alive but aren't connected and heartbeat failures aren't causing them
to exit. In this case I would say it's really hard to say whether they are
using the resources, and I'm not sure we want to reclaim them automatically. It
might be best to leave them still allocated until someone can clean them up.
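
To make the cases above concrete, here is a rough sketch of the decision I
have in mind. It is only an illustration: WorkerState, Action, ReleasePolicy
and all of their members are hypothetical names made up for this comment, not
anything that exists in Spark or in this PR.

```scala
// Hypothetical model of the disconnect cases discussed above.
// None of these types exist in Spark; they only illustrate the reasoning.
sealed trait WorkerState
case object Reregistering extends WorkerState   // sane: will reregister with the master
case object GaveUpOnMaster extends WorkerState  // sane: could not reach the master, releases resources itself
case object ProcessDead extends WorkerState     // the worker process is gone
case object AliveButWedged extends WorkerState  // alive, disconnected, heartbeat failures don't make it exit

sealed trait Action
case object WaitForSelfCleanup extends Action   // do nothing, resources get released without our help
case object ReclaimViaPidCheck extends Action   // the existing PID checks cover this
case object LeaveAllocated extends Action       // don't reclaim automatically, wait for manual cleanup

object ReleasePolicy {
  // Maps each disconnect case to the action argued for in this comment.
  def decide(state: WorkerState): Action = state match {
    case Reregistering | GaveUpOnMaster => WaitForSelfCleanup
    case ProcessDead                    => ReclaimViaPidCheck
    case AliveButWedged                 => LeaveAllocated
  }
}
```

Note that no branch asks a second worker to release the resources on behalf of
the first one, which is why I'm not sure ReleaseResources is needed at all.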