Ngone51 commented on issue #25047: [WIP][SPARK-27371][CORE] Support GPU-aware 
resources scheduling in Standalone
URL: https://github.com/apache/spark/pull/25047#issuecomment-509645441
 
 
   > How do you know the client is running on a node with GPU's or a worker? I 
guess as long as location is the same it doesn't matter.
   
   Yes, you're right. Currently I use `SPARK_HOME/spark_resources`, but this may need to change because of the user's permissions on `SPARK_HOME`, as you pointed out.
   
   > It seems unreliable to assume you have multiple workers per node (for the 
case a worker crashes). 
   
   In this PR, we have two ways to prevent a resource leak (I suppose that's your concern, right?) when a worker crashes:
   1. We register a signal handler (via [SignalUtils](https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/core/src/main/scala/org/apache/spark/util/SignalUtils.scala)) for the TERM signal (whether sent by `kill worker-pid` or by `stop-slave.sh`) on each Worker. The handler lets the worker release its entries in the allocated-resources file before it exits. (Note that SIGKILL from `kill -9` cannot be caught, so that case is covered by the second mechanism.)
   
   2. The Master detects a worker's crash after it stops receiving heartbeats from that worker for a configured timeout. Once the Master knows a worker has crashed, it randomly selects another healthy worker on the same host to release the crashed worker's allocated resources.
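   To make the first mechanism concrete, here is a minimal sketch in Python (not the actual Scala implementation in this PR): the file path, JSON layout, and function names are illustrative assumptions, but the idea matches the description above, on TERM the worker removes its own entries from the shared allocated-resources file before exiting.

   ```python
   import json
   import os
   import signal
   import sys
   import tempfile

   # Hypothetical location and layout of the shared allocated-resources file;
   # the PR keeps it under SPARK_HOME, this path is for illustration only.
   RESOURCES_FILE = os.path.join(tempfile.gettempdir(), "spark_resources.json")

   def release_resources(worker_id):
       """Remove this worker's allocations from the shared resources file."""
       if not os.path.exists(RESOURCES_FILE):
           return
       with open(RESOURCES_FILE) as f:
           allocations = json.load(f)
       allocations.pop(worker_id, None)
       with open(RESOURCES_FILE, "w") as f:
           json.dump(allocations, f)

   def install_term_handler(worker_id):
       """Release resources, then exit, when SIGTERM arrives
       (e.g. from `kill worker-pid` or `stop-slave.sh`)."""
       def handler(signum, frame):
           release_resources(worker_id)
           sys.exit(0)
       signal.signal(signal.SIGTERM, handler)
   ```

   Note that only catchable signals (like SIGTERM) can trigger this cleanup; SIGKILL bypasses the handler entirely, which is why the Master-side timeout below is still needed.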
   

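The second mechanism can be sketched the same way. This toy model (again not the PR's Scala code; the class and timeout value are assumptions) shows the Master noticing a missed-heartbeat timeout and pairing each crashed worker with a random healthy worker on the same host for cleanup:

```python
import random

HEARTBEAT_TIMEOUT_SECS = 60  # hypothetical; the real timeout is configurable

class Master:
    """Toy model of crash detection plus same-host cleanup selection."""

    def __init__(self):
        self.last_heartbeat = {}  # worker_id -> last heartbeat timestamp
        self.worker_host = {}     # worker_id -> host it runs on

    def register(self, worker_id, host, now):
        self.worker_host[worker_id] = host
        self.last_heartbeat[worker_id] = now

    def heartbeat(self, worker_id, now):
        self.last_heartbeat[worker_id] = now

    def cleanup_assignments(self, now):
        """For each timed-out worker, pick a random healthy worker on the
        same host to release the crashed worker's allocated resources."""
        crashed = [w for w, t in self.last_heartbeat.items()
                   if now - t > HEARTBEAT_TIMEOUT_SECS]
        pairs = []
        for dead in crashed:
            peers = [w for w in self.last_heartbeat
                     if w not in crashed
                     and self.worker_host[w] == self.worker_host[dead]]
            if peers:
                pairs.append((dead, random.choice(peers)))
        return pairs
```

If no healthy worker remains on that host, the sketch simply skips the pair; the real design would need some fallback (e.g. cleanup on the worker's next restart) for that case.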
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
