xuanyuanking opened a new pull request #24350: [SPARK-27348][Core] 
HeartbeatReceiver should remove lost executors from 
CoarseGrainedSchedulerBackend
URL: https://github.com/apache/spark/pull/24350
 
 
   ## What changes were proposed in this pull request?
   
   When a heartbeat timeout occurs, HeartbeatReceiver does not remove the 
lost executor from CoarseGrainedSchedulerBackend. If an executor's connection 
is not shut down gracefully, CoarseGrainedSchedulerBackend may never receive a 
disconnect event, so it continues to treat the lost executor as alive and may 
ask TaskScheduler to run tasks on it. Such a task never finishes, and the job 
hangs forever. This PR makes HeartbeatReceiver remove lost executors from 
CoarseGrainedSchedulerBackend when the heartbeat times out.
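
   The failure mode can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark code: `ToyBackend`, `ToyHeartbeatReceiver`, `simulate`, and the `notify_backend` flag are hypothetical names invented for illustration, and the timeout values are arbitrary.

```python
class ToyBackend:
    """Hypothetical stand-in for a scheduler backend tracking live executors."""

    def __init__(self):
        self.alive = set()

    def register(self, executor_id):
        self.alive.add(executor_id)

    def remove(self, executor_id):
        self.alive.discard(executor_id)

    def pick_executor(self):
        # Deterministic choice for the demo: lowest executor id still believed alive.
        return min(self.alive) if self.alive else None


class ToyHeartbeatReceiver:
    """Hypothetical stand-in for a heartbeat tracker with an expiry timeout."""

    def __init__(self, backend, timeout, notify_backend):
        self.backend = backend
        self.timeout = timeout
        self.notify_backend = notify_backend  # False models the behavior before the fix
        self.last_seen = {}

    def heartbeat(self, executor_id, now):
        self.last_seen[executor_id] = now

    def expire(self, now):
        expired = [e for e, t in self.last_seen.items() if now - t > self.timeout]
        for e in expired:
            del self.last_seen[e]
            if self.notify_backend:
                # The idea of the fix: also tell the backend the executor is gone.
                self.backend.remove(e)
        return expired


def simulate(notify_backend):
    backend = ToyBackend()
    receiver = ToyHeartbeatReceiver(backend, timeout=10, notify_backend=notify_backend)
    for e in ("exec-1", "exec-2"):
        backend.register(e)
        receiver.heartbeat(e, now=0)
    # exec-1 dies without a disconnect event; only exec-2 keeps sending heartbeats.
    receiver.heartbeat("exec-2", now=15)
    receiver.expire(now=20)
    return backend.pick_executor()
```

   In this model, `simulate(False)` still offers the lost executor `exec-1` for scheduling, while `simulate(True)` offers only `exec-2`.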
   
   ## How was this patch tested?
   
   Added a unit test in HeartbeatReceiverSuite.
   
