[ https://issues.apache.org/jira/browse/SPARK-27348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shixiong Zhu updated SPARK-27348:
---------------------------------
Description: When a heartbeat timeout happens in HeartbeatReceiver, the lost
executor is not removed from CoarseGrainedSchedulerBackend. When a connection
is not gracefully shut down, CoarseGrainedSchedulerBackend may not receive a
disconnect event. In this case, CoarseGrainedSchedulerBackend still considers
the lost executor alive and may ask TaskScheduler to run tasks on it. Such a
task will never finish, and the job will hang forever. (was: When a heartbeat timeout happens in
HeartbeatReceiver, it doesn't remove lost executors from
CoarseGrainedSchedulerBackend. When a connection is gracefully shut down,
CoarseGrainedSchedulerBackend will not receive a disconnect event. In this
case, CoarseGrainedSchedulerBackend still thinks a lost executor is still
alive. CoarseGrainedSchedulerBackend may ask TaskScheduler to run tasks on this
lost executor. This task will never finish and the job will hang forever.)
> HeartbeatReceiver doesn't remove lost executors from
> CoarseGrainedSchedulerBackend
> ----------------------------------------------------------------------------------
>
> Key: SPARK-27348
> URL: https://issues.apache.org/jira/browse/SPARK-27348
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Shixiong Zhu
> Priority: Major
>
> When a heartbeat timeout happens in HeartbeatReceiver, the lost executor is
> not removed from CoarseGrainedSchedulerBackend. When a connection is not
> gracefully shut down, CoarseGrainedSchedulerBackend may not receive a
> disconnect event. In this case, CoarseGrainedSchedulerBackend still considers
> the lost executor alive and may ask TaskScheduler to run tasks on it. Such a
> task will never finish, and the job will hang forever.
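
The failure mode described above can be illustrated with a small toy model.
This is only a sketch under stated assumptions, not Spark code: ToyBackend and
ToyHeartbeatReceiver are hypothetical stand-ins for
CoarseGrainedSchedulerBackend and HeartbeatReceiver, the 120-second timeout is
an illustrative value, and the notify_backend flag shows one possible remedy
rather than the fix actually applied for this issue.

```python
class ToyBackend:
    """Hypothetical stand-in for CoarseGrainedSchedulerBackend."""

    def __init__(self):
        self.alive = set()  # executors the backend believes are alive

    def register(self, executor_id):
        self.alive.add(executor_id)

    def on_disconnected(self, executor_id):
        # Normally triggered by a disconnect event; never fires if the
        # connection was not shut down gracefully.
        self.alive.discard(executor_id)

    def offer_executors(self):
        # Executors the backend would offer to the task scheduler.
        return sorted(self.alive)


class ToyHeartbeatReceiver:
    """Hypothetical stand-in for HeartbeatReceiver."""

    def __init__(self, backend, timeout_s, notify_backend):
        self.backend = backend
        self.timeout_s = timeout_s
        self.notify_backend = notify_backend  # assumption: models a possible fix
        self.last_seen = {}

    def heartbeat(self, executor_id, now):
        self.last_seen[executor_id] = now

    def expire_dead_hosts(self, now):
        expired = [e for e, t in self.last_seen.items()
                   if now - t > self.timeout_s]
        for e in expired:
            del self.last_seen[e]
            if self.notify_backend:
                self.backend.on_disconnected(e)
        return expired


def simulate(notify_backend):
    backend = ToyBackend()
    receiver = ToyHeartbeatReceiver(backend, timeout_s=120,
                                    notify_backend=notify_backend)
    for e in ("exec-1", "exec-2"):
        backend.register(e)
        receiver.heartbeat(e, now=0)
    # exec-1 keeps reporting; exec-2 goes silent without a disconnect event.
    receiver.heartbeat("exec-1", now=100)
    expired = receiver.expire_dead_hosts(now=150)
    return expired, backend.offer_executors()


if __name__ == "__main__":
    print(simulate(notify_backend=False))
    print(simulate(notify_backend=True))
```

With notify_backend=False, the receiver expires exec-2 but the backend keeps
offering it, which corresponds to the hang described in this issue. With
notify_backend=True, the backend drops exec-2 as soon as the timeout fires.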