[ 
https://issues.apache.org/jira/browse/SPARK-39984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575500#comment-17575500
 ] 

Apache Spark commented on SPARK-39984:
--------------------------------------

User 'kevin85421' has created a pull request for this issue:
https://github.com/apache/spark/pull/37411

> Check workerLastHeartbeat with master before HeartbeatReceiver expires an 
> executor
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-39984
>                 URL: https://issues.apache.org/jira/browse/SPARK-39984
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.4.0
>            Reporter: Kai-Hsun Chen
>            Priority: Major
>
> Currently, the driver’s HeartbeatReceiver expires an executor if it does
> not receive a heartbeat from that executor for 120 seconds. 120 seconds is
> too long, but simply lowering the timeout threshold raises other
> challenges. For example, while an executor is performing GC, it cannot
> reply to any message, so a shorter timeout risks expiring an executor that
> is still healthy.
>  
> Hence, this PR aims to provide a way to lower the timeout. Workers send
> heartbeats to the master periodically, so if HeartbeatReceiver asks the
> master for the latest heartbeat from the worker that the executor is
> running on, it can determine whether the heartbeat loss is caused by
> network issues or by other issues (e.g. GC). If the heartbeat loss is not
> caused by network issues, HeartbeatReceiver puts the executor into a
> waitingList rather than expiring it immediately.
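
The decision described in the issue can be sketched roughly as follows. This is only an illustration in Python, not the code from the pull request: the class name, field names, and both timeout parameters are hypothetical, and the lookup of the worker's last heartbeat is modeled as a plain dictionary standing in for a query to the master. What eventually happens to executors on the waiting list is not covered by the description above, so the sketch leaves it out.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatReceiverSketch:
    """Hypothetical model of the expiry decision; not Spark's actual class."""

    executor_timeout_s: float  # assumed lowered executor heartbeat timeout
    worker_timeout_s: float    # assumed bound for a "recent" worker heartbeat
    executor_last_seen: dict = field(default_factory=dict)     # executor id -> time
    executor_worker: dict = field(default_factory=dict)        # executor id -> worker id
    worker_last_heartbeat: dict = field(default_factory=dict)  # stand-in for the master
    waiting_list: set = field(default_factory=set)
    expired: set = field(default_factory=set)

    def check(self, now: float) -> None:
        for exec_id, last_seen in self.executor_last_seen.items():
            if exec_id in self.expired:
                continue
            if now - last_seen <= self.executor_timeout_s:
                # Executor heartbeat is recent enough; nothing to do.
                self.waiting_list.discard(exec_id)
                continue
            # Executor heartbeat is overdue: consult the worker's last
            # heartbeat as recorded by the master.
            worker_id = self.executor_worker.get(exec_id)
            worker_seen = self.worker_last_heartbeat.get(worker_id)
            worker_alive = (
                worker_seen is not None
                and now - worker_seen <= self.worker_timeout_s
            )
            if worker_alive:
                # Worker is still reachable, so the loss is probably not a
                # network issue (e.g. GC): hold the executor instead.
                self.waiting_list.add(exec_id)
            else:
                # Worker is silent as well: treat as a network issue.
                self.waiting_list.discard(exec_id)
                self.expired.add(exec_id)
```

The point of the design is that the worker's heartbeat to the master acts as a second, independent signal: an executor that has gone quiet on a worker that is still reporting is held back instead of being expired on the first missed deadline.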



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

