[ 
https://issues.apache.org/jira/browse/SPARK-30297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999237#comment-16999237
 ] 

haiyangyu commented on SPARK-30297:
-----------------------------------

[~r...@databricks.com] [~AMateenM]

[~dongjoon]

Please take a look at this patch, thanks!

> Executor heartbeat expired cause app hung up forever
> ----------------------------------------------------
>
>                 Key: SPARK-30297
>                 URL: https://issues.apache.org/jira/browse/SPARK-30297
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0, 2.4.4
>            Reporter: haiyangyu
>            Priority: Major
>
> h3. *Background*
> If an executor is lost on the network without sending an RST or close 
> packet to the driver, the driver cannot detect the loss from the network 
> connection being closed. The driver can then only detect that the executor 
> is dead through heartbeat expiration.
> h3. *Problems*
> Heartbeat expiration is processed as follows:
>  # The executor's heartbeat expires as described above.
>  # HeartbeatReceiver calls the scheduler's executor-lost handling to 
> reschedule the tasks on this executor.
>  # HeartbeatReceiver kills the executor.
> The tasks from the dead executor can be rescheduled on that same dead 
> executor if rescheduling happens before the executor has been removed from 
> executorBackend. The driver then sends a launch-task message to this 
> executor again; the executor does not respond, and the driver cannot detect 
> this through the heartbeat because the executor is already lost on the 
> network. As a result, the tasks rescheduled on the lost executor never 
> finish, and the app hangs forever.
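
The ordering problem described above can be sketched with a small toy model. This is only an illustration under stated assumptions, not Spark code: ToyBackend, ToyScheduler, simulate and the executor/task names are all hypothetical, and the "remove first" ordering is shown only for contrast, without claiming it is what the patch does.

```python
class ToyBackend:
    """Hypothetical stand-in for the driver-side executor registry."""

    def __init__(self, executors):
        self.registered = set(executors)
        self.unreachable = set()  # executors silently lost on the network

    def offers(self):
        # Every registered executor is offered, reachable or not.
        return sorted(self.registered)

    def remove(self, executor):
        self.registered.discard(executor)


class ToyScheduler:
    """Hypothetical stand-in for the task scheduler."""

    def __init__(self, backend):
        self.backend = backend
        self.running = {}  # task -> executor
        self.pending = []

    def launch_pending(self):
        offers = self.backend.offers()
        while self.pending and offers:
            task = self.pending.pop(0)
            # Toy placement policy: always pick the first offered executor.
            self.running[task] = offers[0]

    def executor_lost(self, executor):
        # Move the lost executor's tasks back to pending, then reschedule.
        for task, where in sorted(self.running.items()):
            if where == executor:
                del self.running[task]
                self.pending.append(task)
        self.launch_pending()

    def stuck_tasks(self):
        # Tasks placed on an unreachable executor can never report back.
        return sorted(t for t, e in self.running.items()
                      if e in self.backend.unreachable)


def simulate(remove_before_reschedule):
    backend = ToyBackend(["exec-1", "exec-2"])
    scheduler = ToyScheduler(backend)
    scheduler.pending = ["task-0", "task-1"]
    scheduler.launch_pending()          # both tasks land on exec-1
    backend.unreachable.add("exec-1")   # lost without RST/close packet

    # Heartbeat expiry handling, in one of two orders.
    if remove_before_reschedule:
        backend.remove("exec-1")
        scheduler.executor_lost("exec-1")
    else:
        scheduler.executor_lost("exec-1")  # reschedules onto exec-1 again
        backend.remove("exec-1")
    return scheduler.stuck_tasks()


if __name__ == "__main__":
    print("reschedule before remove:", simulate(False))
    print("remove before reschedule:", simulate(True))
```

In this toy model, rescheduling before removal places both tasks back on the unreachable executor. Once that executor is removed, nothing triggers another reschedule, so the tasks stay stuck.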



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

