[ https://issues.apache.org/jira/browse/SPARK-30297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999237#comment-16999237 ]
haiyangyu commented on SPARK-30297:
-----------------------------------

[~r...@databricks.com] [~AMateenM] [~dongjoon] Please take a look at this patch, thanks!

> Executor heartbeat expired cause app hung up forever
> ----------------------------------------------------
>
>                 Key: SPARK-30297
>                 URL: https://issues.apache.org/jira/browse/SPARK-30297
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0, 2.4.4
>            Reporter: haiyangyu
>            Priority: Major
>
> h3. *Background*
> If an executor is lost in the network and has not sent an RST or close
> packet to the driver, the driver cannot detect the loss from the network
> connection being dropped. The driver can only detect that the executor is
> dead when its heartbeat expires.
> h3. *Problems*
> Heartbeat expiration is handled as follows:
> # The executor's heartbeat expires, as described above.
> # HeartbeatReceiver calls the scheduler's executor-lost handling to
> reschedule the tasks on this executor.
> # HeartbeatReceiver kills the executor.
> The tasks from the dead executor may be rescheduled on that same dead
> executor if rescheduling happens before the executor has been removed from
> executorBackend. The driver then sends a launch-task message to this
> executor again. The executor does not respond, and the driver cannot
> detect this through the heartbeat because the executor is already lost in
> the network. As a result, the tasks rescheduled on the lost executor never
> finish, and the app hangs forever.
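
The ordering problem described under "Problems" can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's code and not the patch attached to the issue: `ToyDriver`, `pending_removal`, `offer_resources` and the `guard` flag are hypothetical names invented for this example. It assumes, as the report describes, that a scheduling round can run between step 2 (reschedule) and step 3 (remove the executor), and that nothing reschedules the task afterwards. The `guard` flag shows one possible mitigation, which is to stop offering an executor as soon as its heartbeat has expired.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDriver:
    """Toy model of driver-side state; hypothetical, not Spark's classes."""
    # Executors the backend still considers registered.
    registered: set = field(default_factory=set)
    # Executors whose heartbeat expired but that are not yet removed.
    pending_removal: set = field(default_factory=set)
    # Tasks waiting to be (re)scheduled.
    pending_tasks: list = field(default_factory=list)
    # Maps each launched task to the executor it was sent to.
    running: dict = field(default_factory=dict)

    def heartbeat_expired(self, executor, guard):
        """Step 2: put the expired executor's tasks back in the queue."""
        if guard:
            # Hypothetical mitigation: remember that this executor is dying.
            self.pending_removal.add(executor)
        lost = [t for t, e in self.running.items() if e == executor]
        for task in lost:
            del self.running[task]
            self.pending_tasks.append(task)

    def offer_resources(self):
        """Launch every pending task on the first executor still offered."""
        usable = sorted(self.registered - self.pending_removal)
        if not usable:
            return
        while self.pending_tasks:
            task = self.pending_tasks.pop(0)
            self.running[task] = usable[0]

    def remove_executor(self, executor):
        """Step 3: the backend finally drops the executor."""
        self.registered.discard(executor)
        self.pending_removal.discard(executor)


def simulate(guard):
    """Run steps 2 and 3 with a scheduling round in between."""
    driver = ToyDriver(registered={"exec-1", "exec-2"})
    driver.running = {"task-0": "exec-1"}
    driver.heartbeat_expired("exec-1", guard)  # step 2
    driver.offer_resources()                   # runs before step 3
    driver.remove_executor("exec-1")           # step 3
    return driver.running
```

Calling `simulate(False)` leaves `task-0` assigned to `exec-1` after that executor has been removed, which corresponds to the hang in the report. Calling `simulate(True)` places the task on `exec-2` instead.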