[jira] [Commented] (SPARK-24387) Heartbeat-timeout executor is added back and used again

Rui Li (JIRA) Thu, 24 May 2018 23:28:02 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490293#comment-16490293
 ]


Rui Li commented on SPARK-24387:
--------------------------------

A snippet of the log, with some fields masked:
{noformat}
[Stage 2:======================================================>(199 + 1) / 200]18/05/20 05:37:07 WARN HeartbeatReceiver: Removing executor 1100 with no recent heartbeats: 345110 ms exceeds timeout 300000 ms
18/05/20 05:37:07 ERROR YarnClusterScheduler: Lost executor 1100 on HOSTA: Executor heartbeat timed out after 345110 ms
18/05/20 05:37:07 WARN TaskSetManager: Lost task 55.0 in stage 2.0 (TID 12080, HOSTA, executor 1100): ExecutorLostFailure (executor 1100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 345110 ms
18/05/20 05:37:07 INFO DAGScheduler: Executor lost: 1100 (epoch 2)
18/05/20 05:37:07 INFO DAGScheduler: Host added was in lost list earlier: HOSTA
18/05/20 05:37:07 INFO TaskSetManager: Starting task 55.1 in stage 2.0 (TID 12225, HOSTA, executor 1100, partition 55, PROCESS_LOCAL, 6227 bytes)
{noformat}
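
To check whether other driver logs show the same pattern, a small script can flag any task that is started on an executor after that executor was reported lost. The following is only a sketch: the function name `find_reuse_after_loss` is hypothetical, the regular expressions are modeled on the two log lines quoted above and may not match other Spark versions, and it assumes each log entry sits on a single line (wrapped lines rejoined).

```python
import re

# Patterns modeled on the log lines quoted above (assumption: other Spark
# versions or schedulers may word these messages differently).
LOST = re.compile(r"Lost executor (\d+) on")
START = re.compile(
    r"Starting task (\S+) in stage (\S+) \(TID \d+, \S+, executor (\d+)"
)


def find_reuse_after_loss(lines):
    """Return (executor_id, task, stage) for every task started on an
    executor that an earlier line already reported as lost."""
    lost = set()
    reused = []
    for line in lines:
        m = LOST.search(line)
        if m:
            lost.add(m.group(1))
            continue
        m = START.search(line)
        if m and m.group(3) in lost:
            reused.append((m.group(3), m.group(1), m.group(2)))
    return reused


# Two entries taken from the snippet above, each on one line.
SAMPLE = [
    "18/05/20 05:37:07 ERROR YarnClusterScheduler: Lost executor 1100 on "
    "HOSTA: Executor heartbeat timed out after 345110 ms",
    "18/05/20 05:37:07 INFO TaskSetManager: Starting task 55.1 in stage 2.0 "
    "(TID 12225, HOSTA, executor 1100, partition 55, PROCESS_LOCAL, 6227 bytes)",
]

if __name__ == "__main__":
    for executor_id, task, stage in find_reuse_after_loss(SAMPLE):
        print(f"executor {executor_id} reused for task {task} in stage {stage}")
```

The script only reports the symptom (a task start following a loss report for the same executor id); it says nothing about why the executor was added back.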

> Heartbeat-timeout executor is added back and used again
> -------------------------------------------------------
>
>                 Key: SPARK-24387
>                 URL: https://issues.apache.org/jira/browse/SPARK-24387
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Rui Li
>            Priority: Major
>
> In our job, when there's only one task and one executor running, the 
> executor's heartbeat is lost and the driver decides to remove it. However, 
> the executor is added back and the task's retry attempt is scheduled on that 
> same executor almost immediately after it is marked as lost.


