[ https://issues.apache.org/jira/browse/SPARK-24387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490293#comment-16490293 ]
Rui Li commented on SPARK-24387:
--------------------------------
A snippet of the driver log, with some fields masked:
{noformat}
[Stage 2:======================================================>(199 + 1) / 200]
18/05/20 05:37:07 WARN HeartbeatReceiver: Removing executor 1100 with no recent heartbeats: 345110 ms exceeds timeout 300000 ms
18/05/20 05:37:07 ERROR YarnClusterScheduler: Lost executor 1100 on HOSTA: Executor heartbeat timed out after 345110 ms
18/05/20 05:37:07 WARN TaskSetManager: Lost task 55.0 in stage 2.0 (TID 12080, HOSTA, executor 1100): ExecutorLostFailure (executor 1100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 345110 ms
18/05/20 05:37:07 INFO DAGScheduler: Executor lost: 1100 (epoch 2)
18/05/20 05:37:07 INFO DAGScheduler: Host added was in lost list earlier: HOSTA
18/05/20 05:37:07 INFO TaskSetManager: Starting task 55.1 in stage 2.0 (TID 12225, HOSTA, executor 1100, partition 55, PROCESS_LOCAL, 6227 bytes)
{noformat}
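
For anyone who wants to check their own driver logs for the same pattern, here is a minimal sketch of a scanner. It is not part of Spark: the function name `find_reused_executors` and both regular expressions are assumptions derived only from the log lines quoted above, so other Spark versions or cluster managers may word these messages differently.

```python
import re

# Patterns assumed from the log snippet above; they are not a stable Spark format.
LOST_RE = re.compile(r"Lost executor (\d+) on (\S+):")
START_RE = re.compile(
    r"Starting task (\S+) in stage (\S+) \(TID\s+(\d+),\s*(\S+), executor (\d+),"
)


def find_reused_executors(log_text):
    """Return (executor_id, task, tid) for every task started on an
    executor after that executor was reported lost earlier in the log."""
    # Log lines may be hard-wrapped (as in this email), so collapse all
    # whitespace before matching.
    text = " ".join(log_text.split())

    events = []
    for m in LOST_RE.finditer(text):
        events.append((m.start(), "lost", m.group(1), None, None))
    for m in START_RE.finditer(text):
        events.append((m.start(), "start", m.group(5), m.group(1), m.group(3)))
    # Replay the events in the order they appear in the log.
    events.sort(key=lambda e: e[0])

    lost = set()
    reused = []
    for _, kind, executor_id, task, tid in events:
        if kind == "lost":
            lost.add(executor_id)
        elif executor_id in lost:
            reused.append((executor_id, task, tid))
    return reused
```

Calling `find_reused_executors` on the text of a driver log returns a list of tuples of strings. The sketch treats any later task start on a lost executor ID as suspicious and does not try to recognise a legitimate re-registration.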
> Heartbeat-timeout executor is added back and used again
> -------------------------------------------------------
>
> Key: SPARK-24387
> URL: https://issues.apache.org/jira/browse/SPARK-24387
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Rui Li
> Priority: Major
>
> In our job, when only one task and one executor are running, the
> executor's heartbeat is lost and the driver decides to remove it. However,
> the executor is added back and the task's retry attempt is scheduled on that
> executor almost immediately after it is marked as lost.