Thomas Graves created SPARK-24909:
-------------------------------------
Summary: Spark scheduler can hang with fetch failures and executor
lost and multiple stage attempts
Key: SPARK-24909
URL: https://issues.apache.org/jira/browse/SPARK-24909
Project: Spark
Issue Type: Bug
Components: Scheduler
Affects Versions: 2.3.1
Reporter: Thomas Graves
The DAGScheduler can hang if an executor is lost (due to a fetch failure) and
all the tasks in the task sets are marked as completed
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265).
The task scheduler never creates new task attempts, but the DAGScheduler
still has pendingPartitions.
18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in stage
44.0 (TID 970752, host1.com, executor 33, partition 55769, PROCESS_LOCAL, 7874
bytes)
18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44
(repartition at Lift.scala:191) as failed due to a fetch failure from
ShuffleMapStage 42 (map at foo.scala:27)
18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 42
(map at foo.scala:27) and ShuffleMapStage 44 (repartition at bar.scala:191) due
to fetch failure
....
18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
18/07/22 08:30:56 INFO scheduler.DAGScheduler: Shuffle files lost for executor:
33 (epoch 18)
18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44
(MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing
parents
18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 with
59955 tasks
18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in stage
44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)
18/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus
ShuffleMapTask(44, 55769) completion from executor 33
In the logs above, task 55769.0 finished after the executor was lost and after
a new task set (44.1) was started. The DAGScheduler logs "Ignoring possibly
bogus" and discards the completion, but the TaskSetManager has already marked
those tasks as completed for all stage attempts. As a result, no attempt
reruns the partition while the DAGScheduler still lists it in
pendingPartitions.
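
The sequence can be reproduced with a small toy model. This is only a sketch of the bookkeeping described above, not Spark's code: the classes ToyTaskSetManager and ToyDAGScheduler and all of their methods are hypothetical, and the real DAGScheduler compares executor epochs rather than keeping a plain set of lost executors.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTaskSetManager:
    """Hypothetical stand-in for one stage attempt's task set."""
    stage_id: int
    attempt: int
    partitions: set
    successful: set = field(default_factory=set)

    def mark_partition_completed(self, partition):
        # Assumption taken from the report: a success is recorded for the
        # partition no matter which stage attempt produced it.
        if partition in self.partitions:
            self.successful.add(partition)

    def runnable_partitions(self):
        # Partitions this attempt would still launch tasks for.
        return self.partitions - self.successful


class ToyDAGScheduler:
    """Hypothetical stand-in tracking pending partitions for one stage."""

    def __init__(self):
        self.pending_partitions = set()
        self.lost_executors = set()

    def submit_stage(self, partitions):
        self.pending_partitions = set(partitions)

    def executor_lost(self, executor_id):
        self.lost_executors.add(executor_id)

    def handle_completion(self, partition, executor_id):
        # Mirrors "Ignoring possibly bogus ShuffleMapTask completion":
        # output from a lost executor is not counted.
        if executor_id in self.lost_executors:
            return False
        self.pending_partitions.discard(partition)
        return True


def simulate():
    dag = ToyDAGScheduler()
    dag.submit_stage({55769})
    attempt0 = ToyTaskSetManager(44, 0, {55769})  # task 55769.0 runs on executor 33

    dag.executor_lost(33)                         # fetch failure, executor 33 lost
    dag.submit_stage({55769})                     # stage 44 resubmitted
    attempt1 = ToyTaskSetManager(44, 1, {55769})  # task set 44.1

    # Late success of 55769.0 arrives from executor 33.
    for tsm in (attempt0, attempt1):
        tsm.mark_partition_completed(55769)
    accepted = dag.handle_completion(55769, 33)
    return accepted, attempt1.runnable_partitions(), dag.pending_partitions


if __name__ == "__main__":
    accepted, runnable, pending = simulate()
    print("completion accepted by DAG scheduler:", accepted)
    print("partitions attempt 1 will still run:", runnable)
    print("partitions DAG scheduler still waits for:", pending)
```

In this model the newest attempt has nothing left to run while the DAG scheduler still waits on partition 55769, which is the hang condition.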