Lijia Liu created SPARK-25211:
---------------------------------
Summary: speculation combined with fetch failure results in job hang
Key: SPARK-25211
URL: https://issues.apache.org/jira/browse/SPARK-25211
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.2.2
Reporter: Lijia Liu
In the current `DAGScheduler.handleTaskCompletion` code, when a ShuffleMapStage
that has an active job is not in `runningStages` while its `pendingPartitions`
set is empty, the job attached to that ShuffleMapStage will never complete.
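For context, the relevant control flow in `DAGScheduler.handleTaskCompletion` looks roughly like the following. This is a paraphrased sketch of the Spark 2.2 code with surrounding details elided, not a verbatim excerpt:

```scala
// Success case for a ShuffleMapTask inside DAGScheduler.handleTaskCompletion
// (paraphrased sketch of the Spark 2.2 code; surrounding details elided).
case smt: ShuffleMapTask =>
  val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
  // The map output is registered even when the stage is no longer running.
  shuffleStage.addOutputLoc(smt.partitionId, status)

  // Completion is gated on runningStages: if a FetchFailed has already moved
  // the stage back to waitingStages, a late speculative Success registers its
  // output above but never reaches the job-completion logic below.
  if (runningStages.contains(shuffleStage) && shuffleStage.pendingPartitions.isEmpty) {
    markStageAsFinished(shuffleStage)
    // ... register map outputs, mark map-stage jobs as finished, etc. ...
  }
```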
**Consider the following sequence:**
1. Stage 0 runs and generates shuffle output data.
2. Stage 1 reads the output of stage 0 and generates more shuffle data. Because
of speculation, it has two task attempts for the same partition:
ShuffleMapTask0 and its speculative copy ShuffleMapTask0.1.
3. ShuffleMapTask0 fails to fetch its shuffle blocks and sends a FetchFailed to
the driver. The driver resubmits stage 0 and stage 1, placing stage 0 in
`runningStages` and stage 1 in `waitingStages`.
4. ShuffleMapTask0.1 finishes successfully and sends a Success back to the
driver. The driver adds its map status to the set of output locations of stage
1, but because stage 1 is not in `runningStages`, the stage- and job-completion
logic is skipped.
5. Stage 0 completes and the driver resubmits stage 1. But because the output
locations of stage 1 are already complete, the driver submits no tasks and
marks stage 1 as finished immediately (line 1074). Since completing the job
relies on a `CompletionEvent`, and no `CompletionEvent` will ever arrive, the
job hangs; the toy model after this list replays exactly this sequence.
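The hang can be reproduced with a small, self-contained model of the scheduler state. Everything below (class and method names included) is a hypothetical simplification for illustration, not Spark's actual API; it only mirrors the `runningStages` guard and the no-tasks branch of `submitMissingTasks` described above:

```scala
import scala.collection.mutable

// Toy model of the DAGScheduler state involved in SPARK-25211. All names are
// simplified stand-ins for Spark internals, not the real classes or methods.
object Spark25211Model {

  final class Stage(val id: Int, val numPartitions: Int) {
    val outputLocs = mutable.Map.empty[Int, String] // partition -> map status
    def missingPartitions: Seq[Int] =
      (0 until numPartitions).filterNot(outputLocs.contains)
  }

  val runningStages = mutable.Set.empty[Stage]
  val waitingStages = mutable.Set.empty[Stage]
  var jobCompleted = false

  // Mirrors the Success branch of handleTaskCompletion: the map output is
  // always registered, but job completion is guarded by runningStages.
  def handleSuccess(stage: Stage, partition: Int, status: String): Unit = {
    stage.outputLocs(partition) = status
    if (runningStages.contains(stage) && stage.missingPartitions.isEmpty) {
      runningStages -= stage
      jobCompleted = true // markMapStageJobAsFinished in the real scheduler
    }
  }

  // Mirrors submitMissingTasks: with no missing partitions, no tasks are
  // launched, so no CompletionEvent will ever be posted for this stage, and
  // this branch never finishes the attached job.
  def submitStage(stage: Stage): Unit = {
    waitingStages -= stage
    runningStages += stage
    if (stage.missingPartitions.isEmpty) {
      runningStages -= stage // markStageAsFinished, but job stays incomplete
    }
    // else: launch tasks for the missing partitions
  }

  def main(args: Array[String]): Unit = {
    val stage1 = new Stage(id = 1, numPartitions = 1)
    runningStages += stage1

    // Step 3: FetchFailed from ShuffleMapTask0 demotes stage 1 to waiting.
    runningStages -= stage1
    waitingStages += stage1

    // Step 4: the speculative copy ShuffleMapTask0.1 succeeds while stage 1
    // is waiting: its output is registered but the job is not completed.
    handleSuccess(stage1, partition = 0, status = "mapstatus-0.1")

    // Step 5: stage 0 finishes and stage 1 is resubmitted with nothing to do.
    submitStage(stage1)

    println(s"jobCompleted = $jobCompleted") // prints false: the job hangs
  }
}
```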