Lijia Liu created SPARK-25211:
---------------------------------

             Summary: Speculation plus fetch failure can result in a hung job
                 Key: SPARK-25211
                 URL: https://issues.apache.org/jira/browse/SPARK-25211
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.2
            Reporter: Lijia Liu


In the current `DAGScheduler.handleTaskCompletion` code, when a shuffleMapStage
is not in `runningStages` and its `pendingPartitions` set is empty, the job
containing this shuffleMapStage will never complete.

**Consider the following scenario:**

1. Stage 0 runs and generates shuffle output data.

2. Stage 1 reads the output from stage 0 and generates more shuffle data. It
has two task attempts for the same partition: ShuffleMapTask0 and its
speculative copy ShuffleMapTask0.1.

3. ShuffleMapTask0 fails to fetch blocks and sends a FetchFailed to the driver.
The driver resubmits stage 0 and stage 1, placing stage 0 in `runningStages`
and stage 1 in `waitingStages`.

4. ShuffleMapTask0.1 finishes successfully and sends Success back to the
driver. The driver adds its map status to stage 1's set of output locations.
Because stage 1 is no longer in `runningStages`, this completion does not
finish the job.

5. Stage 0 completes and the driver resubmits stage 1. But because stage 1's
output locations are already all filled in, the driver submits no tasks and
marks stage 1 complete immediately (line 1074). Since job completion relies on
a `CompletionEvent`, and no `CompletionEvent` will ever arrive for stage 1, the
job hangs.
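The steps above can be sketched as a tiny state-machine model. This is a hypothetical, simplified illustration, not actual Spark code: the class name `SpeculationHangSketch`, the method `tasksLaunchedOnResubmit`, and the bare `Set<Integer>` stand-ins for `runningStages` and stage 1's output locations are all assumptions made for the sketch.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of the race described in steps 3-5 above.
// The names mirror DAGScheduler concepts but this is NOT actual Spark code.
public class SpeculationHangSketch {

    // Replays steps 3-5 and returns how many tasks the resubmitted stage 1
    // would launch. Zero launched tasks means no CompletionEvent will ever
    // arrive for stage 1, so the job hangs.
    static int tasksLaunchedOnResubmit() {
        Set<Integer> runningStages = new HashSet<>();
        runningStages.add(0);
        runningStages.add(1);

        int stage1Partitions = 1;                 // stage 1 has one partition
        Set<Integer> stage1OutputLocs = new HashSet<>();

        // Step 3: FetchFailed from ShuffleMapTask0 -> stage 1 leaves
        // runningStages while stage 0 is resubmitted.
        runningStages.remove(1);

        // Step 4: late Success from speculative ShuffleMapTask0.1 still
        // records its map output, even though stage 1 is no longer running.
        stage1OutputLocs.add(0);

        // Step 5: stage 0 finishes; stage 1 is resubmitted, but every
        // partition already has map output, so no tasks are submitted.
        return stage1Partitions - stage1OutputLocs.size();
    }

    public static void main(String[] args) {
        System.out.println("tasks to launch on resubmit: "
                + tasksLaunchedOnResubmit());
    }
}
```

With zero tasks to launch, nothing can ever emit the `CompletionEvent` the job is waiting on, which is the hang described above.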



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
