GitHub user liutang123 opened a pull request:

    https://github.com/apache/spark/pull/22202

    [SPARK-25211][Core] speculation and fetch failed result in hang of job

    ## What changes were proposed in this pull request?
    
    In the current `DAGScheduler.handleTaskCompletion` code, when a 
shuffleMapStage that has a job is not in runningStages and its 
`pendingPartitions` is empty, the job of this shuffleMapStage will never 
complete.
    
    *Consider the following scenario:*
    
    1. Stage 0 runs and generates shuffle output data.
    
    2. Stage 1 reads the output from stage 0 and generates more shuffle data. 
Because of speculation, it has two tasks for the same partition: 
ShuffleMapTask0 and ShuffleMapTask0.1.
    
    3. ShuffleMapTask0 fails to fetch blocks and sends a FetchFailed to the 
driver. The driver resubmits stage 0 and stage 1. The driver will place stage 0 
in runningStages and place stage 1 in waitingStages.
    
    4. ShuffleMapTask0.1 finishes successfully and sends Success back to the 
driver. The driver adds the map status to the set of output locations of 
stage 1. Because stage 1 is not in runningStages, the job is not marked as 
complete.
    
    5. Stage 0 completes and the driver runs stage 1. However, because the 
output set of stage 1 is already complete, the driver does not submit any 
tasks and marks stage 1 as complete right away. Because job completion relies 
on a `CompletionEvent` and no `CompletionEvent` will ever arrive, the job 
hangs.
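
The five steps above can be replayed in a small toy model. This is only a sketch of the bookkeeping described in this pull request, not Spark's actual `DAGScheduler` code. The names `Stage`, `ToyScheduler`, `simulate`, and the job id `7` are all hypothetical and chosen for illustration. The model assumes that a job is marked finished only when a task success arrives for a stage that is currently in the running set.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Stage:
    stage_id: int
    num_partitions: int
    job_id: Optional[int] = None                       # job waiting on this stage, if any
    outputs: Set[int] = field(default_factory=set)     # partitions with registered map output
    pending: Set[int] = field(default_factory=set)     # partitions with tasks still in flight


class ToyScheduler:
    """Hypothetical, heavily simplified stand-in for the scheduler's state."""

    def __init__(self) -> None:
        self.running: Set[int] = set()
        self.waiting: Set[int] = set()
        self.finished_jobs: List[int] = []

    def submit(self, stage: Stage) -> int:
        """Launch tasks for partitions without output; return how many were launched."""
        self.waiting.discard(stage.stage_id)
        self.running.add(stage.stage_id)
        missing = set(range(stage.num_partitions)) - stage.outputs
        stage.pending = set(missing)
        if not missing:
            # Nothing to run: the stage is treated as finished immediately,
            # and no task completion will ever be reported for it.
            self.running.discard(stage.stage_id)
        return len(missing)

    def on_fetch_failed(self, failed: Stage, parent: Stage) -> None:
        """The parent's output is lost: rerun the parent, park the failed stage."""
        parent.outputs.clear()
        self.running.discard(failed.stage_id)
        self.waiting.add(failed.stage_id)
        self.submit(parent)

    def on_success(self, stage: Stage, partition: int) -> None:
        """Register map output; finish the job only if the stage is running."""
        stage.outputs.add(partition)
        stage.pending.discard(partition)
        if stage.stage_id in self.running and not stage.pending:
            self.running.discard(stage.stage_id)
            if stage.job_id is not None:
                self.finished_jobs.append(stage.job_id)


def simulate() -> Tuple[int, List[int]]:
    sched = ToyScheduler()
    stage0 = Stage(stage_id=0, num_partitions=1)
    stage1 = Stage(stage_id=1, num_partitions=1, job_id=7)

    sched.submit(stage0)                   # step 1: stage 0 runs
    sched.on_success(stage0, 0)
    sched.submit(stage1)                   # step 2: two attempts for partition 0
    sched.on_fetch_failed(stage1, stage0)  # step 3: first attempt hits FetchFailed
    sched.on_success(stage1, 0)            # step 4: speculative attempt succeeds
    sched.on_success(stage0, 0)            # step 5: stage 0 completes again
    launched = sched.submit(stage1)        # stage 1 is resubmitted with nothing to do
    return launched, sched.finished_jobs
```

In this model `simulate()` launches zero tasks on the final resubmission of stage 1 and leaves the list of finished jobs empty, which corresponds to the hang described in step 5.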
    
    ## How was this patch tested?
    
    Unit tests (UT).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liutang123/spark SPARK-25211

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22202.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22202
    
----
commit 4f51199daafec0466a5ac836c4f6281f5ba45381
Author: liulijia <liutang123@...>
Date:   2018-08-23T13:42:13Z

    [SPARK-25211][Core] speculation and fetch failed result in hang of job

----

