GitHub user tgravescs opened a pull request:

    https://github.com/apache/spark/pull/21976

    [SPARK-24909] Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts
    
    ## What changes were proposed in this pull request?
    This PR reverts the change made in SPARK-19263, so that the scheduler always
    does shuffleStage.pendingPartitions -= task.partitionId. The change in
    SPARK-23433 should fix the issue originally reported in SPARK-19263.
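
To make the difference concrete, here is a minimal toy model in Scala. It is not the DAGScheduler code: `Task`, `ShuffleStage`, `latestAttemptId`, and both handler names are hypothetical stand-ins, and the guard in `onSuccessGuarded` (only clearing the partition when the task belongs to the latest stage attempt) is an assumed shape for a conditional removal, not a quote from SPARK-19263. The only line taken from the description above is the unconditional `pendingPartitions -= task.partitionId`.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for illustration; these are not Spark's real classes.
final case class Task(partitionId: Int, stageAttemptId: Int)

final class ShuffleStage(numPartitions: Int) {
  // Partitions this toy stage is still waiting on.
  val pendingPartitions: mutable.HashSet[Int] =
    mutable.HashSet.empty[Int] ++= (0 until numPartitions)
  // Assumed field: the id of the most recent attempt of this stage.
  var latestAttemptId: Int = 0
}

object PendingPartitionsSketch {
  // Assumed shape of a guarded removal: a successful task clears its
  // partition only if it belongs to the latest stage attempt.
  def onSuccessGuarded(stage: ShuffleStage, task: Task): Unit =
    if (task.stageAttemptId == stage.latestAttemptId) {
      stage.pendingPartitions -= task.partitionId
    }

  // Behaviour described in this PR: every successful task clears its partition.
  def onSuccessAlways(stage: ShuffleStage, task: Task): Unit =
    stage.pendingPartitions -= task.partitionId

  def main(args: Array[String]): Unit = {
    // A task from attempt 0 finishing after attempt 1 has started.
    val lateTask = Task(partitionId = 1, stageAttemptId = 0)

    val guarded = new ShuffleStage(numPartitions = 2)
    guarded.latestAttemptId = 1
    onSuccessGuarded(guarded, lateTask)
    println(guarded.pendingPartitions.contains(1)) // true: partition 1 stays pending

    val always = new ShuffleStage(numPartitions = 2)
    always.latestAttemptId = 1
    onSuccessAlways(always, lateTask)
    println(always.pendingPartitions.contains(1)) // false: partition 1 is cleared
  }
}
```

In this toy model, a late success from an older attempt leaves its partition pending under the guarded variant, which is the kind of state that could keep a stage from ever looking finished. With the unconditional removal, the partition is cleared regardless of which attempt the task came from.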
    
    ## How was this patch tested?
    
    Unit tests. The hang depends on a race that I have not been able to reproduce
    on demand; I only see it occasionally in customer jobs on a real cluster.
    I am also working on adding Spark scheduler integration tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tgravescs/spark SPARK-24909

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21976.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21976
    
----
commit 82243746fb8709c925bea97c25cb57c82cec8c2f
Author: Thomas Graves <tgraves@...>
Date:   2018-08-02T17:37:00Z

    [SPARK-24909] Spark scheduler can hang when fetch failures, executor lost, 
task running on lost executor, and multiple stage attempts

commit 54646148730462e34a32d81200530cf50dbf7a51
Author: Thomas Graves <tgraves@...>
Date:   2018-08-02T17:39:08Z

    add log message back

----

