GitHub user jsoltren opened a pull request:

    https://github.com/apache/spark/pull/18604

    [SPARK-21219][CORE] Task retry occurs on same executor due to race co…

    …ndition with blacklisting
    
    There's a race condition in the current TaskSetManager where a failed task 
is added for retry (addPendingTask), and can asynchronously be assigned to an 
executor *prior* to the blacklist state (updateBlacklistForFailedTask), the 
result is the task might re-execute on the same executor.  This is particularly 
problematic if the executor is shutting down since the retry task immediately 
becomes a lost task (ExecutorLostFailure).  Another side effect is that the 
actual failure reason gets obscured by the retry task which never actually 
executed.  There are sample logs showing the issue in the 
https://issues.apache.org/jira/browse/SPARK-21219
    
    The fix is to change the ordering of the addPendingTask and 
updatingBlackListForFailedTask calls in TaskSetManager.handleFailedTask
    
    Implemented a unit test that verifies the task is black listed before it is 
added to the pending task.  Ran the unit test without the fix and it fails.  
Ran the unit test with the fix and it passes.
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.
    
    Author: Eric Vandenberg <[email protected]>
    
    Closes #18427 from ericvandenbergfb/blacklistFix.
    
    ## What changes were proposed in this pull request?
    
    This is a backport of the fix to SPARK-21219, already checked in as 96d58f2.
    
    ## How was this patch tested?
    
    Ran TaskSetManagerSuite tests locally.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jsoltren/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18604.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18604
    
----
commit 2ea00a58a18359f8916b7a9f5e56ae7bea9d1208
Author: Eric Vandenberg <[email protected]>
Date:   2017-07-10T06:40:20Z

    [SPARK-21219][CORE] Task retry occurs on same executor due to race 
condition with blacklisting
    
    There's a race condition in the current TaskSetManager where a failed task 
is added for retry (addPendingTask), and can asynchronously be assigned to an 
executor *prior* to the blacklist state (updateBlacklistForFailedTask), the 
result is the task might re-execute on the same executor.  This is particularly 
problematic if the executor is shutting down since the retry task immediately 
becomes a lost task (ExecutorLostFailure).  Another side effect is that the 
actual failure reason gets obscured by the retry task which never actually 
executed.  There are sample logs showing the issue in the 
https://issues.apache.org/jira/browse/SPARK-21219
    
    The fix is to change the ordering of the addPendingTask and 
updatingBlackListForFailedTask calls in TaskSetManager.handleFailedTask
    
    Implemented a unit test that verifies the task is black listed before it is 
added to the pending task.  Ran the unit test without the fix and it fails.  
Ran the unit test with the fix and it passes.
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.
    
    Author: Eric Vandenberg <[email protected]>
    
    Closes #18427 from ericvandenbergfb/blacklistFix.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to