GitHub user jsoltren opened a pull request:
https://github.com/apache/spark/pull/18604
[SPARK-21219][CORE] Task retry occurs on same executor due to race condition with blacklisting
There's a race condition in the current TaskSetManager where a failed task
is added back for retry (addPendingTask) and can asynchronously be assigned to an
executor *before* the blacklist state is updated (updateBlacklistForFailedTask);
as a result, the task may re-execute on the same executor. This is particularly
problematic if the executor is shutting down, since the retry immediately
becomes a lost task (ExecutorLostFailure). Another side effect is that the
actual failure reason gets obscured by the retry task, which never actually
executed. Sample logs showing the issue are attached to
https://issues.apache.org/jira/browse/SPARK-21219
The fix is to change the ordering of the addPendingTask and
updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask so that
the blacklist is updated before the task becomes schedulable again.
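As a rough illustration, below is a minimal, self-contained Scala sketch of the
race and the reordering. SimpleTaskSet, its fields, and its methods are
hypothetical stand-ins for the real TaskSetManager internals, not the actual
Spark code.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the scheduling state (not Spark's code).
class SimpleTaskSet {
  private val pendingTasks = mutable.Queue[Int]()                        // tasks awaiting (re)scheduling
  private val blacklistedExecs = mutable.Map[Int, mutable.Set[String]]() // taskId -> executors to avoid

  // Buggy ordering: the task is schedulable before the blacklist knows about
  // the failure, so a concurrent scheduling pass can hand the retry back to
  // the executor that just failed it.
  def handleFailedTaskBuggy(taskId: Int, execId: String): Unit = {
    pendingTasks.enqueue(taskId)
    blacklistedExecs.getOrElseUpdate(taskId, mutable.Set[String]()) += execId
  }

  // Fixed ordering: record the bad executor first, then expose the task for retry.
  def handleFailedTaskFixed(taskId: Int, execId: String): Unit = {
    blacklistedExecs.getOrElseUpdate(taskId, mutable.Set[String]()) += execId
    pendingTasks.enqueue(taskId)
  }

  // A scheduling pass only offers a task to an executor it is not blacklisted on.
  def offer(execId: String): Option[Int] =
    pendingTasks.dequeueFirst(t => !blacklistedExecs.get(t).exists(_.contains(execId)))
}
```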
Implemented a unit test that verifies the task is blacklisted before it is
added back to the pending task list. The unit test fails without the fix and
passes with it.
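For intuition only, a check against the hypothetical SimpleTaskSet sketch above
(this is not the actual TaskSetManagerSuite test) would look roughly like:

```scala
// With the fixed ordering, the executor that just failed the task never
// receives the retry, while a different executor does.
val ts = new SimpleTaskSet
ts.handleFailedTaskFixed(taskId = 0, execId = "exec-1")
assert(ts.offer("exec-1").isEmpty)      // blacklisted executor is skipped
assert(ts.offer("exec-2").contains(0))  // another executor gets the retry
```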
Please review http://spark.apache.org/contributing.html before opening a
pull request.
Author: Eric Vandenberg <[email protected]>
Closes #18427 from ericvandenbergfb/blacklistFix.
## What changes were proposed in this pull request?
This is a backport of the fix to SPARK-21219, already checked in as 96d58f2.
## How was this patch tested?
Ran TaskSetManagerSuite tests locally.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jsoltren/spark branch-2.2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18604.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18604
----
commit 2ea00a58a18359f8916b7a9f5e56ae7bea9d1208
Author: Eric Vandenberg <[email protected]>
Date: 2017-07-10T06:40:20Z
[SPARK-21219][CORE] Task retry occurs on same executor due to race
condition with blacklisting
There's a race condition in the current TaskSetManager where a failed task
is added back for retry (addPendingTask) and can asynchronously be assigned to an
executor *before* the blacklist state is updated (updateBlacklistForFailedTask);
as a result, the task may re-execute on the same executor. This is particularly
problematic if the executor is shutting down, since the retry immediately
becomes a lost task (ExecutorLostFailure). Another side effect is that the
actual failure reason gets obscured by the retry task, which never actually
executed. Sample logs showing the issue are attached to
https://issues.apache.org/jira/browse/SPARK-21219
The fix is to change the ordering of the addPendingTask and
updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask so that
the blacklist is updated before the task becomes schedulable again.
Implemented a unit test that verifies the task is blacklisted before it is
added back to the pending task list. The unit test fails without the fix and
passes with it.
Please review http://spark.apache.org/contributing.html before opening a
pull request.
Author: Eric Vandenberg <[email protected]>
Closes #18427 from ericvandenbergfb/blacklistFix.
----