GitHub user Ngone51 opened a pull request:
https://github.com/apache/spark/pull/20998
[SPARK-23888][CORE] speculative task should not run on a given host where
another attempt is already running on
## What changes were proposed in this pull request?
There's a bug in:
```
/** Check whether a task is currently running an attempt on a given host */
private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = {
  taskAttempts(taskIndex).exists(_.host == host)
}
```
This check returns true for any host that has ever run an attempt of the
task, so hosts which only have finished attempts are ignored as well.
Instead, we should check whether an attempt is currently running on the
given host. A speculative task can then run on a host where an earlier
attempt of the same task has already failed.
Assume we have only two machines, host1 and host2. We first run task0.0 on
host1. Because task0.0 takes a long time, we launch a speculative attempt,
task0.1, on host2. task0.0 then fails on host1, but it cannot be re-run
since a copy is already running on host2. After another long wait, we launch
a new speculative attempt, task0.2. Now task0.2 can run on host1 again,
since host1 no longer has a running attempt.
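The two-host scenario can be modelled with a minimal, self-contained sketch.
It is not the patch itself: `AttemptInfo`, `SpeculationSketch` and
`hasRunningAttemptOnHost` are hypothetical names used only for illustration,
and the `running` flag is an assumption about what an attempt record exposes.

```scala
// Hypothetical, simplified stand-in for a task attempt record.
final case class AttemptInfo(attemptNumber: Int, host: String, running: Boolean)

object SpeculationSketch {
  /** Existing check: true if any attempt, running or finished, used the host. */
  def hasAttemptOnHost(attempts: Seq[AttemptInfo], host: String): Boolean =
    attempts.exists(_.host == host)

  /** Proposed behaviour: true only if an attempt is still running on the host. */
  def hasRunningAttemptOnHost(attempts: Seq[AttemptInfo], host: String): Boolean =
    attempts.exists(a => a.running && a.host == host)

  def main(args: Array[String]): Unit = {
    // task0.0 has failed on host1; speculative task0.1 is still running on host2.
    val attempts = Seq(
      AttemptInfo(0, "host1", running = false),
      AttemptInfo(1, "host2", running = true))

    println(hasAttemptOnHost(attempts, "host1"))        // true: host1 is rejected
    println(hasRunningAttemptOnHost(attempts, "host1")) // false: host1 is allowed
    println(hasRunningAttemptOnHost(attempts, "host2")) // true: host2 is rejected
  }
}
```

With the stricter check, host1 becomes eligible for task0.2 while host2, which
still has a live copy, stays excluded.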
## How was this patch tested?
Added.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Ngone51/spark SPARK-23888
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20998.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20998
----
commit 3584a09410d72207d8a00dcbf385b2e276925fc8
Author: wuyi <ngone_5451@...>
Date: 2018-04-06T15:34:18Z
add task attempt running check in hasAttemptOnHost
----