GitHub user lirui-apache opened a pull request:
https://github.com/apache/spark/pull/21486
[SPARK-24387][Core] Heartbeat-timeout executor is added back and used again
## What changes were proposed in this pull request?
When an executor's heartbeat times out, we call scheduler.executorLost before
we tell the backend to kill the executor. In executorLost, TaskSchedulerImpl
asks the backend to revive offers. If this is the only executor, the backend
may offer it to TaskSchedulerImpl again, and the retried task is then
scheduled on that same executor.
This patch proposes to call scheduler.executorLost after the executor is
killed. At this point, the executor has been marked as pending-to-remove and
won't be offered again.
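
The ordering issue described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyBackend`, `ToyScheduler`, and `HeartbeatOrderingSketch.expire` are hypothetical names invented for illustration and are not Spark classes or APIs. The model assumes that killing an executor merely marks it as pending-to-remove, and that reviving offers happens synchronously inside `executorLost`.

```scala
import scala.collection.mutable

// Toy stand-in for the scheduler backend (hypothetical, not Spark's class).
// It tracks registered executors and those marked pending-to-remove.
final class ToyBackend {
  private val registered = mutable.LinkedHashSet[String]()
  private val pendingToRemove = mutable.Set[String]()

  def register(execId: String): Unit = registered += execId

  // Assumption: a kill request only flags the executor; it stays registered.
  def killExecutor(execId: String): Unit = pendingToRemove += execId

  // Executors that are pending removal are never offered.
  def offers(): List[String] =
    registered.toList.filterNot(id => pendingToRemove.contains(id))
}

// Toy stand-in for the task scheduler (hypothetical, not TaskSchedulerImpl).
final class ToyScheduler(backend: ToyBackend) {
  val placements = mutable.ArrayBuffer[String]()

  // Losing an executor revives offers; the retried task is placed on the
  // first executor the backend offers, if there is one.
  def executorLost(execId: String): Unit =
    backend.offers().headOption.foreach(id => placements += id)
}

object HeartbeatOrderingSketch {
  // Simulates a heartbeat timeout for the only executor and returns the
  // executors on which the retried task was placed.
  def expire(execId: String, killFirst: Boolean): List[String] = {
    val backend = new ToyBackend
    backend.register(execId)
    val scheduler = new ToyScheduler(backend)
    if (killFirst) {
      // Ordering proposed by this patch: kill, then report the loss.
      backend.killExecutor(execId)
      scheduler.executorLost(execId)
    } else {
      // Ordering before this patch: report the loss, then kill.
      scheduler.executorLost(execId)
      backend.killExecutor(execId)
    }
    scheduler.placements.toList
  }

  def main(args: Array[String]): Unit = {
    println(expire("exec-1", killFirst = false))
    println(expire("exec-1", killFirst = true))
  }
}
```

In this model, `expire("exec-1", killFirst = false)` places the retried task back on `exec-1`, while `killFirst = true` yields no placement because the executor is already excluded from offers.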
## How was this patch tested?
Added a new test case in HeartbeatReceiverSuite. Without the fix, this test
case fails.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/lirui-apache/spark SPARK-24387
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21486.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21486
----
commit 189f2696dab47a23b3f2a48a313a72dc4ec77c80
Author: Rui Li <lirui@...>
Date: 2018-06-02T08:25:10Z
Call executorLost after the executor is killed
----