GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/16189
[SPARK-18761][CORE][WIP] Introduce "task reaper" to oversee task killing in
executors
## What changes were proposed in this pull request?
Spark's current task cancellation / task killing mechanism is "best effort"
because some tasks may not be interruptible or may not respond to their
"killed" flags being set. If a significant fraction of a cluster's task slots
are occupied by tasks that have been marked as killed but remain running, new
jobs and tasks can be starved of the resources held by these zombie tasks.
This patch aims to address this problem by adding a "task reaper" mechanism
to executors. At a high level, killing a task now launches a new thread which
attempts to kill the task and then watches it, periodically checking whether
it has finished. The TaskReaper periodically re-attempts the call to
`TaskRunner.kill()` and logs warnings if the task keeps running. I modified
TaskRunner to rename its thread at the start of the task, allowing TaskReaper
to take a thread dump and filter it in order to log stack traces from tasks
that we are waiting to finish. After a configurable timeout, if the task still
has not finished, the TaskReaper throws an exception to trigger executor JVM
death, thereby forcibly freeing any resources consumed by the zombie tasks.
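As a rough illustration of the flow described above (this is not the code in
the patch), a minimal Scala sketch might look like the following. `SketchTask`,
`TaskReaperSketch`, `reap`, `stackOf`, `ReapResult`, and the parameter names
are all hypothetical stand-ins rather than Spark APIs. The sketch also assumes
it is acceptable to return a result on timeout, whereas the patch is described
as throwing an exception so that the executor JVM dies.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical stand-in for a running task; this is not Spark's TaskRunner.
final class SketchTask(val threadName: String, body: AtomicBoolean => Unit) {
  private val killed = new AtomicBoolean(false)
  @volatile private var finished = false

  // The thread gets a recognizable name so a thread dump can be filtered by it.
  private val thread = new Thread(new Runnable {
    override def run(): Unit = {
      try {
        body(killed)
      } catch {
        case _: InterruptedException => () // interrupted while blocked
      } finally {
        finished = true
      }
    }
  }, threadName)
  thread.setDaemon(true)

  def start(): Unit = thread.start()

  // Best effort: set the flag and optionally interrupt; the task may ignore both.
  def kill(interrupt: Boolean): Unit = {
    killed.set(true)
    if (interrupt) thread.interrupt()
  }

  def isFinished: Boolean = finished
}

sealed trait ReapResult
case object TaskExited extends ReapResult
case object TimedOut extends ReapResult

object TaskReaperSketch {
  // Find one thread's stack in a JVM-wide thread dump, matching on thread name.
  def stackOf(threadName: String): Option[Seq[String]] = {
    val entries = Thread.getAllStackTraces().entrySet().iterator()
    var found: Option[Seq[String]] = None
    while (entries.hasNext && found.isEmpty) {
      val entry = entries.next()
      if (entry.getKey().getName() == threadName) {
        found = Some(entry.getValue().toSeq.map(_.toString))
      }
    }
    found
  }

  // Re-attempt the kill every pollIntervalMs until the task finishes or
  // killTimeoutMs elapses, logging the task's stack while it keeps running.
  def reap(task: SketchTask, pollIntervalMs: Long, killTimeoutMs: Long): ReapResult = {
    val deadline = System.nanoTime() + killTimeoutMs * 1000000L
    while (!task.isFinished && deadline - System.nanoTime() > 0) {
      task.kill(interrupt = true)
      Thread.sleep(pollIntervalMs)
      if (!task.isFinished) {
        System.err.println(s"WARN: ${task.threadName} still running after kill:")
        stackOf(task.threadName).getOrElse(Seq.empty).take(5)
          .foreach(frame => System.err.println(s"  at $frame"))
      }
    }
    // The patch is described as throwing here to bring down the executor JVM;
    // this sketch only reports the outcome and leaves that decision to the caller.
    if (task.isFinished) TaskExited else TimedOut
  }
}
```

In this sketch, a task that polls its `killed` flag or blocks in an
interruptible call should yield `TaskExited`, while a task that ignores both
yields `TimedOut`, which is the point at which a real reaper would escalate.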
There are some aspects of the design that I'd like to think about a bit
more, but I've opened this as `[WIP]` now in order to solicit early feedback.
I'll comment on some of my thoughts directly on the diff.
## How was this patch tested?
Tested via a new test case in `JobCancellationSuite`, plus manual testing.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark cancellation
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16189.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16189
----
commit 2c28594b980845bda1d4db7ae866a91caaad4fff
Author: Josh Rosen <[email protected]>
Date: 2016-12-07T06:17:38Z
Add failing regression test.
commit a46f9c2436d533ff838674cb63e397d1007e34de
Author: Josh Rosen <[email protected]>
Date: 2016-12-07T06:18:43Z
Add TaskReaper to executor.
----