GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/16189

    [SPARK-18761][CORE][WIP] Introduce "task reaper" to oversee task killing in 
executors

    ## What changes were proposed in this pull request?
    
    Spark's current task cancellation / task killing mechanism is "best effort" 
because some tasks may not be interruptible or may not respond to their 
"killed" flags being set. If a significant fraction of a cluster's task slots 
are occupied by tasks that have been marked as killed but are still running, 
new jobs and tasks can be starved of the resources held by these zombie 
tasks.
    
    This patch aims to address this problem by adding a "task reaper" mechanism 
to executors. At a high level, task killing now launches a new thread that 
attempts to kill the task, then monitors it and periodically checks whether 
it has actually stopped. The TaskReaper periodically re-attempts the call to 
`TaskRunner.kill()` and logs warnings if the task keeps running. I modified 
TaskRunner to rename its thread at the start of the task, which allows 
TaskReaper to take a thread dump and filter it in order to log stack traces 
from the tasks that we are waiting on. After a configurable timeout, if the 
task still has not stopped, the TaskReaper throws an exception to trigger 
executor JVM death, thereby forcibly freeing any resources consumed by the 
zombie tasks.
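
    The reaper loop described above can be pictured roughly as follows. This 
is a minimal Java sketch for illustration only, not the code in this patch: 
the `KillableTask` interface, the `TaskReaperSketch` class, the thread name 
`task-42-worker`, and the `onTimeout` callback (standing in for "throw an 
exception to bring down the executor JVM") are all hypothetical names, and the 
polling interval and timeout values are arbitrary.

```java
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical stand-in for the parts of TaskRunner that a reaper needs. */
interface KillableTask {
    void kill();           // request cancellation; the task may ignore it
    boolean isFinished();  // true once the task body has returned
    String threadName();   // name the task assigned to its worker thread
}

/** Illustrative reaper loop; an assumption-laden sketch, not Spark's code. */
final class TaskReaperSketch implements Runnable {
    private final KillableTask task;
    private final long pollIntervalMs;
    private final long killTimeoutMs;
    private final Consumer<String> onTimeout; // stand-in for killing the JVM

    TaskReaperSketch(KillableTask task, long pollIntervalMs, long killTimeoutMs,
                     Consumer<String> onTimeout) {
        this.task = task;
        this.pollIntervalMs = pollIntervalMs;
        this.killTimeoutMs = killTimeoutMs;
        this.onTimeout = onTimeout;
    }

    @Override
    public void run() {
        final long startNs = System.nanoTime();
        while (!task.isFinished()) {
            task.kill(); // re-attempt the kill on every iteration
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            if (task.isFinished()) {
                return;
            }
            long elapsedMs = (System.nanoTime() - startNs) / 1_000_000L;
            System.err.println("WARN: killed task still running after "
                    + elapsedMs + " ms");
            logTaskStackTrace();
            if (elapsedMs >= killTimeoutMs) {
                onTimeout.accept("task on thread '" + task.threadName()
                        + "' did not stop within " + killTimeoutMs + " ms");
                return;
            }
        }
    }

    /** Takes a full thread dump and prints only the task thread's frames. */
    private void logTaskStackTrace() {
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            if (e.getKey().getName().equals(task.threadName())) {
                for (StackTraceElement frame : e.getValue()) {
                    System.err.println("    at " + frame);
                }
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // A "zombie" task: it swallows interrupts and never checks a flag.
        Thread worker = new Thread(() -> {
            while (true) {
                try {
                    Thread.sleep(1_000);
                } catch (InterruptedException ignored) {
                    // deliberately ignore the kill request
                }
            }
        }, "task-42-worker");
        worker.setDaemon(true); // lets this demo JVM exit on its own
        worker.start();

        KillableTask task = new KillableTask() {
            public void kill() { worker.interrupt(); }
            public boolean isFinished() { return !worker.isAlive(); }
            public String threadName() { return worker.getName(); }
        };
        Thread reaper = new Thread(new TaskReaperSketch(task, 100, 500,
                reason -> System.err.println("ERROR: would kill executor JVM: "
                        + reason)), "task-reaper");
        reaper.start();
        reaper.join();
    }
}
```

    Filtering the thread dump by name is why the task thread needs a 
predictable name. The sketch uses a callback on timeout so that the demo does 
not terminate its own JVM.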
    
    There are some aspects of the design that I'd like to think about a bit 
more, but I've opened this as `[WIP]` now in order to solicit early feedback. 
I'll comment on some of my thoughts directly on the diff.
    
    ## How was this patch tested?
    
    Tested via a new test case in `JobCancellationSuite`, plus manual testing. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark cancellation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16189.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16189
    
----
commit 2c28594b980845bda1d4db7ae866a91caaad4fff
Author: Josh Rosen <[email protected]>
Date:   2016-12-07T06:17:38Z

    Add failing regression test.

commit a46f9c2436d533ff838674cb63e397d1007e34de
Author: Josh Rosen <[email protected]>
Date:   2016-12-07T06:18:43Z

    Add TaskReaper to executor.

----

