GitHub user juanrh opened a pull request:
https://github.com/apache/spark/pull/19590
[WIP][SPARK-22148][CORE] TaskSetManager.abortIfCompletelyBlacklisted should
not abort when all current executors are blacklisted but dynamic allocation is
enabled
## What changes were proposed in this pull request?
I've been working on this issue and would like your feedback on the
following approach. Instead of failing in
`TaskSetManager.abortIfCompletelyBlacklisted` when a task cannot be scheduled
on any executor but dynamic allocation is enabled, we register the task with
`ExecutorAllocationManager`. `ExecutorAllocationManager` then requests
additional executors for these "unschedulable tasks" by increasing the value
returned by `ExecutorAllocationManager.maxNumExecutorsNeeded`. This counts
these tasks twice, which is intentional: the current executors have no usable
slot for them, so we want new executors that are able to run them. To avoid a
deadlock when tasks stay unschedulable forever, we store the timestamp at
which each task was registered as unschedulable, and in
`ExecutorAllocationManager.schedule` we abort the application if any task has
been unschedulable for longer than a configurable age threshold. This gives
dynamic allocation a chance to obtain executors that are able to run the
tasks, without making the application wait forever.
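
To make the bookkeeping concrete for discussion, here is a minimal,
self-contained Scala sketch of the idea. It is not the code in this patch and
does not use Spark's real classes: `UnschedulableTaskTracker`, its methods, and
the `tasksPerExecutor` and `maxAgeMs` parameters are hypothetical names chosen
for illustration, and time is passed in explicitly to keep the example
deterministic.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the state this proposal would add to
// ExecutorAllocationManager; names and signatures are assumptions.
final class UnschedulableTaskTracker(tasksPerExecutor: Int, maxAgeMs: Long) {
  require(tasksPerExecutor > 0, "tasksPerExecutor must be positive")

  // task id -> time (ms) at which the task was first registered
  private val registeredAt = mutable.Map.empty[Long, Long]

  // Record a task that no current executor can run. The first timestamp
  // is kept, so re-registering a task does not reset its age.
  def register(taskId: Long, nowMs: Long): Unit = {
    if (!registeredAt.contains(taskId)) {
      registeredAt(taskId) = nowMs
    }
  }

  // Forget a task once it has been scheduled or its task set is gone.
  def unregister(taskId: Long): Unit = {
    registeredAt.remove(taskId)
    ()
  }

  // Registered tasks are already part of pendingTasks, so adding them
  // again counts them twice on purpose: the extra demand is what asks
  // for executors beyond the blacklisted ones.
  def maxNumExecutorsNeeded(pendingTasks: Int, runningTasks: Int): Int = {
    val total = pendingTasks + runningTasks + registeredAt.size
    (total + tasksPerExecutor - 1) / tasksPerExecutor // ceiling division
  }

  // Tasks that have been unschedulable for longer than the threshold;
  // a non-empty result is the signal to abort instead of waiting forever.
  def expired(nowMs: Long): Seq[Long] =
    registeredAt.collect { case (id, t) if nowMs - t > maxAgeMs => id }.toSeq.sorted
}
```

In this sketch the periodic `schedule` step would call `expired(now)` and abort
if the result is non-empty, and otherwise use `maxNumExecutorsNeeded` as the
executor target.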
## How was this patch tested?
This is a WIP for discussion; unit tests will be provided later.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/juanrh/spark hortala-SPARK-22148
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19590.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19590
----
commit 7baa51e5cbafc93a0fe56eb11d8c76c69b51c893
Author: Juan Rodriguez Hortala <[email protected]>
Date: 2017-10-21T00:40:07Z
first prototype
----