Github user mwws commented on the pull request:
https://github.com/apache/spark/pull/8760#issuecomment-160544588
@kayousterhout For example, a task can fail because a host can't reach the data
source or the network, because a host's local storage is full, because of broken
class linkage, or because of other run-time exceptions. In such cases the worker
stays alive and the cluster manager can't help much, so further tasks will still
be submitted to the problematic executors/nodes. The original code base already
contains a similar blacklist mechanism, `executorIsBlacklisted`, but the blacklist
is maintained separately in every TaskSetManager (as `failedExecutors`) and is not
shared, which means a new TaskSet cannot benefit from the experience of previous
TaskSets; see the sketch below.
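
A simplified sketch of that per-TaskSetManager bookkeeping, with the surrounding
class trimmed down and the timeout handling reduced to a single parameter (the
`failedExecutors` field and `executorIsBlacklisted` method names follow the
existing code; everything else is illustrative):

```scala
import scala.collection.mutable.HashMap

class TaskSetManagerSketch(blacklistTimeoutMs: Long) {

  // task index -> (executor id -> time of the last failure on that executor).
  // This map is private to one TaskSetManager and never shared.
  private val failedExecutors = new HashMap[Int, HashMap[String, Long]]()

  def recordFailure(taskIndex: Int, execId: String, now: Long): Unit = {
    failedExecutors.getOrElseUpdate(taskIndex, new HashMap[String, Long]())
      .put(execId, now)
  }

  // Mirrors `executorIsBlacklisted`: a task avoids an executor only if
  // *this* task set has already seen it fail there recently.
  def executorIsBlacklisted(execId: String, taskIndex: Int, now: Long): Boolean = {
    failedExecutors.get(taskIndex).flatMap(_.get(execId)) match {
      case Some(failedAt) => now - failedAt < blacklistTimeoutMs
      case None           => false
    }
  }
}
```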
This change adds complexity, but from the user's perspective they can simply use
the built-in `SingleNodeStrategy`, which behaves the same as the original logic,
or switch to the built-in `ExecutorAndNodeStrategy` with nothing more than a
configuration change. At the same time, it gives advanced users the flexibility
to customize the strategy, roughly along the lines sketched below.
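
A rough sketch of what a pluggable strategy could look like; the trait name, the
method signatures, and the node-level threshold are assumptions for illustration
rather than the exact interface in this PR, and only the two strategy class names
come from the description above:

```scala
// Hypothetical strategy interface: given per-executor failure counts, decide
// which executors and nodes to avoid when scheduling further tasks.
trait BlacklistStrategySketch {
  def blacklistedExecutors(failures: Map[String, Int]): Set[String]
  def blacklistedNodes(executorToHost: Map[String, String],
                       failures: Map[String, Int]): Set[String]
}

// Mirrors the original behaviour described above: only executors that have
// already seen a failure are avoided, and nodes are never blacklisted.
class SingleNodeStrategy extends BlacklistStrategySketch {
  override def blacklistedExecutors(failures: Map[String, Int]): Set[String] =
    failures.filter(_._2 >= 1).keySet

  override def blacklistedNodes(executorToHost: Map[String, String],
                                failures: Map[String, Int]): Set[String] =
    Set.empty
}

// Additionally blacklists a whole node once several of its executors have
// failed (the threshold of 2 is an arbitrary illustrative value).
class ExecutorAndNodeStrategy(nodeThreshold: Int = 2) extends BlacklistStrategySketch {
  override def blacklistedExecutors(failures: Map[String, Int]): Set[String] =
    failures.filter(_._2 >= 1).keySet

  override def blacklistedNodes(executorToHost: Map[String, String],
                                failures: Map[String, Int]): Set[String] = {
    val bad = blacklistedExecutors(failures)
    executorToHost
      .filter { case (exec, _) => bad.contains(exec) }
      .groupBy(_._2)
      .collect { case (host, execs) if execs.size >= nodeThreshold => host }
      .toSet
  }
}
```

The point is that the strategy, rather than each individual TaskSetManager, owns
the shared failure state, so switching between the two behaviours becomes a
configuration choice instead of a code change.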