Github user kayousterhout commented on the pull request:
https://github.com/apache/spark/pull/10045#issuecomment-161061306
After thinking about this (and some of the corner cases) more, I would
propose two steps here:
(1) For blacklisting executors per-task, I think @davies's proposal makes
the most sense: allow a task to run twice on a particular executor, and if it
fails both times on that executor, blacklist the executor for that task
permanently. The benefit of that proposal is that it works without requiring
the user to configure a blacklist timeout; a good timeout is very
workload-dependent, and choosing one puts an unnecessary burden on the user. I
think this handles the transient-failure issue relatively well, since the
executor is only blacklisted for the particular task that failed twice, so
there are plenty of future opportunities (with other tasks) for the executor to
keep running things. I do think it makes sense to have a per-task blacklist
that's more aggressive than the more general executor/host blacklist mechanism.
@mridulm does this seem reasonable?
(2) We still need a solution for handling executors / hosts that are more
permanently problematic, in which case the above approach of blacklisting only
for a particular task is too inefficient. Let's have that discussion on #8760,
to separate the two issues.
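The policy in step (1) could be sketched roughly as below. This is just an illustration of the bookkeeping, not Spark's actual scheduler code; the class and method names (`TaskBlacklist`, `record_failure`, `is_blacklisted`) and the constant are all hypothetical:

```python
from collections import defaultdict

# Allow a task to run (and fail) twice on a given executor before
# permanently blacklisting that executor for that task.
MAX_TASK_FAILURES_PER_EXECUTOR = 2

class TaskBlacklist:
    """Per-(task, executor) failure tracking -- hypothetical sketch."""

    def __init__(self):
        # (task_id, executor_id) -> number of failures of this task there
        self.failures = defaultdict(int)

    def record_failure(self, task_id, executor_id):
        self.failures[(task_id, executor_id)] += 1

    def is_blacklisted(self, task_id, executor_id):
        # No timeout involved: once the limit is hit, this executor is
        # blacklisted for this task permanently, but remains available
        # for every other task.
        return (self.failures[(task_id, executor_id)]
                >= MAX_TASK_FAILURES_PER_EXECUTOR)
```

Note that the state is keyed on the (task, executor) pair, which is what keeps the policy conservative: a flaky executor loses only the tasks that actually failed on it twice, never its eligibility for the rest of the task set.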