Github user mridulm commented on the pull request:
https://github.com/apache/spark/pull/10045#issuecomment-160879820
@kayousterhout I would have preferred if this "feature" had actually been
fixed properly by now - it was added as a temporary workaround (and so
intentionally left undocumented so that we could remove it later) to an
immediate problem for which we did not have a principled solution at the time.
@davies The proposed change (to infinitely blacklist an executor) is not
sound IMO. Task failures are usually due to transient reasons - temporary
disk or memory pressure, etc. - while infinite blacklisting makes assumptions
about locality-level timeouts, the cost-benefit of retrying versus moving an
RDD partition to another node, and so on.
On the other hand, I fairly commonly see users complaining about needing to
blacklist nodes as opposed to executors - which is something we have seen be an
issue too (bad memory or disk, for example, causing most tasks on a node to fail).
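To make the trade-off concrete, a time-bounded blacklist that covers both the executor and node levels could look roughly like this. This is a minimal Python sketch with made-up class names and thresholds (`BlacklistTracker`, `max_failures_per_executor`, etc.), not Spark's actual scheduler code: entries expire after a timeout, so transient failures age out instead of blacklisting an executor forever.

```python
import time
from collections import defaultdict

class BlacklistTracker:
    """Illustrative time-bounded blacklist (names/thresholds are hypothetical,
    not Spark's API). An executor or node is only blacklisted for timeout_s
    seconds, so transient failures (temporary disk/memory pressure) age out."""

    def __init__(self, max_failures_per_executor=2,
                 max_failed_executors_per_node=2,
                 timeout_s=3600.0, clock=time.time):
        self.max_failures_per_executor = max_failures_per_executor
        self.max_failed_executors_per_node = max_failed_executors_per_node
        self.timeout_s = timeout_s
        self.clock = clock
        self.failures = defaultdict(int)   # executor_id -> failure count
        self.executor_expiry = {}          # executor_id -> blacklist expiry time
        self.node_of = {}                  # executor_id -> node it runs on
        self.node_expiry = {}              # node -> blacklist expiry time

    def record_task_failure(self, executor_id, node):
        now = self.clock()
        self.node_of[executor_id] = node
        self.failures[executor_id] += 1
        if self.failures[executor_id] >= self.max_failures_per_executor:
            self.executor_expiry[executor_id] = now + self.timeout_s
            # Escalate to node-level blacklisting once enough executors
            # on the same node have gone bad (e.g. failing disk or memory).
            bad_on_node = [e for e, exp in self.executor_expiry.items()
                           if exp > now and self.node_of.get(e) == node]
            if len(bad_on_node) >= self.max_failed_executors_per_node:
                self.node_expiry[node] = now + self.timeout_s

    def is_executor_blacklisted(self, executor_id):
        now = self.clock()
        return (self.executor_expiry.get(executor_id, 0) > now
                or self.node_expiry.get(self.node_of.get(executor_id), 0) > now)

    def is_node_blacklisted(self, node):
        return self.node_expiry.get(node, 0) > self.clock()
```

The key design point versus the PR's proposal is the expiry: because entries time out, a node that had a transient problem rejoins scheduling on its own, while a persistently bad node keeps re-entering the blacklist as its tasks keep failing.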
It would be great if the current hack were removed entirely and we had a
better, more principled solution to this problem, one that handles executor,
node, and rack (required?) blacklisting - rather than duct-taping it further.
But since I have not been an active contributor of late, I might be
oversimplifying the effort required :-) Something similar to what I did for
task scheduling, perhaps?