Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
(a) right, this is a behavior change ... it seemed fair since the earlier
behavior was undocumented, and I don't see a strong reason to maintain exactly
the same behavior as before. I think it's fair for us to change the behavior
here, though we should try to support the general use cases (as I was
discussing above). The timeout is not enforced at all in this PR (the only
reason it's here is that it was easier to pull those bits out of the full
change along with the rest).
(b) executor blacklisting is a somewhat odd middle ground, you're right.
One motivating case comes from YARN's bad disk detection -- it will exclude
the bad disk from future containers, but not from existing ones, so you can
end up with one node that has some good containers and some bad ones.
Admittedly this solution still isn't great in that case, since the default
confs will push the entire node into the blacklist after just 2 bad executors
(the relevant confs are sketched below). I've also seen executors behaving
badly while others on the same node are fine, without any clear reason, so
it's meant to handle those poorly understood cases as well. Admittedly, for
the main goals, things would work fine if we only had blacklisting at the
node level.
(c) -- yup, those changes are already in the larger PR.
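
For context, here is a rough sketch of the confs being discussed, assuming the
spark.blacklist.* names and defaults from the full blacklisting change; treat
the exact keys and values as illustrative rather than as part of this PR:

```scala
import org.apache.spark.SparkConf

// Illustrative only: blacklist-related confs assumed from the full change,
// not introduced by this PR. With a per-node threshold of 2 failed executors,
// a node whose bad disk produces just two bad executors gets blacklisted
// wholesale, even if its remaining executors are healthy.
val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  // executor-level blacklisting kicks in first ...
  .set("spark.blacklist.application.maxFailedTasksPerExecutor", "2")
  // ... and once 2 executors on a node are blacklisted, so is the whole node
  .set("spark.blacklist.application.maxFailedExecutorsPerNode", "2")
  // the timeout mentioned above; present in the full change, not enforced here
  .set("spark.blacklist.timeout", "1h")
```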