Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
Can you clarify the situations in which you are seeing issues? What happened
to the NM in this case? If you have work-preserving restart enabled, I would
think this would actually cause you more problems. The NM could temporarily be
down during a rolling upgrade, and if you blacklist it, it won't be used for a
long time.
We have seen issues with Tez and MR where blacklisting on fetch failures
caused more issues than it solved. Most of the fetch failures were transient,
and blacklisting caused far more work to be rerun than was actually needed.
This is why we explicitly left it out of the blacklisting feature. See the
design doc here in the jira https://issues.apache.org/jira/browse/SPARK-8425.
I didn't have a chance to do a full review, but you seem to be blacklisting
an executor. Which executor is this blacklisting? It looks like you are
immediately blacklisting on any fetch failure rather than allowing a
configurable number of failures?
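
To illustrate the "configurable number" point, here is a minimal sketch of
threshold-based blacklisting, where a host is blacklisted only after several
fetch failures inside a time window. This is not Spark code and not what the
PR does: the class name `FetchFailureTracker` and the parameters
`max_failures` and `window_secs` are hypothetical names chosen for the example.

```python
import time
from collections import defaultdict


class FetchFailureTracker:
    """Hypothetical tracker: blacklist a host only after repeated fetch failures."""

    def __init__(self, max_failures=3, window_secs=60.0, clock=time.monotonic):
        # max_failures and window_secs are assumed, illustrative settings.
        self.max_failures = max_failures
        self.window_secs = window_secs
        self.clock = clock
        self._failures = defaultdict(list)  # host -> failure timestamps

    def _recent(self, host):
        # Keep only failures that are still inside the time window.
        now = self.clock()
        return [t for t in self._failures.get(host, []) if now - t < self.window_secs]

    def record_failure(self, host):
        # Drop expired failures, then record the new one.
        recent = self._recent(host)
        recent.append(self.clock())
        self._failures[host] = recent

    def is_blacklisted(self, host):
        # A single transient failure does not blacklist the host.
        return len(self._recent(host)) >= self.max_failures
```

With a threshold above one, an NM that is briefly down during a rolling
upgrade would produce a failure or two and then age out of the window, rather
than being excluded on the first fetch failure.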