Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
@jerryshao are you actually seeing issues with this on real
customer/production jobs? How often? NM failure for us is very rare. I'm not
familiar with how Mesos would fail differently; the shuffle service there is
started as a separate service, correct?
We would definitely need to make sure that Spark's retries before actually
returning a fetch failure are good enough to handle cases like rolling
upgrades or intermittent shuffle issues, but with our defaults of 3 retries
at 5 seconds each, I'm not sure that would cover it.
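For reference, the defaults in question map to these Spark properties (a sketch; values shown are the stock defaults, which give roughly 3 × 5s ≈ 15 seconds of retrying before a FetchFailed is surfaced — likely shorter than a rolling NM restart):

```properties
# Number of times a shuffle fetch is retried on IO failure (default: 3)
spark.shuffle.io.maxRetries=3

# Wait between consecutive shuffle fetch retries (default: 5s)
spark.shuffle.io.retryWait=5s
```

Raising either value stretches the window the fetcher will tolerate a temporarily unavailable shuffle service, at the cost of slower failure detection for genuinely dead nodes.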
If people are seeing a need for this and hitting actual issues, then I'm ok
with the idea as long as we make it configurable so it can be turned off. I'm
not sure about blacklisting after a single fetch failure either. It would seem
better to blacklist only after a couple of tasks have hit the failure; that
way you would have better confidence it was really an issue with the node you
are fetching from.
cc @squito