Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17445
There is a large discussion about how to handle fetch failures going on in
https://issues.apache.org/jira/browse/SPARK-20178. The fact that you got a
fetch failure does not mean that all blocks are invalid or that the external
shuffle service is totally down. It could very well be an intermittent problem.
There was also a PR to make the number of stage attempts configurable, so you
could increase that.
If a lot of people are seeing this issue, the question is whether we need to do
something shorter term to handle it while we are discussing SPARK-20178.
Certainly, if we are seeing more actual job failures because of it, it would be
better to invalidate all the output: the job possibly runs longer, but at least
it doesn't fail.
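
To make the trade-off above concrete, here is a toy model in Python. It is not
Spark code. It assumes (these are illustrative assumptions, not taken from the
scheduler source) that each stage attempt surfaces exactly one fetch failure,
that the stage is aborted after 4 attempts, and that `attempts_needed` and its
parameters are hypothetical names chosen for this sketch.

```python
def attempts_needed(lost_outputs, invalidate_all, max_attempts=4):
    """Return how many stage attempts the toy job uses, or None if it aborts.

    lost_outputs   -- ids of map outputs that can no longer be fetched
    invalidate_all -- True: one fetch failure invalidates every lost output;
                      False: only the single output that failed to fetch
    max_attempts   -- assumed stage attempt limit (hypothetical default of 4)
    """
    missing = set(lost_outputs)  # outputs that are really unreadable
    invalidated = set()          # outputs scheduled for recomputation
    for attempt in range(1, max_attempts + 1):
        # Outputs invalidated by the previous attempt have been recomputed.
        missing -= invalidated
        invalidated = set()
        if not missing:
            return attempt       # every fetch succeeds, the stage finishes
        # Toy assumption: only one fetch failure is reported per attempt.
        failed = min(missing)
        invalidated = set(missing) if invalidate_all else {failed}
    return None                  # attempt limit reached, the job fails


lost = range(6)  # six map outputs lost together, e.g. on one node
print(attempts_needed(lost, invalidate_all=False))  # None: job aborts
print(attempts_needed(lost, invalidate_all=True))   # 2: one retry is enough
```

Under these assumptions, per-block invalidation discovers the lost outputs one
retry at a time and runs out of attempts, while invalidating everything costs
extra recomputation but finishes on the second attempt. If the failure was
intermittent and only one output was really lost, both policies finish on the
second attempt, and invalidating everything only adds recomputation in cases
where it marks outputs that were still readable, which this toy model does not
represent.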