Github user vanzin commented on the issue:
https://github.com/apache/spark/pull/18213
I'm not so sure about this... this is a fundamental change in how the
feature works. With this change, there are only two cases where the AM will be
retried:
- the client-mode AM
- when YARN kills the driver (`EXIT_EARLY`)
Every other error will disable re-attempts. That includes the user app
errors you mention but it also includes infra issues (like an app failing
because some temporary issue with HDFS causing tasks to fail, for example).
I've seen the idea floating around about setting the max attempts to 1 by
default, and I think that's a better solution (at least for cluster mode). That
means users would have to opt-in to driver re-attempts, which mitigates a lot
of the problems with blindly re-attempting to run the whole application.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]