Github user squito commented on the issue:
https://github.com/apache/spark/pull/17088
First, I think we should change the hard-coded limit of 4 stage retries.
It's clear to me there is an important reason why users would want a higher
limit, so let's make it a config. That is a very simple change. (That doesn't
mean we shouldn't be changing something else as well.)
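
To make the config idea concrete, here is a rough sketch of the shape it could take. This is not Spark code: the `StageAttemptTracker` class and the `scheduler.maxStageAttempts` key are made-up names for illustration, and I'm assuming the stage is aborted once the number of failed attempts reaches the limit.

```python
from dataclasses import dataclass, field

# The hard-coded limit discussed above, kept as the default.
DEFAULT_MAX_STAGE_ATTEMPTS = 4


@dataclass
class StageAttemptTracker:
    """Hypothetical helper (not Spark code): counts failed attempts per stage."""

    conf: dict = field(default_factory=dict)
    failed_attempts: dict = field(default_factory=dict)

    @property
    def max_attempts(self) -> int:
        # "scheduler.maxStageAttempts" is an invented key, not a real Spark setting.
        return int(self.conf.get("scheduler.maxStageAttempts",
                                 DEFAULT_MAX_STAGE_ATTEMPTS))

    def record_failure(self, stage_id: int) -> bool:
        """Record one failed attempt; return True if the stage should be aborted."""
        count = self.failed_attempts.get(stage_id, 0) + 1
        self.failed_attempts[stage_id] = count
        return count >= self.max_attempts
```

With an empty conf this behaves like the fixed limit of 4; a user who needs more headroom would only have to set the one key.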
As with https://github.com/apache/spark/pull/17113, though this is a big
change, it actually seems more consistent for Spark. Of course some
failures are transient, but (as has already been pointed out) (a) even the
existing behavior will make you do unnecessary work for transient failures and
(b) this just slightly increases the amount of work that has to be repeated for
those transient failures.
I'm also wondering if there are other options, e.g.:
* removing all output only if there are multiple fetch failures, spread out
  far enough in time
* waiting to retry a stage, instead of retrying after one fetch failure.
  * there is already a 200 ms delay before the stage gets retried -- we
    could make that configurable
    https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1317
  * or wait for multiple failures
* multiple running tasksets for one stage (as @sitalkedia has proposed
  doing separately)
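
A rough sketch of how the "wait before retrying" options could fit together, purely for discussion. Nothing here is existing Spark code: `ResubmitPolicy`, its field names, and the 10 s `max_wait_ms` default are all assumptions. The 200 ms default mirrors the existing delay mentioned above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResubmitPolicy:
    """Hypothetical policy (not Spark code) for when to resubmit a failed stage."""

    delay_ms: int = 200         # existing fixed delay, made configurable here
    min_failures: int = 1       # fetch failures to wait for before resubmitting
    max_wait_ms: int = 10_000   # assumed cap so a lone failure is not stuck forever

    def should_resubmit(self, failure_times_ms: List[int], now_ms: int) -> bool:
        if not failure_times_ms:
            return False
        # Measure from the first failure so later failures join the same retry.
        elapsed = now_ms - min(failure_times_ms)
        if elapsed < self.delay_ms:
            return False
        if len(failure_times_ms) >= self.min_failures:
            return True
        # Too few failures so far: resubmit anyway once the cap is reached.
        return elapsed >= self.max_wait_ms
```

With the defaults this reduces to "resubmit 200 ms after the first fetch failure". Raising `min_failures` gives the "wait for multiple failures" variant, with `max_wait_ms` as a safety net.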