Github user squito commented on the issue:

    https://github.com/apache/spark/pull/17088
  
    First, I think we should change the hard-coded limit of 4 stage retries.  
It's clear to me there is an important reason why users would want a higher 
limit, so let's make it a config.  That is a very simple change.  (That doesn't 
mean we shouldn't be changing something else as well.)
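
A rough sketch of what "make it a config" could look like, written in Python for brevity rather than as actual Spark code. The key name `example.scheduler.maxStageAttempts` and both helper functions are hypothetical; the only value taken from the comment above is the current hard-coded limit of 4, used here as the default.

```python
# Hypothetical config key, invented for this sketch; not a real Spark setting.
MAX_STAGE_ATTEMPTS_KEY = "example.scheduler.maxStageAttempts"
# The currently hard-coded limit becomes the default value.
DEFAULT_MAX_STAGE_ATTEMPTS = 4


def max_stage_attempts(conf):
    """Read the retry limit from a dict-like conf, falling back to the default."""
    value = int(conf.get(MAX_STAGE_ATTEMPTS_KEY, DEFAULT_MAX_STAGE_ATTEMPTS))
    if value < 1:
        raise ValueError(f"{MAX_STAGE_ATTEMPTS_KEY} must be >= 1, got {value}")
    return value


def should_abort_stage(failed_attempts, conf):
    """Abort once the number of failed attempts reaches the configured limit."""
    return failed_attempts >= max_stage_attempts(conf)
```

With an empty conf the behavior matches the existing limit; a user who needs more retries would only set the one key.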
    
    As with https://github.com/apache/spark/pull/17113, though this is a big 
change, it seems to actually be more consistent for Spark.  Of course some 
failures are transient, but (as has already been pointed out) (a) even the 
existing behavior will make you do unnecessary work for transient failures and 
(b) this just slightly increases the amount of work that has to be repeated for 
those transient failures.
    
    I'm also wondering if there are other options, e.g.:
    * removing all output only if there are multiple fetch failures, spread out 
far enough in time
    * waiting to retry a stage, instead of retrying after one fetch failure. 
        * there is already a 200 ms delay before the stage gets retried -- we 
could make that configurable 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1317
        * or wait for multiple failures
    * multiple running tasksets for one stage (as @sitalkedia has proposed 
doing separately)
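
The first option in the list (remove all output only after multiple fetch failures that are spread out in time) could be sketched roughly as below. This is a Python illustration, not Spark scheduler code: the class `FetchFailureTracker`, its method names, and the default thresholds (2 failures, 5 seconds apart) are all assumptions made up for the example.

```python
class FetchFailureTracker:
    """Tracks fetch-failure timestamps per host (hypothetical helper)."""

    def __init__(self, min_failures=2, min_spread_s=5.0):
        # Both thresholds are illustrative defaults, not values from Spark.
        self.min_failures = min_failures
        self.min_spread_s = min_spread_s
        self._times = {}  # host -> list of failure timestamps in seconds

    def record(self, host, now_s):
        """Remember that a fetch from `host` failed at time `now_s`."""
        self._times.setdefault(host, []).append(now_s)

    def should_remove_all_output(self, host):
        """True only if enough failures were seen and they span enough time."""
        times = self._times.get(host, [])
        if len(times) < self.min_failures:
            return False
        return max(times) - min(times) >= self.min_spread_s
```

Under this policy, a burst of failures within a fraction of a second is treated as possibly transient, while failures that persist over several seconds trigger removal of all output for that host.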

