Github user squito commented on the pull request:
https://github.com/apache/spark/pull/5636#issuecomment-95335417
Thanks @ilganeli, I took a quick look and have some high-level comments:
* Checking for the exact same string is too restrictive IMO. E.g., will the
failure message include the host names of the failed fetch? Even without that
detail, I bet there are a lot of cases where the same real error can result in
different messages. (See the first sketch after this list.)
* I think the count for stage failures should be reset every time the stage
completes successfully. If you have a really long-running job, I could imagine
some stage that is a dependency of many downstream stages (e.g., imagine a
streaming job, where some common RDD is joined against lots of incoming
batches). On a big cluster, eventually nodes will go down and stages will fail,
but as long as the subsequent retry works, everything is fine. Over time, that
same stage might fail a number of times, but as long as there is no more than
one failure between each success, that would be completely normal (even
expected, to some extent). The second sketch after this list shows one way to
do this.
* Maybe we should still allow the old behavior of infinite retry, e.g., if
`spark.stage.maxFailures` is set to -1? Though to be honest I can't really
think of any reason you'd want infinite retry; I just wonder if we should
leave the door open, since this is a behavior change. (The second sketch
below also covers this.)
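To illustrate the first point, here's a minimal, self-contained sketch (all names here are hypothetical, not the PR's actual code) of classifying failures by kind and stable identifiers rather than by exact message text, so host-specific details don't make identical errors look distinct:

```scala
// Hypothetical failure representation, for illustration only.
sealed trait StageFailure
case class FetchFailure(host: String, port: Int, shuffleId: Int) extends StageFailure
case class ExecutorLost(execId: String) extends StageFailure

// Compare failures by kind and stable identifiers, ignoring host-specific
// details that vary across retries of the same underlying problem.
def sameUnderlyingFailure(a: StageFailure, b: StageFailure): Boolean = (a, b) match {
  case (FetchFailure(_, _, s1), FetchFailure(_, _, s2)) => s1 == s2
  case _ => a.getClass == b.getClass
}

// Two fetch failures from different hosts still count as the same error:
val x = FetchFailure("host-1", 7337, shuffleId = 0)
val y = FetchFailure("host-2", 7337, shuffleId = 0)
assert(sameUnderlyingFailure(x, y))
```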
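And for the second and third points, a rough sketch (again, made-up names, just one possible way to interpret "reset on success") of counting only consecutive failures per stage and treating -1 as infinite retry:

```scala
import scala.collection.mutable

// maxFailures == -1 means "never abort", preserving the old infinite-retry behavior.
class StageFailureTracker(maxFailures: Int) {
  private val consecutiveFailures = mutable.Map.empty[Int, Int].withDefaultValue(0)

  // Record a failure; returns true if the job should be aborted.
  def onStageFailed(stageId: Int): Boolean = {
    consecutiveFailures(stageId) += 1
    maxFailures != -1 && consecutiveFailures(stageId) >= maxFailures
  }

  // A successful completion resets the count, so sporadic failures spread
  // over a long-running job never accumulate toward the limit.
  def onStageCompleted(stageId: Int): Unit =
    consecutiveFailures -= stageId
}
```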
Thanks for working on this! This will be a great addition. I've seen this
come up in a number of cases, and it's really hard for the average user to
figure out what is going on; this will be a big help.