Github user squito commented on the pull request:
https://github.com/apache/spark/pull/5636#issuecomment-95335417
Thanks @ilganeli, I took a quick look and have some high-level comments:
* Checking for the exact same string is too restrictive IMO. E.g., will the
failure message include the host names of the failed fetch? Even without that
detail, I bet there are a lot of cases where the same real error can result in
different messages. (See the first sketch after this list.)
* I think the count for stage failures should be reset every time the stage
completes successfully. If you have a really long-running job, I could imagine
some stage that is a dependency of many downstream stages (e.g., imagine a
streaming job, where some common RDD is joined against lots of incoming
batches). On a big cluster, eventually nodes will go down and stages will fail,
but as long as the subsequent retry works, everything is fine. Over time, that
same stage might fail a number of times, but as long as there is no more than
one failure between each success, that would be completely normal (even
expected, to some extent). The second sketch after this list shows one way to
do this.
* Maybe we should still allow the old behavior of infinite retry, e.g., if
`spark.stage.maxFailures` is set to -1? Though to be honest I can't really
think of any reason you'd want infinite retry; I just wonder if we should
leave the door open, since this is a behavior change. (The second sketch
below also covers this.)
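To illustrate the first point, here's a minimal, self-contained sketch (all names here are hypothetical, not the PR's actual code) of classifying failures by kind and stable identifiers rather than by exact message text, so host-specific details don't make identical errors look distinct:

```scala
// Hypothetical failure representation, for illustration only.
sealed trait StageFailure
case class FetchFailure(host: String, port: Int, shuffleId: Int) extends StageFailure
case class ExecutorLost(execId: String) extends StageFailure

// Compare failures by kind and stable identifiers, ignoring host-specific
// details that vary across retries of the same underlying problem.
def sameUnderlyingFailure(a: StageFailure, b: StageFailure): Boolean = (a, b) match {
  case (FetchFailure(_, _, s1), FetchFailure(_, _, s2)) => s1 == s2
  case _ => a.getClass == b.getClass
}

// Two fetch failures from different hosts still count as the same error:
val x = FetchFailure("host-1", 7337, shuffleId = 0)
val y = FetchFailure("host-2", 7337, shuffleId = 0)
assert(sameUnderlyingFailure(x, y))
```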
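And for the second and third points, a rough sketch (again, made-up names, just one possible way to interpret "reset on success") of counting only consecutive failures per stage and treating -1 as infinite retry:

```scala
import scala.collection.mutable

// maxFailures == -1 means "never abort", preserving the old infinite-retry behavior.
class StageFailureTracker(maxFailures: Int) {
  private val consecutiveFailures = mutable.Map.empty[Int, Int].withDefaultValue(0)

  // Record a failure; returns true if the job should be aborted.
  def onStageFailed(stageId: Int): Boolean = {
    consecutiveFailures(stageId) += 1
    maxFailures != -1 && consecutiveFailures(stageId) >= maxFailures
  }

  // A successful completion resets the count, so sporadic failures spread
  // over a long-running job never accumulate toward the limit.
  def onStageCompleted(stageId: Int): Unit =
    consecutiveFailures -= stageId
}
```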
Thanks for working on this! This will be a great addition. I've seen this
come up in a number of cases, and it's really hard for the average user to
figure out what is going on; this will be a big help.