Github user squito commented on a diff in the pull request:
https://github.com/apache/spark/pull/5636#discussion_r29312240
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -984,6 +984,11 @@ class DAGScheduler(
         job.numFinished += 1
         // If the whole job has finished, remove it
         if (job.numFinished == job.numPartitions) {
+          // Clear failure count for this stage, now that it's succeeded.
+          // This ensures that even if subsequent stages fail, triggering
+          // a recompute of this stage, we abort because of those failures.
+          stage.clearFailures()
--- End diff --
also, is that comment supposed to say "... we do not abort ..."? What do
you think of rewording it to something like:

We only limit *consecutive* failures of stage attempts, so clear the
failure count on success. This is just in case this stage is a dependency
for *lots* of downstream stages, e.g. in a long-running streaming app --
we don't want to penalize the stage if it fails rarely but is run often
enough that its total failure count goes over the limit.

(this isn't perfect either ...)
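
The consecutive-failure idea above could be sketched roughly like this (a
minimal standalone model with hypothetical names -- this is not the real
DAGScheduler/Stage API, just an illustration of clearing the counter on
success so only consecutive failures count toward the abort limit):

```scala
// Hypothetical tracker: counts only *consecutive* stage-attempt failures.
class StageFailureTracker(maxConsecutiveFailures: Int = 4) {
  private var consecutiveFailures = 0

  // Called when a stage attempt fails; returns true if we should abort.
  def recordFailure(): Boolean = {
    consecutiveFailures += 1
    consecutiveFailures >= maxConsecutiveFailures
  }

  // Called when the stage succeeds: reset the counter, so a long-running
  // job that recomputes this stage many times is not penalized for rare,
  // non-consecutive failures pushing the *total* count over the limit.
  def recordSuccess(): Unit = {
    consecutiveFailures = 0
  }

  def failures: Int = consecutiveFailures
}

object StageFailureTrackerDemo {
  def main(args: Array[String]): Unit = {
    val tracker = new StageFailureTracker(maxConsecutiveFailures = 4)
    tracker.recordFailure()   // 1st failure
    tracker.recordFailure()   // 2nd failure
    tracker.recordSuccess()   // success clears the count
    val abort = tracker.recordFailure() // only 1 consecutive failure now
    println(s"failures=${tracker.failures}, abort=$abort")
  }
}
```

With a total-failure counter the fourth failure above would abort; with the
success-clearing version it does not, which is the behavior the proposed
comment is trying to describe.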
---