Github user squito commented on a diff in the pull request:
https://github.com/apache/spark/pull/5636#discussion_r29487851
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
---
@@ -1085,6 +1085,10 @@ class DAGScheduler(
if (disallowStageRetryForTest) {
abortStage(failedStage, "Fetch failure will not retry stage due
to testing config")
+ } else if (failedStage.failAndShouldAbort()) {
--- End diff --
I think only handling fetch failures is right -- fetch failures are special-cased
[elsewhere](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L660),
to avoid the normal rule that 4 task failures lead to job failure, which is how we
get into the infinite stage retry loop in the first place.
But you're totally right about the issue being just 4 task failures, not 4
stage attempt failures -- I did try this out on a workload with more failures,
and the job fails after 4 tasks fail, not after 4 failed stage attempts.
Unfortunately, I think this is going to make it much harder to solve. After
fetch failures, you can easily end up with multiple concurrent attempts for the
same stage. I don't see an easy workaround, since in the `FetchFailed`, you
don't know which attempt it came from, so it's not easy to track the unique set
of failed attempts here. Maybe this would be possible in `TaskSchedulerImpl`,
where you know the attempt as well. Any other ideas? Too bad about this, but
thanks for pointing out this issue.
(I think separately we should probably change the fact that stages can have
concurrent attempts, but it would be nice to fix this bug without tackling that.)