GitHub user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5636#discussion_r29487851
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1085,6 +1085,10 @@ class DAGScheduler(
     
             if (disallowStageRetryForTest) {
               abortStage(failedStage, "Fetch failure will not retry stage due to testing config")
    +        } else if (failedStage.failAndShouldAbort()) {
    --- End diff --
    
    I think handling only fetch failures is right. Fetch failures are
special-cased
[elsewhere](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L660)
so that they don't count toward the normal rule that 4 task failures fail the
job, which is how we get into the infinite stage retry loop in the first place.
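
    To make the special-casing concrete, here is a minimal sketch of a failure
counter that skips fetch failures. The types and names (`FailureReason`,
`TaskFailureCounter`, `recordAndShouldFailJob`) are hypothetical and are not
Spark's classes; the default limit of 4 is taken from the discussion above.

```scala
// Hypothetical failure reasons; illustrative types only, not Spark's classes.
sealed trait FailureReason
case object FetchFailure extends FailureReason
case object OtherFailure extends FailureReason

// Hypothetical per-task failure counter that skips fetch failures.
final class TaskFailureCounter(maxTaskFailures: Int = 4) {
  private var failures = 0

  /** Returns true once enough counted failures have accumulated to fail the job. */
  def recordAndShouldFailJob(reason: FailureReason): Boolean = reason match {
    case FetchFailure =>
      false // not counted toward the limit; the stage is retried instead
    case OtherFailure =>
      failures += 1
      failures >= maxTaskFailures
  }
}
```

With a counter like this, any number of fetch failures never fails the job on
its own, so something else has to bound the stage retries.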
    
    But you're totally right about the issue with counting 4 task failures
rather than 4 stage attempt failures. I did try this out on a workload with more
failures, and the job fails after 4 task failures, not after 4 failed stage
attempts. Unfortunately, I think this is going to make it much harder to solve.
After fetch failures, you can easily end up with multiple concurrent attempts
for the same stage. I don't see an easy workaround, since in the `FetchFailed`
case you don't know which attempt the failure came from, so it's not easy to
track the unique set of failed attempts here. Maybe this would be possible in
`TaskSchedulerImpl`, where you know the attempt as well. Any other ideas? Too
bad about this, but thanks for pointing out this issue.
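
    As a rough sketch of what tracking the unique set of failed attempts could
look like, assuming the attempt ID were available at the point of failure
(which, as noted, it is not in the `FetchFailed` case today).
`FailedAttemptTracker` and its parameter names are hypothetical, and this
`failAndShouldAbort` takes an attempt ID, unlike the one in the diff:

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Spark: records which attempts of a single
// stage have seen a fetch failure. Assumes the caller knows the attempt ID.
final class FailedAttemptTracker(maxFailedAttempts: Int = 4) {
  private val failedAttempts = mutable.HashSet.empty[Int]

  /** Records a fetch failure; returns true once enough distinct attempts have failed. */
  def failAndShouldAbort(stageAttemptId: Int): Boolean = {
    failedAttempts += stageAttemptId // repeated failures from one attempt count once
    failedAttempts.size >= maxFailedAttempts
  }

  /** Number of distinct attempts that have reported at least one fetch failure. */
  def distinctFailures: Int = failedAttempts.size
}
```

Because the set is keyed by attempt ID, four failing tasks from the same
attempt count as one failed attempt, and concurrent attempts are counted
separately.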
    
    (Separately, I think we should probably change the fact that stages can
have concurrent attempts, but it would be nice to fix this issue without having
to address that.)

