Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5636#discussion_r29488352
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1085,6 +1085,10 @@ class DAGScheduler(
     
             if (disallowStageRetryForTest) {
              abortStage(failedStage, "Fetch failure will not retry stage due to testing config")
    +        } else if (failedStage.failAndShouldAbort()) {
    --- End diff ---
    
    Ah, I see your point re: fetch failures. Put differently, fetch failures are
the only case where we'd have a stage fail and then have reason to retry it, so
it's the only case where this logic even makes sense.
    
    I didn't realize you can end up with multiple concurrent attempts for the 
same stage; what's the code path that leads to that happening?  Doesn't the 
submitStage function make sure the stage isn't already running before 
submitting it?
    
    I was imagining an easy fix to the issue I mentioned: check whether the
failedStage is already in failedStages, and only increment the failure counter
if it isn't already there. However, I don't think that works, because another
FetchFailed exception for the same stage attempt could come in *after*
resubmitFailedStages has been called (which clears failedStages). Can we
instead keep a set of the failed stage attempt IDs in the Stage object (since
we have access to the stage attempt ID via Stage.latestInfo)?
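    Concretely, I'm imagining something along these lines (just a rough sketch,
not the actual code; the field name, the method signature, and the abort
threshold are all placeholders):

```scala
import scala.collection.mutable.HashSet

// Sketch: the Stage remembers which attempt IDs have already hit a fetch
// failure, so a late FetchFailed from an already-counted attempt can't
// inflate the count after failedStages has been cleared.
class Stage(val id: Int) {
  // Placeholder threshold: abort after this many distinct failed attempts.
  private val maxConsecutiveFetchFailures = 4

  // Attempt IDs for which we've already recorded a fetch failure.
  private val fetchFailedAttemptIds = new HashSet[Int]

  /** Record a fetch failure for the given attempt and return true if the
    * stage should now be aborted. */
  def failAndShouldAbort(stageAttemptId: Int): Boolean = {
    fetchFailedAttemptIds += stageAttemptId
    fetchFailedAttemptIds.size >= maxConsecutiveFetchFailures
  }
}
```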

