Github user ilganeli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5636#discussion_r29615914
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
    @@ -1085,6 +1085,10 @@ class DAGScheduler(
     
             if (disallowStageRetryForTest) {
               abortStage(failedStage, "Fetch failure will not retry stage due 
to testing config")
    +        } else if (failedStage.failAndShouldAbort()) {
    --- End diff ---
    
    All - I realized that simply counting attemptIds will not be enough. There are two scenarios:
    1) Concurrent failures of a FetchFailed task within a single stage attempt
    2) Sequential failures of a stage because the same task fails on each attempt
    
    If all we cared about was counting the number of distinct concurrent failures, keeping a Set would suffice. However, we can't use attemptId alone because it's reset between sequential stage executions, e.g. between attempt 1 and attempt 2.
    
    Thus, I think the solution is to keep a ```HashMap[StageFailureCount, StageAttemptIds] hashMap```. The logic for determining whether to abort is then either a) ```hashMap.size() > 4``` (too many sequential stage failures), OR
    b) ```hashMap(i).size() > 4``` for some failure count ```i``` (too many concurrent failures within one execution).
    
    Does this seem reasonable? The above scenario came up when I was running my two tests (which simulate conditions (1) and (2)).
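    
    For concreteness, here's a rough sketch of the bookkeeping I have in mind. The class and method names (```FailureTracker```, ```maxFailures```) and the threshold of 4 are illustrative placeholders, not existing DAGScheduler API:
    
    ```scala
    import scala.collection.mutable
    
    // Sketch only: track fetch failures per stage so we can abort when either
    // too many sequential stage attempts fail or too many tasks fail
    // concurrently within one attempt.
    class FailureTracker(maxFailures: Int = 4) {
      // Keyed by the stage's sequential failure count; each entry holds the
      // distinct task attempt IDs that hit a FetchFailed during that execution.
      private val failures = mutable.HashMap.empty[Int, mutable.HashSet[Long]]
    
      /** Record a fetch failure and return true if the stage should be aborted. */
      def failAndShouldAbort(stageFailureCount: Int, taskAttemptId: Long): Boolean = {
        val attemptIds =
          failures.getOrElseUpdate(stageFailureCount, mutable.HashSet.empty[Long])
        attemptIds += taskAttemptId
        // (a) too many sequential failures OR (b) too many concurrent failures
        failures.size > maxFailures || attemptIds.size > maxFailures
      }
    }
    ```
    
    With something like this, the check in the fetch-failure handling path would reduce to a single ```failAndShouldAbort(...)``` call.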


