Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5636#discussion_r29501457
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -1085,6 +1085,10 @@ class DAGScheduler(
     
             if (disallowStageRetryForTest) {
               abortStage(failedStage, "Fetch failure will not retry stage due to testing config")
    +        } else if (failedStage.failAndShouldAbort()) {
    --- End diff --
    
    The trick is, when you get the `FetchFailed`, you don't know which attempt 
that exception came from (unless I'm missing it?).  You only know the latest 
attempt for the stage.  Just looking at the latest attempt can lead to 
miscounting.  Say attempt 0 has multiple failures -- the first failure will 
have you count attempt 0, and lead to attempt 1 getting submitted.  But then 
more failures from attempt 0 come in after that.  If you look at the current 
attempt, you'd end up counting a failure for attempt 1 as well.
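
    To make the over-counting concrete, here is a minimal, self-contained sketch.  The names (`latestAttemptId`, `failedAttempts`, `handleFetchFailed`) are made up for illustration; this is not DAGScheduler code, just the counting scheme in isolation:

    ```scala
    // Naive scheme: every FetchFailed is charged to whatever the latest attempt is.
    object MiscountSketch extends App {
      var latestAttemptId = 0   // the only attempt id we have when a failure arrives
      var failedAttempts = 0    // naive count of "attempts that have failed"

      def handleFetchFailed(): Unit = {
        failedAttempts += 1     // blamed on latestAttemptId, wherever it really came from
        latestAttemptId += 1    // the stage gets resubmitted as a new attempt
      }

      // Attempt 0 is missing two map outputs, so it produces two FetchFailed events.
      handleFetchFailed()       // counted against attempt 0; attempt 1 gets submitted
      handleFetchFailed()       // still from attempt 0, but now counted against attempt 1

      println(s"failedAttempts = $failedAttempts")  // prints 2, though only attempt 0 ever failed
    }
    ```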
    
    It gets much worse with concurrent attempts for the same stage.  Attempts 0 through 5 might all be running, and they could all be generating fetch failures, but you'd only count them against attempt 5.  Eventually you'll count 4 failed attempts, but you might actually go through a lot more attempts in the process.  One alternative might be to track the attempt id the last time the stage was successful, and compare it to the current attempt id on every fetch failure; if it differs by more than 4, then abort.  That feels a little hacky, though, and I worry there are cases I haven't thought of.  Maybe we're better off with the set-based approach despite the miscounting -- it's more understandable anyway.
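
    For reference, here is a rough sketch of the two bookkeeping options above (the set-based one and the attempt-id-gap one).  The class, field, and parameter names are assumptions for illustration, and unlike the `failAndShouldAbort()` in the diff this sketch passes the attempt id in explicitly, just to make the bookkeeping visible:

    ```scala
    import scala.collection.mutable

    // Set-based: remember the distinct attempt ids that have seen a fetch failure
    // and abort once that set grows past a threshold.
    class StageRetryTracker(maxFailedAttempts: Int = 4) {
      private val fetchFailedAttemptIds = mutable.HashSet[Int]()

      // Record a fetch failure seen while `attemptId` was the latest attempt;
      // returns true if the stage should be aborted.
      def failAndShouldAbort(attemptId: Int): Boolean = {
        fetchFailedAttemptIds += attemptId    // repeat failures from the same id collapse
        fetchFailedAttemptIds.size >= maxFailedAttempts
      }

      // A successful run clears the history, so only consecutive failures count.
      def clearFailures(): Unit = fetchFailedAttemptIds.clear()
    }

    // Gap-based alternative: remember the attempt id at the last success and abort
    // once the current attempt id has moved more than `maxGap` past it.
    class AttemptGapTracker(maxGap: Int = 4) {
      private var attemptIdAtLastSuccess = 0

      def onSuccess(currentAttemptId: Int): Unit = {
        attemptIdAtLastSuccess = currentAttemptId
      }

      def shouldAbort(currentAttemptId: Int): Boolean =
        currentAttemptId - attemptIdAtLastSuccess > maxGap
    }
    ```

    Note that neither sketch fixes the attribution problem described above; they only differ in how the failure history is kept.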
    
    Yeah, multiple concurrent attempts surprised me too.  It's not just theoretical -- we've seen this from customers, with lots of attempts.  In fact it's one of the most important reasons I think we need to limit the number of stage retries.  The strange thing is, when I reproduce it on my own, somehow things just work out, so I thought maybe this was by design and I was still trying to make sense of it.  But if you are surprised as well, maybe this is a serious bug and it's just dumb luck that it happens to work -- I'll open a JIRA shortly and be sure to cc you.  (And though it works out OK in my test reproduction, in the customer cases there was some really crazy behavior, though I haven't been able to figure out yet whether it's related.)

