Github user squito commented on the pull request:

    https://github.com/apache/spark/pull/5636#issuecomment-125351990
  
    I think we need two more tests.  One of these may have been the original 
intent of the test case that I said shouldn't be added (or maybe that test 
should be added and I'm forgetting something).
    
    1) If a single stage attempt has a bunch of fetch failures, we should 
still proceed with another attempt for that stage.  E.g., add a test that is 
like your existing ones but:
    * stage 1 has 8 tasks: `val reduceRdd = new MyRDD(sc, 8, List(shuffleDep))` 
(I think; you should confirm this by looking at the task set and seeing how 
many tasks it has)
    * make attempt 0 of stage 1 fail **all 8** tasks with fetch failures
    * have attempt 1 of stage 0 & stage 1 complete with no failures
    * everything should be happy
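
    To make the expected behavior in (1) concrete, here is a minimal sketch of 
the bookkeeping such a test would exercise.  It is a toy model, not the 
DAGScheduler: `StageFailureTracker` and its methods are made-up names, and it 
assumes the limit counts failed *attempts* rather than failed tasks.

```scala
// Toy model only: not Spark's DAGScheduler. All names here are hypothetical.
object AttemptFailureModel {
  // The failure limit mentioned in the comment (assumed to be per stage).
  val MaxStageFailures = 4

  // Remembers which attempts of a single stage saw at least one fetch failure.
  final class StageFailureTracker {
    private val failedAttempts = scala.collection.mutable.Set.empty[Int]

    // Called once per failed task. Because attempts are stored in a set,
    // repeated failures from the same attempt collapse into a single entry.
    def onFetchFailure(attemptId: Int): Unit = failedAttempts += attemptId

    def failedAttemptCount: Int = failedAttempts.size

    def shouldAbort: Boolean = failedAttemptCount >= MaxStageFailures
  }

  def main(args: Array[String]): Unit = {
    val tracker = new StageFailureTracker
    // Attempt 0 of stage 1 fails all 8 tasks with fetch failures.
    (0 until 8).foreach(_ => tracker.onFetchFailure(attemptId = 0))
    println(s"failed attempts: ${tracker.failedAttemptCount}, abort: ${tracker.shouldAbort}")
  }
}
```

    Under that assumption, 8 task failures from attempt 0 count as one failed 
attempt, so the stage is retried instead of aborted.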
    
    2) If a stage fails a few times with fetch failures, then succeeds, then 
fails again a few times, it should be allowed to go over the limit of 4 
failures in total.  This is a little more complicated: you will need to add a 
test with 3 stages.  You can follow [this 
example](https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala#L808).
  Then you would:
    * complete stage 0 successfully
    * go through some iterations of failing stage 1, retrying stage 0 & stage 
1, so that you get 3 failures total of stage 1.
    * on attempt 4 of stage 1, have it succeed
    * then try stage 2, but it should have one fetch failure
    * that will result in a 5th attempt for stage 1.  Go through another round 
of failures for stage 1, retrying stage 0 & 1, so that you get *another* 3 
rounds of fetch failures on stage 1.
    * Finally complete the next attempt on stage 1 successfully.  (I guess 
that's 8 attempts total.)
    * Then complete the next attempt of stage 2 successfully.  Then check the 
job has completed successfully.
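
    The sequence in (2) can be sketched with a small counter model.  This is 
only an illustration under the assumption that the limit applies to 
*consecutive* failed attempts and resets on a successful attempt; `StageState` 
and its members are hypothetical names, not anything from the DAGScheduler.

```scala
// Toy model only: not Spark's DAGScheduler. All names here are hypothetical.
object ConsecutiveFailureModel {
  // Assumed limit on consecutive failed attempts of one stage.
  val MaxConsecutiveFailures = 4

  final class StageState {
    private var consecutive = 0
    private var total = 0
    private var attemptCount = 0
    private var abortedFlag = false

    // Record the outcome of one stage attempt.
    def recordAttempt(succeeded: Boolean): Unit = {
      attemptCount += 1
      if (succeeded) {
        consecutive = 0 // a success resets the consecutive-failure count
      } else {
        consecutive += 1
        total += 1
        if (consecutive >= MaxConsecutiveFailures) abortedFlag = true
      }
    }

    def attempts: Int = attemptCount
    def totalFailures: Int = total
    def aborted: Boolean = abortedFlag
  }

  def main(args: Array[String]): Unit = {
    val stage1 = new StageState
    // 3 fetch failures, a success, 3 more fetch failures, a final success.
    val outcomes = Seq(false, false, false, true, false, false, false, true)
    outcomes.foreach(ok => stage1.recordAttempt(ok))
    println(s"attempts=${stage1.attempts} failures=${stage1.totalFailures} aborted=${stage1.aborted}")
  }
}
```

    In this model stage 1 sees 6 failures over 8 attempts, more than 4 in 
total, but never 4 in a row, so the job is not aborted.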
    
    
    Thanks for picking this up again, Ilya.  Sorry it's a lot of work writing 
these tests, but this is really appreciated.

