Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16901#discussion_r101061681
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
    @@ -2161,6 +2161,63 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with Timeou
         }
       }
     
    +  test("After fetching failed, success of old attempt of stage should be 
taken as valid.") {
    +    // Create 3 RDDs with shuffle dependencies on each other: rddA <--- 
rddB <--- rddC
    +    val rddA = new MyRDD(sc, 2, Nil)
    +    val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2))
    +    val shuffleIdA = shuffleDepA.shuffleId
    +
    +    val rddB = new MyRDD(sc, 2, List(shuffleDepA))
    +    val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2))
    +
    +    val rddC = new MyRDD(sc, 2, List(shuffleDepB))
    +
    +    submit(rddC, Array(0, 1))
    +    assert(taskSets(0).stageId === 0 && taskSets(0).stageAttemptId === 0)
    +
    +    // Complete both tasks in rddA.
    +    complete(taskSets(0), Seq(
    +      (Success, makeMapStatus("hostA", 2)),
    +      (Success, makeMapStatus("hostA", 2))))
    +
    +    // Fetch failed on hostA for task(partitionId=0); task(partitionId=1) is still running.
    +    runEvent(makeCompletionEvent(
    +      taskSets(1).tasks(0),
    +      FetchFailed(makeBlockManagerId("hostA"), shuffleIdA, 0, 0,
    +        "Fetch failure of task: stageId=1, stageAttempt=0, partitionId=0"),
    +      result = null))
    +
    +    // Both original tasks in rddA should be marked as failed, because they ran on the
    +    // failed hostA, so both should be resubmitted. Complete them successfully.
    +    scheduler.resubmitFailedStages()
    +    assert(taskSets(2).stageId === 0 && taskSets(2).stageAttemptId === 1
    +      && taskSets(2).tasks.size === 2)
    +    complete(taskSets(2), Seq(
    +      (Success, makeMapStatus("hostB", 2)),
    +      (Success, makeMapStatus("hostB", 2))))
    +
    +    // Both tasks in rddB should be resubmitted, because neither of them has succeeded.
    +    // Complete task(partitionId=0) successfully; task(partitionId=1) is still running.
    +    assert(taskSets(3).stageId === 1 && taskSets(3).stageAttemptId === 1 &&
    +      taskSets(3).tasks.size === 2)
    +    runEvent(makeCompletionEvent(
    +      taskSets(3).tasks(0), Success, makeMapStatus("hostB", 2)))
    +
    +    // Complete task(partitionId=1) from the old attempt (stageId=1, stageAttempt=0)
    +    // successfully.
    +    runEvent(makeCompletionEvent(
    +      taskSets(1).tasks(1), Success, makeMapStatus("hostB", 2)))
    +
    +    // Thanks to the success from the old attempt of stage(stageId=1, stageAttempt=0),
    +    // there are no pending partitions for stage(stageId=1) now, so the downstream stage
    +    // should be submitted, even though there is still a running task(stageId=1,
    +    // stageAttempt=1, partitionId=1) in the active stage attempt.
    +    assert(taskSets.size === 5 && taskSets(4).tasks(0).isInstanceOf[ResultTask[_, _]])
    +    complete(taskSets(4), Seq(
    --- End diff ---
    
    I was going to suggest adding a check here to make sure that all prior
    TaskSetManagers are marked as zombies.  But (a) you can't check that, since
    the DAGScheduler doesn't have a handle on the TaskSetManagers, and (b) more
    importantly, the prior TSMs actually are *not* marked as zombies.  So they
    may continue to submit tasks even though those tasks are no longer necessary.
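    
    To make the mechanism concrete, here's a minimal, self-contained sketch --
    emphatically not Spark's real TaskSetManager, just a toy model of its
    isZombie flag: a zombie TSM keeps tracking the tasks it already launched,
    but refuses to hand out new ones.
    
        class SimpleTaskSetManager(val stageId: Int, val stageAttemptId: Int, numTasks: Int) {
          // Once a TSM becomes a zombie it must stop launching new tasks,
          // though tasks it already launched keep running to completion.
          var isZombie: Boolean = false
          private var nextTask = 0
    
          /** Offer a free slot; returns the index of a task to launch, or None. */
          def resourceOffer(): Option[Int] = {
            if (isZombie || nextTask >= numTasks) {
              None // a zombie never launches anything new
            } else {
              val chosen = nextTask
              nextTask += 1
              Some(chosen)
            }
          }
        }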
    
    I will file a separate bug for that -- it's a performance issue, not a
    correctness issue, so not critical.  But this isn't the same as the old
    "kill running tasks when marking a TSM as a zombie" issue -- in this case,
    the problem is that the TSM may continue to launch *new* tasks.
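    
    One shape the fix could take (again a hypothetical sketch, reusing
    SimpleTaskSetManager from above; the index below only imitates the
    stageId/attempt bookkeeping that TaskSchedulerImpl keeps internally): once
    every partition of a stage has output, walk *all* attempts of that stage
    and zombie them, so a superseded attempt stops launching now-unnecessary
    tasks.
    
        import scala.collection.mutable
    
        class SimpleTaskScheduler {
          // stageId -> (stageAttemptId -> TSM), so old and new attempts of the
          // same stage are visible side by side.
          private val taskSetsByStageIdAndAttempt =
            mutable.HashMap.empty[Int, mutable.HashMap[Int, SimpleTaskSetManager]]
    
          def registerTaskSet(tsm: SimpleTaskSetManager): Unit = {
            taskSetsByStageIdAndAttempt
              .getOrElseUpdate(tsm.stageId, mutable.HashMap.empty)
              .put(tsm.stageAttemptId, tsm)
          }
    
          // Hypothetical hook: called once all partitions of stageId have output.
          // Marks every attempt zombie, including a still-running old attempt.
          def markStageFinished(stageId: Int): Unit = {
            for {
              attempts <- taskSetsByStageIdAndAttempt.get(stageId)
              tsm <- attempts.values
            } tsm.isZombie = true
          }
        }
    
    In the test scenario above, that would zombie both attempt 0 and attempt 1
    of stage 1 as soon as the old attempt's task(partitionId=1) supplied the
    last map output, instead of leaving attempt 1 free to keep launching tasks.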

