jiang13021 created SPARK-48847:
----------------------------------

             Summary: Resubmitting a stage without verifying the stage attempt 
number may result in an infinite loop
                 Key: SPARK-48847
                 URL: https://issues.apache.org/jira/browse/SPARK-48847
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 3.5.1, 3.3.2, 3.4.2, 3.2.2
            Reporter: jiang13021


In org.apache.spark.scheduler.DAGScheduler#processShuffleMapStageCompletion
{code:java}
private def processShuffleMapStageCompletion(shuffleStage: ShuffleMapStage): Unit = {
    // some code ... 
    
    if (!shuffleStage.isAvailable) {
      // Some tasks had failed; let's resubmit this shuffleStage.
      // TODO: Lower-level scheduler should also deal with this
      logInfo(log"Resubmitting ${MDC(STAGE, shuffleStage)} " +
        log"(${MDC(STAGE_NAME, shuffleStage.name)}) " +
        log"because some of its tasks had failed: " +
        log"${MDC(PARTITION_IDS, shuffleStage.findMissingPartitions().mkString(", "))}")
      submitStage(shuffleStage) // resubmit without check
    } else {
      markMapStageJobsAsFinished(shuffleStage)
      submitWaitingChildStages(shuffleStage)
    } 
}{code}
The code above shows that the DAGScheduler resubmits the stage directly, 
without checking whether the stage attempt number has reached 
maxConsecutiveStageAttempts. If the resubmitted stage keeps ending up with 
missing partitions, it is resubmitted again every time, which can result in an 
infinite loop.
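
One possible shape for such a check is sketched below. It is a standalone model, not a patch against DAGScheduler: SketchStage, ResubmitGuardSketch, submitWithGuard and the Outcome types are hypothetical names invented for illustration, and the limit is passed in as a plain parameter rather than read from Spark's configuration.

```scala
import scala.annotation.tailrec

// Hypothetical stand-in for a shuffle map stage; not Spark's ShuffleMapStage.
// `availableFromAttempt` is the attempt on which all outputs become available
// (None means the stage never becomes available).
final class SketchStage(val id: Int, availableFromAttempt: Option[Int]) {
  private var attempts = 0

  def attemptNumber: Int = attempts

  // Runs one attempt and reports whether the stage is now available.
  def runAttempt(): Boolean = {
    attempts += 1
    availableFromAttempt.exists(attempts >= _)
  }
}

object ResubmitGuardSketch {
  sealed trait Outcome
  case object Finished extends Outcome
  final case class Aborted(reason: String) extends Outcome

  // Resubmits the stage only while the attempt count is below the limit.
  // Without the second branch, a stage that never becomes available would
  // be resubmitted forever.
  @tailrec
  def submitWithGuard(stage: SketchStage, maxConsecutiveStageAttempts: Int): Outcome = {
    if (stage.runAttempt()) {
      Finished
    } else if (stage.attemptNumber >= maxConsecutiveStageAttempts) {
      Aborted(s"Stage ${stage.id} still has missing partitions after " +
        s"${stage.attemptNumber} attempts")
    } else {
      submitWithGuard(stage, maxConsecutiveStageAttempts)
    }
  }
}
```

In this model, a stage that never becomes available is aborted once the limit is reached instead of being resubmitted indefinitely, while a stage that recovers on a later attempt still finishes normally.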
