jiang13021 created SPARK-48847:
----------------------------------
Summary: Resubmitting a stage without verifying the stage attempt
number may result in an infinite loop
Key: SPARK-48847
URL: https://issues.apache.org/jira/browse/SPARK-48847
Project: Spark
Issue Type: Bug
Components: Scheduler
Affects Versions: 3.5.1, 3.3.2, 3.4.2, 3.2.2
Reporter: jiang13021
In org.apache.spark.scheduler.DAGScheduler#processShuffleMapStageCompletion
{code:java}
private def processShuffleMapStageCompletion(shuffleStage: ShuffleMapStage): Unit = {
  // some code ...
  if (!shuffleStage.isAvailable) {
    // Some tasks had failed; let's resubmit this shuffleStage.
    // TODO: Lower-level scheduler should also deal with this
    logInfo(log"Resubmitting ${MDC(STAGE, shuffleStage)} " +
      log"(${MDC(STAGE_NAME, shuffleStage.name)}) " +
      log"because some of its tasks had failed: " +
      log"${MDC(PARTITION_IDS, shuffleStage.findMissingPartitions().mkString(", "))}")
    submitStage(shuffleStage) // resubmit without check
  } else {
    markMapStageJobsAsFinished(shuffleStage)
    submitWaitingChildStages(shuffleStage)
  }
}
{code}
The code above shows that the DAGScheduler resubmits the stage directly,
without checking whether the stage attempt number has exceeded
maxConsecutiveStageAttempts. If the stage keeps failing after each
resubmission, it is resubmitted again every time, which can result in an
infinite loop.
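
To illustrate the missing check, here is a minimal, self-contained sketch of
a bounded resubmission decision. It is a toy model, not a patch against
DAGScheduler: ToyStage, Decision, decide and runAlwaysFailing are
hypothetical names, and the choice to abort once the attempt count reaches
maxConsecutiveStageAttempts is an assumption about what a fix might look
like.

```scala
// Toy model only: every name here is hypothetical and not part of Spark.
object ResubmitGuardSketch {
  sealed trait Decision
  case object Finished extends Decision
  case object Resubmit extends Decision
  final case class Abort(reason: String) extends Decision

  // attemptsSoFar counts the attempts that have already run for this stage.
  final case class ToyStage(id: Int, isAvailable: Boolean, attemptsSoFar: Int)

  // Decide what to do when a shuffle map stage completes.
  // Assumption: abort instead of resubmitting once the limit is reached.
  def decide(stage: ToyStage, maxConsecutiveStageAttempts: Int): Decision =
    if (stage.isAvailable) Finished
    else if (stage.attemptsSoFar >= maxConsecutiveStageAttempts)
      Abort(s"Stage ${stage.id} still has missing partitions " +
        s"after ${stage.attemptsSoFar} attempts")
    else Resubmit

  // Drive a stage that never becomes available, to show that the
  // resubmission loop terminates once the guard is in place.
  def runAlwaysFailing(maxConsecutiveStageAttempts: Int): (Int, Decision) = {
    var stage = ToyStage(id = 0, isAvailable = false, attemptsSoFar = 1)
    var decision = decide(stage, maxConsecutiveStageAttempts)
    while (decision == Resubmit) {
      stage = stage.copy(attemptsSoFar = stage.attemptsSoFar + 1)
      decision = decide(stage, maxConsecutiveStageAttempts)
    }
    (stage.attemptsSoFar, decision)
  }
}
```

With a limit of 4, runAlwaysFailing(4) stops after the fourth attempt with an
Abort decision, whereas removing the attemptsSoFar check would leave the
while loop running forever, which mirrors the behaviour reported here.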