[ https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Penglei Shi updated SPARK-40082:
--------------------------------
    Description: 
When push-based shuffle is enabled and speculative tasks exist, a shuffleMapStage is resubmitted once a fetchFailed occurs. Its parent stages are resubmitted first, and recomputing them takes some time. Before the shuffleMapStage itself is resubmitted, all of its speculative tasks succeed and register their map output, but these task-success events cannot trigger shuffleMergeFinalized because the stage has already been removed from runningStages.

When the stage is then resubmitted, the speculative tasks have already registered their map output, so there are no missing tasks to compute and the resubmission does not trigger shuffleMergeFinalized either. As a result, the stage's _shuffleMergedFinalized stays false.

AQE then submits the next stages, which depend on the shuffleMapStage that hit the fetchFailed. In getMissingParentStages, this stage is marked as missing and is resubmitted again, but the next stages are added only after this stage has finished, so they are never submitted even though the stage's resubmission has completed.

I have hit this only a few times in my production environment, and it is difficult to reproduce.

  was:
When push-based shuffle is enabled and speculative tasks exist, a shuffleMapStage is resubmitted once a fetchFailed occurs. Its parent stages are resubmitted first, and recomputing them takes some time. Before the shuffleMapStage itself is resubmitted, all of its speculative tasks succeed and register their map output, but these task-success events cannot trigger shuffleMergeFinalized because the stage has already been removed from runningStages.

!image-2022-08-15-17-17-08-666.png!

When the stage is then resubmitted, the speculative tasks have already registered their map output, so there are no missing tasks to compute and the resubmission does not trigger shuffleMergeFinalized either. As a result, the stage's _shuffleMergedFinalized stays false.

!image-2022-08-15-17-17-49-488.png!

AQE then submits the next stages, which depend on the shuffleMapStage that hit the fetchFailed. In getMissingParentStages, this stage is marked as missing and is resubmitted again, but the next stages are added only after this stage has finished, so they are never submitted even though the stage's resubmission has completed.

!image-2022-08-15-17-15-39-992.png!

I have hit this only a few times in my production environment, and it is difficult to reproduce.

> DAGScheduler may not schedule new stage in condition of push-based shuffle
> enabled
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-40082
>                 URL: https://issues.apache.org/jira/browse/SPARK-40082
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 3.1.1
>            Reporter: Penglei Shi
>            Priority: Major
>         Attachments: missParentStages.png, shuffleMergeFinalized.png,
> submitMissingTasks.png
>
>
> When push-based shuffle is enabled and speculative tasks exist, a
> shuffleMapStage is resubmitted once a fetchFailed occurs. Its parent stages
> are resubmitted first, and recomputing them takes some time. Before the
> shuffleMapStage itself is resubmitted, all of its speculative tasks succeed
> and register their map output, but these task-success events cannot trigger
> shuffleMergeFinalized because the stage has already been removed from
> runningStages.
> When the stage is then resubmitted, the speculative tasks have already
> registered their map output, so there are no missing tasks to compute and
> the resubmission does not trigger shuffleMergeFinalized either. As a result,
> the stage's _shuffleMergedFinalized stays false.
> AQE then submits the next stages, which depend on the shuffleMapStage that
> hit the fetchFailed.
> In getMissingParentStages, this stage is marked as missing and is
> resubmitted again, but the next stages are added only after this stage has
> finished, so they are never submitted even though the stage's resubmission
> has completed.
>
> I have hit this only a few times in my production environment, and it is
> difficult to reproduce.
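
The sequence of events in the report can be traced with a small toy model. This is only a sketch of the reported behaviour as described above, not Spark's DAGScheduler code: `ToyStage`, `ToyScheduler`, and every method name here are hypothetical, the parent stages of the failing shuffleMapStage are left out, and "finalized" is reduced to a single boolean flag.

```python
class ToyStage:
    """Hypothetical stand-in for a shuffle map stage."""

    def __init__(self, name, num_partitions, parents=()):
        self.name = name
        self.num_partitions = num_partitions
        self.parents = list(parents)
        self.outputs = set()          # partitions with registered map output
        self.merge_finalized = False  # stands in for _shuffleMergedFinalized

    def missing_partitions(self):
        return [p for p in range(self.num_partitions) if p not in self.outputs]

    def is_available(self):
        # Assumption: with push-based shuffle a stage only counts as
        # available once its merge has also been finalized.
        return not self.missing_partitions() and self.merge_finalized


class ToyScheduler:
    """Hypothetical, heavily simplified scheduler loop."""

    def __init__(self):
        self.running = set()
        self.waiting = set()

    def on_fetch_failed(self, stage):
        # The failed stage leaves the running set until it is resubmitted.
        self.running.discard(stage)

    def on_task_success(self, stage, partition):
        stage.outputs.add(partition)
        # Finalization is only triggered for stages that are still running.
        if stage in self.running and not stage.missing_partitions():
            stage.merge_finalized = True
            self.mark_finished(stage)

    def submit_stage(self, stage):
        missing = [p for p in stage.parents if not p.is_available()]
        if not missing:
            self.submit_missing_tasks(stage)
        else:
            for parent in missing:
                self.submit_stage(parent)
            # The child is recorded as waiting only after its parents
            # have been submitted.
            self.waiting.add(stage)

    def submit_missing_tasks(self, stage):
        self.running.add(stage)
        if not stage.missing_partitions():
            # Nothing to compute: the stage finishes without finalizing.
            self.mark_finished(stage)

    def mark_finished(self, stage):
        self.running.discard(stage)
        # Waiting children are woken only at the moment a parent finishes.
        for child in [s for s in self.waiting if stage in s.parents]:
            self.waiting.discard(child)
            self.submit_stage(child)


parent = ToyStage("shuffleMapStage", num_partitions=2)
child = ToyStage("nextStage", num_partitions=1, parents=[parent])
sched = ToyScheduler()

sched.submit_stage(parent)        # first attempt starts running
sched.on_task_success(parent, 0)
sched.on_fetch_failed(parent)     # stage removed from the running set
sched.on_task_success(parent, 1)  # speculative task registers its output
sched.submit_stage(parent)        # resubmission finds no missing tasks
sched.submit_stage(child)         # next stage is submitted afterwards

print(parent.missing_partitions(), parent.merge_finalized)
print(child in sched.waiting, child in sched.running)
```

In this model the next stage ends up in the waiting set with nothing left to wake it: its parent has every map output registered but is never marked finalized, and the parent's last resubmission finishes before the child is added to the waiting set.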