[ 
https://issues.apache.org/jira/browse/SPARK-40082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Penglei Shi updated SPARK-40082:
--------------------------------
    Description: 
When push-based shuffle is enabled and speculative tasks exist, a
shuffleMapStage is resubmitted once a fetchFailed occurs. Its parent stages are
resubmitted first, and they take some time to compute. Before the
shuffleMapStage is resubmitted, all of its speculative tasks succeed and
register their map output, but the task success events cannot trigger
shuffleMergeFinalized because this stage has already been removed from
runningStages.

Then this stage is resubmitted, but the speculative tasks have already
registered map output and there are no missing tasks to compute, so the
resubmission does not trigger shuffleMergeFinalized either. As a result, this
stage's _shuffleMergedFinalized stays false.

Then AQE submits the next stages, which depend on the shuffleMapStage that hit
the fetchFailed. In getMissingParentStages, this stage is marked as missing and
is resubmitted again, but the next stages are added only after this stage has
finished, so they are never submitted even though the resubmission has
completed.
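
The sequence above can be replayed in a small toy model. This is only a sketch
of the reported behaviour, not Spark's DAGScheduler code: the names Stage,
ToyScheduler, merge_finalized and replay are hypothetical, and the model assumes
that a stage with no missing tasks finishes immediately and that a child stage
is put on the waiting set only after its missing parents have been submitted.

```python
class Stage:
    """Toy stand-in for a stage; not Spark's ShuffleMapStage."""

    def __init__(self, name, num_tasks, parents=()):
        self.name = name
        self.num_tasks = num_tasks
        self.parents = list(parents)
        self.map_outputs = set()      # partitions with registered map output
        self.merge_finalized = False  # models _shuffleMergedFinalized

    def is_available(self):
        # Assumption: with push-based shuffle, a stage only counts as
        # available once all outputs exist AND the merge is finalized.
        return len(self.map_outputs) == self.num_tasks and self.merge_finalized


class ToyScheduler:
    def __init__(self):
        self.running = set()
        self.waiting = set()

    def on_fetch_failed(self, stage):
        # The failed stage leaves the running set until it is resubmitted.
        self.running.discard(stage)

    def on_task_success(self, stage, partition):
        stage.map_outputs.add(partition)
        # Merge finalization is only triggered for stages still running.
        if stage in self.running and len(stage.map_outputs) == stage.num_tasks:
            stage.merge_finalized = True
            self.running.discard(stage)
            self.submit_waiting_children(stage)

    def submit(self, stage):
        missing_parents = [p for p in stage.parents if not p.is_available()]
        if missing_parents:
            for parent in missing_parents:
                self.submit(parent)
            # The child is registered as waiting only after the parents
            # have been submitted (and possibly already finished).
            self.waiting.add(stage)
            return
        missing = [i for i in range(stage.num_tasks) if i not in stage.map_outputs]
        if missing:
            self.running.add(stage)
        else:
            # Nothing to compute: the stage finishes at once, and the
            # merge is never finalized on this path.
            self.submit_waiting_children(stage)

    def submit_waiting_children(self, parent):
        for child in [s for s in self.waiting if parent in s.parents]:
            self.waiting.discard(child)
            self.submit(child)


def replay():
    sched = ToyScheduler()
    map_stage = Stage("shuffleMapStage", num_tasks=2)
    next_stage = Stage("nextStage", num_tasks=1, parents=[map_stage])

    sched.submit(map_stage)              # first attempt starts running
    sched.on_task_success(map_stage, 0)
    sched.on_fetch_failed(map_stage)     # stage is removed from running
    sched.on_task_success(map_stage, 1)  # late speculative task registers output
    sched.submit(map_stage)              # resubmission finds no missing tasks
    sched.submit(next_stage)             # AQE submits the dependent stage
    return sched, map_stage, next_stage


if __name__ == "__main__":
    sched, map_stage, next_stage = replay()
    print("merge finalized:", map_stage.merge_finalized)
    print("next stage stuck waiting:", next_stage in sched.waiting)
```

In this model the dependent stage ends up in the waiting set with nothing left
to wake it, which mirrors the hang described in this report.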

 

I have hit this only a few times in my production environment, and it is
difficult to reproduce.

  was:
When push-based shuffle is enabled and speculative tasks exist, a
shuffleMapStage is resubmitted once a fetchFailed occurs. Its parent stages are
resubmitted first, and they take some time to compute. Before the
shuffleMapStage is resubmitted, all of its speculative tasks succeed and
register their map output, but the task success events cannot trigger
shuffleMergeFinalized because this stage has already been removed from
runningStages.

!image-2022-08-15-17-17-08-666.png!

Then this stage is resubmitted, but the speculative tasks have already
registered map output and there are no missing tasks to compute, so the
resubmission does not trigger shuffleMergeFinalized either. As a result, this
stage's _shuffleMergedFinalized stays false.

!image-2022-08-15-17-17-49-488.png!

Then AQE submits the next stages, which depend on the shuffleMapStage that hit
the fetchFailed. In getMissingParentStages, this stage is marked as missing and
is resubmitted again, but the next stages are added only after this stage has
finished, so they are never submitted even though the resubmission has
completed.

!image-2022-08-15-17-15-39-992.png!

 

I have hit this only a few times in my production environment, and it is
difficult to reproduce.


> DAGScheduler may not schedule new stages when push-based shuffle is
> enabled
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-40082
>                 URL: https://issues.apache.org/jira/browse/SPARK-40082
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 3.1.1
>            Reporter: Penglei Shi
>            Priority: Major
>         Attachments: missParentStages.png, shuffleMergeFinalized.png, 
> submitMissingTasks.png



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
