cloud-fan opened a new pull request #24359: Revert 
"[SPARK-23433][SPARK-25250][CORE] Later created TaskSet should learn about the 
finished partitions
URL: https://github.com/apache/spark/pull/24359
 
 
   ## What changes were proposed in this pull request?
   
   Our customer has a very complicated job. Sometimes it successes and 
sometimes it fails with
   ```
   Caused by: org.apache.spark.SparkException: Job aborted due to stage 
failure: ShuffleMapStage 4  has failed the maximum allowable number of times: 4.
   Most recent failure reason: org.apache.spark.shuffle.FetchFailedException
   ```
   
   However, with the patch https://github.com/apache/spark/pull/23871 , the job 
hangs forever.
   
   When I investigated it, I found that `DAGScheduler` and `TaskSchedulerImpl` 
define stage completion differently. `DAGScheduler` thinks a stage is completed 
if all its partitions are marked as completed ([result 
stage](https://github.com/apache/spark/blob/v2.4.1/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1362-L1368)
 and [shuffle 
stage](https://github.com/apache/spark/blob/v2.4.1/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1400)).
 `TaskSchedulerImpl` thinks a stage's task set is completed when all tasks 
finish (see the 
[code](https://github.com/apache/spark/blob/v2.4.1/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L779-L784)).
   
   Ideally this two definition should be consistent, but #23871 breaks it. In 
our customer's Spark log, I found that, a stage's task set completes, but the 
stage never completes. More specifically, `DAGScheduler` submits a task set for 
stage 4.1 with 1000 tasks, but the `TaskSetManager` skips to run the first 100 
tasks. Later on, `TaskSetManager` finishes 900 tasks and marks the task set as 
completed. However, `DAGScheduler` doesn't agree with it and hangs forever, 
waiting for more task completion events of stage 4.1.
   
   With hindsight, I think `TaskSchedulerIImpl.stageIdToFinishedPartitions` is 
fragile. We need to pay more effort to make sure this is consistent with 
`DAGScheduler`'s knowledge. When `DAGScheduler` marks some partitions from 
finished to unfinished, `TaskSchedulerIImpl.stageIdToFinishedPartitions` should 
be updated as well.
   
   This PR reverts #23871, let's think of a more robust idea later.
   
   ## How was this patch tested?
   
   N/A
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to