seayoun opened a new pull request #27211: [SPARK-30325][CORE] 
markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/27211
 
 
   ### **What changes were proposed in this pull request?**
    Fix the inconsistent task status in `executorLost` that is caused by 
`markPartitionCompleted`.
   
   ### **Why are the changes needed?**
   The inconsistency can cause the application to hang.
   The bug occurs in the following corner case:
   1. A stage is retried, and the scheduler resubmits a new stage attempt 
containing the unfinished tasks.
   2. An unfinished task in the original stage attempt finishes while the same 
task in the new retry attempt has not. The task's partition is then marked as 
successful in the `successful` array of the retry attempt's TaskSetManager (TSM).
   3. The executor crashes while it is running a task whose partition has 
already been completed by the original attempt. The TSM runs `executorLost` to 
reschedule the tasks on that executor, and this decrements the partition's 
entry in `copiesRunning` twice, leaving it at -1.
   4. `dequeueTaskFromList` requires `copiesRunning` to equal 0 before it 
reschedules a task. Because the value is now -1, the task can never be 
rescheduled, and the application hangs forever.
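
The four steps above can be reproduced with a small toy model. This is only a sketch in Python, not Spark's `TaskSetManager` code: the class `ToyTaskSet`, its method names, and the `pending` list are hypothetical, and the two decrement paths inside `executor_lost` are an assumption about how the double decrement described in step 3 arises.

```python
class ToyTaskSet:
    """Toy stand-in for the bookkeeping described above; not Spark code."""

    def __init__(self, num_partitions):
        self.successful = [False] * num_partitions
        self.copies_running = [0] * num_partitions
        self.pending = list(range(num_partitions))

    def launch(self, index):
        # One attempt of this partition starts running on an executor.
        self.copies_running[index] += 1
        self.pending.remove(index)

    def mark_partition_completed(self, index):
        # Another attempt of the stage finished this partition (step 2).
        # The local attempt is still running, so copies_running stays at 1.
        self.successful[index] = True

    def executor_lost(self, index):
        # Assumed path 1: the partition looks successful, so undo the success
        # and queue the task again.
        if self.successful[index]:
            self.successful[index] = False
            self.copies_running[index] -= 1
            self.pending.append(index)
        # Assumed path 2: the attempt still running on the lost executor fails.
        self.copies_running[index] -= 1

    def dequeue(self):
        # Only a task with exactly zero running copies may be rescheduled (step 4).
        for index in self.pending:
            if self.copies_running[index] == 0 and not self.successful[index]:
                return index
        return None


ts = ToyTaskSet(1)
ts.launch(0)
ts.mark_partition_completed(0)
ts.executor_lost(0)
print(ts.copies_running[0], ts.dequeue())  # -1 None
```

In this model the task sits in the pending list but is never handed out, because its running-copy counter is -1 rather than 0.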
   
   ### **Does this PR introduce any user-facing change?**
   No
   
   ### **How was this patch tested?**

