seayoun opened a new pull request #27211: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/27211

### **What changes were proposed in this pull request?**
Fix the inconsistent task status in `executorLost` that is caused by `markPartitionCompleted`.

### **Why are the changes needed?**
The inconsistency can cause the application to hang. The bug occurs in the following corner case:
1. A stage is retried, and the scheduler resubmits a new stage attempt containing the unfinished tasks.
2. An unfinished task from the original stage attempt finishes while the same task in the new retry attempt has not. The partition is then marked as successful in the `successful` array of the retry attempt's TaskSetManager (TSM).
3. The executor crashes while running a task whose partition was already completed by the original attempt. The TSM runs `executorLost` to reschedule the tasks on that executor, and the partition's entry in `copiesRunning` is decremented twice, ending at -1.
4. `dequeueTaskFromList` only reschedules a task whose `copiesRunning` entry equals 0. Because the entry is now -1, the task can never be rescheduled, and the application hangs forever.

### **Does this PR introduce any user-facing change?**
No

### **How was this patch tested?**
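
To illustrate steps 1-4 under "Why are the changes needed?", the double decrement can be reproduced in a toy model. This is a minimal sketch in Python, not Spark's `TaskSetManager` code and not the test for this patch. The class `ToyTaskSetManager`, its methods, and the executor id `exec-1` are hypothetical names, and the exact conditions under which Spark performs each decrement are simplified assumptions based on the description above.

```python
class ToyTaskSetManager:
    """Toy stand-in for a task set manager; every name here is hypothetical."""

    def __init__(self, num_partitions):
        self.copies_running = [0] * num_partitions
        self.successful = [False] * num_partitions
        self.running_on = {}  # partition index -> executor id

    def launch(self, partition, executor):
        # One copy of the task starts running on the given executor.
        self.copies_running[partition] += 1
        self.running_on[partition] = executor

    def mark_partition_completed(self, partition):
        # Step 2: another stage attempt finished this partition, so it is
        # flagged successful even though the local copy is still running.
        self.successful[partition] = True

    def executor_lost(self, executor):
        # Step 3: reschedule the work that lived on the lost executor.
        for partition, host in list(self.running_on.items()):
            if host != executor:
                continue
            if self.successful[partition]:
                # Assumed behavior: a "successful" task on a lost executor
                # is un-marked and its running count is decremented.
                self.successful[partition] = False
                self.copies_running[partition] -= 1
            # The still-running copy is also treated as failed, which
            # decrements the same counter a second time.
            self.copies_running[partition] -= 1
            del self.running_on[partition]

    def schedulable(self):
        # Step 4: only partitions with exactly 0 running copies qualify.
        return [p for p, copies in enumerate(self.copies_running) if copies == 0]


tsm = ToyTaskSetManager(num_partitions=1)
tsm.launch(0, "exec-1")
tsm.mark_partition_completed(0)
tsm.executor_lost("exec-1")
print(tsm.copies_running)  # [-1]
print(tsm.schedulable())   # []
```

In this model the partition ends with a running count of -1, so a check for a count equal to 0 never selects it again, which matches the hang described in step 4.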
