jiangxb1987 commented on issue #26975: [SPARK-30325][CORE] markPartitionCompleted cause task status inconsistent
URL: https://github.com/apache/spark/pull/26975#issuecomment-569365741

There are multiple corner cases not handled by the current solution. Imagine we have two TSMs (M1 and M2) working on the same stage, and the corresponding tasks for a specific partition are denoted T1 and T2:

1. T1 and T2 might be scheduled on different executors (E1 and E2), and both tasks have finished. Then E2 gets lost. With the approach suggested by this PR, the partition in M2 will be marked as not successful and a new pending task will be added, which is unnecessary because the shuffle files are on E1.
2. T1 and T2 might be scheduled on the same executor; T1 has finished but T2 is still running. Then the executor gets lost. Since T2 is still running, the partition will not be marked as not successful. After a while, another task may finish and mark the TSM as finished even though the shuffle files have been lost, which leads to a new regression.

I don't have a solution yet. I'm wondering whether we can put the successful task's information into taskInfos inside markPartitionCompleted; if that is possible, the second problem mentioned above could probably be resolved.
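
To make the two cases concrete, here is a small toy model in Python. It is only a sketch under assumptions and is not Spark's scheduler code: `ToyTSM`, its methods, and the `output_location` map are hypothetical names invented for illustration. It assumes that a TSM resets only its own finished tasks on a lost executor, and that the first task to finish is the one whose shuffle output is registered.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTSM:
    """Toy stand-in for a TaskSetManager (hypothetical, not Spark's class)."""
    num_partitions: int
    successful: set = field(default_factory=set)    # partitions this TSM treats as done
    pending: set = field(default_factory=set)       # partitions queued for a re-run
    task_infos: dict = field(default_factory=dict)  # partition -> [executor, finished]

    def launch(self, partition, executor):
        self.task_infos[partition] = [executor, False]

    def task_finished(self, partition):
        self.task_infos[partition][1] = True
        self.successful.add(partition)

    def mark_partition_completed(self, partition):
        # Completion reported by a sibling TSM; no task info is recorded here.
        self.successful.add(partition)

    def executor_lost(self, executor):
        # Assumed rule: only this TSM's own finished tasks on the lost executor are reset.
        for partition, (ex, finished) in self.task_infos.items():
            if ex == executor and finished:
                self.successful.discard(partition)
                self.pending.add(partition)

    def is_finished(self):
        return len(self.successful) == self.num_partitions


def scenario_one():
    """Case 1: T1 on E1 and T2 on E2 both finish, then E2 is lost."""
    m1, m2 = ToyTSM(1), ToyTSM(1)
    m1.launch(0, "E1")
    m2.launch(0, "E2")
    m1.task_finished(0)
    m2.mark_partition_completed(0)
    output_location = {0: "E1"}      # assumption: T1 finished first, files live on E1
    m2.task_finished(0)
    m1.mark_partition_completed(0)
    m2.executor_lost("E2")
    return m2.pending, output_location


def scenario_two():
    """Case 2: T1 finished and T2 still running on E1, then E1 is lost."""
    m1, m2 = ToyTSM(2), ToyTSM(2)
    m1.launch(0, "E1")
    m2.launch(0, "E1")
    m1.task_finished(0)
    m2.mark_partition_completed(0)
    output_location = {0: "E1"}
    del output_location[0]           # E1 is lost, so its shuffle files are gone
    m1.executor_lost("E1")
    m2.executor_lost("E1")           # T2 was not finished, so M2 resets nothing
    m2.launch(1, "E3")               # later, another task completes partition 1
    m2.task_finished(1)
    output_location[1] = "E3"
    return m1.pending, m2.is_finished(), output_location
```

In this model, `scenario_one()` leaves partition 0 pending in M2 even though its output is still on E1, and `scenario_two()` ends with M2 reporting itself finished while partition 0 has no output location. If `mark_partition_completed` also recorded the successful task's executor in `task_infos`, `executor_lost` would have enough information to reset partition 0 in the second case, which is the idea floated above.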
