leixm opened a new pull request, #3556: URL: https://github.com/apache/celeborn/pull/3556
### What changes were proposed in this pull request? Retrieve the number of task failures from TaskSetManager in SparkUtils#shouldReportShuffleFetchFailure method ### Why are the changes needed? https://github.com/apache/celeborn/blob/main/client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java#L484 We record the failure counts for task attempts in the "UNKNOWN" and "FAILED" states, but spark might not record the failure counts for task attempts in the FAILED state. This is a common occurrence in our production environment where task attempts fail due to container preemption. This situation happens frequently and failure counts should not be recorded, as existing code logic makes it easier for stageRerun to be triggered prematurely. Therefore, obtaining the failure counts for task attempts from the taskSetManager would be more accurate. ### Does this PR resolve a correctness bug? No. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing UTs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
