leixm opened a new pull request, #3556:
URL: https://github.com/apache/celeborn/pull/3556

   ### What changes were proposed in this pull request?
   Retrieve the number of task failures from TaskSetManager in the
SparkUtils#shouldReportShuffleFetchFailure method.
   
   
   ### Why are the changes needed?
   
https://github.com/apache/celeborn/blob/main/client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java#L484
 We count task attempts in the "UNKNOWN" and "FAILED" states as failures, but
Spark does not necessarily count a task attempt in the FAILED state toward the
task's failure count. This happens frequently in our production environment,
where task attempts fail because of container preemption. Such failures should
not be counted, because the existing logic lets them trigger stageRerun
prematurely. Obtaining the failure count for task attempts from the
TaskSetManager is therefore more accurate.
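
   The difference between the two counting strategies can be sketched as follows. This is an illustrative model only, not the code in this PR: `Attempt`, `countByState`, `countLikeScheduler`, `reachesLimit` and the limit of 4 are hypothetical names and values, and the `countsTowardsFailures` flag is assumed to stand in for whatever the scheduler uses to decide whether a failure (for example, one caused by preemption) is counted.

```java
import java.util.Arrays;
import java.util.List;

final class FailureCountSketch {
    enum State { RUNNING, SUCCESS, FAILED, UNKNOWN }

    // Hypothetical stand-in for a task attempt; not a Spark or Celeborn class.
    static final class Attempt {
        final State state;
        // Assumption: true only if the scheduler counts this failure against the task.
        final boolean countsTowardsFailures;

        Attempt(State state, boolean countsTowardsFailures) {
            this.state = state;
            this.countsTowardsFailures = countsTowardsFailures;
        }
    }

    // State-based counting: every FAILED or UNKNOWN attempt is treated as a failure.
    static int countByState(List<Attempt> attempts) {
        int failures = 0;
        for (Attempt a : attempts) {
            if (a.state == State.FAILED || a.state == State.UNKNOWN) {
                failures++;
            }
        }
        return failures;
    }

    // Scheduler-style counting: only failures the scheduler itself counts are included.
    static int countLikeScheduler(List<Attempt> attempts) {
        int failures = 0;
        for (Attempt a : attempts) {
            if (a.state == State.FAILED && a.countsTowardsFailures) {
                failures++;
            }
        }
        return failures;
    }

    // Hypothetical threshold check: one more failure would exhaust the retry budget.
    static boolean reachesLimit(int failures, int maxFailures) {
        return failures >= maxFailures - 1;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = Arrays.asList(
            new Attempt(State.FAILED, true),   // genuine task failure
            new Attempt(State.FAILED, false),  // e.g. container preempted
            new Attempt(State.FAILED, false),  // e.g. container preempted
            new Attempt(State.RUNNING, false)  // current attempt
        );
        int maxFailures = 4; // illustrative limit
        System.out.println("by state:  " + reachesLimit(countByState(attempts), maxFailures));
        System.out.println("scheduler: " + reachesLimit(countLikeScheduler(attempts), maxFailures));
    }
}
```

   In this model, the two preempted attempts push the state-based count to the limit, while the scheduler-style count stays below it, which is the premature stageRerun scenario described above.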
   
   ### Does this PR resolve a correctness bug?
   No.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Existing UTs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
