gaoyajun02 opened a new pull request, #3339:
URL: https://github.com/apache/celeborn/pull/3339

   
   ### What changes were proposed in this pull request?
   This PR fixes the FetchFailure handling logic in the 
`shouldReportShuffleFetchFailure` method so that it correctly handles cases 
where no TaskSetManager can be found for a given task ID.
   
   ### Why are the changes needed?
   The current implementation incorrectly reports a FetchFailure when the 
TaskSetManager is not found, which leads to false-positive failures in normal 
fault-tolerance scenarios. This happens because:
   1. Executor-lost scenarios: when executors are lost due to resource 
preemption or failures, the associated TaskSetManager is cleaned up and is no 
longer available for lookup.
   2. Stage cancellation: cancelled or completed stages may have their 
TaskSetManager removed.
   These are all normal scenarios in Spark's fault-tolerance mechanism and 
should not be treated as shuffle failures. The current behavior can cause 
unnecessary job failures and confusion when debugging actual shuffle issues.
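
A minimal sketch of the intended behavior, not the actual patch: it only illustrates that a missing TaskSetManager should lead to "do not report". The class `FetchFailureSketch`, the `taskIdToTaskSet` map, and the `register`/`remove` helpers are hypothetical stand-ins for Spark's scheduler state, and the real method applies further checks that are omitted here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the lookup described above; not Celeborn or Spark code.
final class FetchFailureSketch {
    // Assumption: task ID -> name of the task set that owns it.
    // In Spark, this role is played by the scheduler's TaskSetManager lookup.
    private final Map<Long, String> taskIdToTaskSet = new ConcurrentHashMap<>();

    void register(long taskId, String taskSetName) {
        taskIdToTaskSet.put(taskId, taskSetName);
    }

    // Models cleanup after an executor loss or a cancelled/completed stage.
    void remove(long taskId) {
        taskIdToTaskSet.remove(taskId);
    }

    boolean shouldReportShuffleFetchFailure(long taskId) {
        String taskSet = taskIdToTaskSet.get(taskId);
        if (taskSet == null) {
            // No task set found: a normal fault-tolerance case,
            // so it is not treated as a shuffle fetch failure.
            return false;
        }
        // The task set is still tracked. The real method performs additional
        // checks at this point; this sketch simply reports the failure.
        return true;
    }
}
```

With this shape, a task whose task set has already been cleaned up returns `false`, so only failures for tasks that are still tracked are candidates for reporting.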
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Unit tests (UT) and regression testing.
   


