gaoyajun02 opened a new pull request, #3339: URL: https://github.com/apache/celeborn/pull/3339
### What changes were proposed in this pull request?

This PR fixes the FetchFailure handling logic in the `shouldReportShuffleFetchFailure` method so that it properly handles cases where the TaskSetManager cannot be found for a given task ID.

### Why are the changes needed?

The current implementation incorrectly reports a FetchFailure when the TaskSetManager is not found, which leads to false-positive failures in normal fault-tolerance scenarios. This happens because:

1. Executor lost scenarios: when executors are lost due to resource preemption or failures, the associated TaskSetManager gets cleaned up, making it unavailable for lookup.
2. Stage cancellation: cancelled or completed stages may have their TaskSetManager removed.

These are all normal scenarios in Spark's fault-tolerance mechanism and should not be treated as shuffle failures. The current behavior can cause unnecessary job failures and makes actual shuffle issues harder to debug.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT, Regression Testing
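The decision described above can be illustrated with a minimal, self-contained sketch. This is not the Celeborn or Spark code: `FetchFailureDecision`, `TaskSetInfo`, `register`, `remove`, and the `hasOtherRunningAttempt` flag are hypothetical names introduced only for illustration, and the rule applied when the manager *is* found is a placeholder assumption. The only part taken from this PR's description is the missing-manager branch, which should not report a FetchFailure.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the lookup-then-decide flow; not the actual Celeborn implementation.
final class FetchFailureDecision {

    // Hypothetical stand-in for the state a real TaskSetManager would expose.
    static final class TaskSetInfo {
        final boolean hasOtherRunningAttempt;

        TaskSetInfo(boolean hasOtherRunningAttempt) {
            this.hasOtherRunningAttempt = hasOtherRunningAttempt;
        }
    }

    // taskId -> task set state; an entry disappears when its task set is cleaned up.
    private final Map<Long, TaskSetInfo> taskSetsByTaskId = new HashMap<>();

    void register(long taskId, TaskSetInfo info) {
        taskSetsByTaskId.put(taskId, info);
    }

    // Models cleanup after an executor loss or a stage cancellation/completion.
    void remove(long taskId) {
        taskSetsByTaskId.remove(taskId);
    }

    boolean shouldReportShuffleFetchFailure(long taskId) {
        TaskSetInfo info = taskSetsByTaskId.get(taskId);
        if (info == null) {
            // Manager not found: per the PR description this is a normal
            // fault-tolerance scenario, so do not report a FetchFailure.
            return false;
        }
        // Placeholder rule (an assumption for this sketch only): report the
        // failure unless another attempt of the same task is still running.
        return !info.hasOtherRunningAttempt;
    }

    public static void main(String[] args) {
        FetchFailureDecision decision = new FetchFailureDecision();
        decision.register(1L, new TaskSetInfo(false));
        System.out.println("task 1 (manager present): "
                + decision.shouldReportShuffleFetchFailure(1L));
        decision.remove(1L); // simulate cleanup after executor loss
        System.out.println("task 1 (manager removed): "
                + decision.shouldReportShuffleFetchFailure(1L));
    }
}
```

In this sketch, the same task ID yields a different decision before and after `remove` is called, which mirrors the intent of the fix: a missing manager alone is not evidence of a shuffle failure.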
