turboFei opened a new pull request, #3080:
URL: https://github.com/apache/celeborn/pull/3080

   What changes were proposed in this pull request?
   Do not trigger a fetch failure if a Spark task attempt is interrupted
   (for example, when speculation is enabled) or if the getReducerFileGroup
   RPC times out. This PR is intended for the celeborn-0.5 branch.
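
   The rule described above can be sketched as a small predicate. This is only an illustration: `FetchFailureDecision`, `shouldReportFetchFailure`, and the `taskInterrupted` flag are hypothetical names, and the assumption that an RPC timeout surfaces as a `TimeoutException` somewhere in the cause chain is not taken from the patch itself.

   ```scala
   import java.util.concurrent.TimeoutException
   import scala.annotation.tailrec

   // Hypothetical model of the decision; not the actual Celeborn implementation.
   object FetchFailureDecision {

     // Walks the cause chain and reports whether any throwable satisfies `matches`.
     @tailrec
     def hasCause(error: Throwable, matches: Throwable => Boolean): Boolean =
       if (error == null) false
       else if (matches(error)) true
       else hasCause(error.getCause, matches)

     // Returns true only when the error should be reported as a fetch failure.
     // Interrupted attempts (for example, a killed speculative task) and RPC
     // timeouts are assumed to be transient, so they must not cause a stage re-run.
     def shouldReportFetchFailure(error: Throwable, taskInterrupted: Boolean): Boolean =
       if (taskInterrupted) false
       else !hasCause(
         error,
         e => e.isInstanceOf[InterruptedException] || e.isInstanceOf[TimeoutException])
   }
   ```

   With this sketch, a plain `IOException` on a running attempt would still be reported as a fetch failure, while the same exception on an interrupted attempt, or one wrapping a `TimeoutException`, would not.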
   
   Why are the changes needed?
   Avoid unnecessary fetch failures and stage re-runs.
   
   Does this PR introduce any user-facing change?
   NO.
   
   How was this patch tested?
   1. GitHub Actions (GA).
   2. Manually tested on a cluster with Spark speculative tasks.
   
   Here is the test case:
   ```scala
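   // 100 map tasks, each emitting the integers 1 to 150000.
   sc.parallelize(1 to 100, 100)
     .flatMap(_ => (1 to 150000).iterator)
     .groupBy(i => i, 100)
     .map { group =>
       // Delay the groups with keys 1 to 4 so that their tasks become stragglers.
       if (group._1 < 5) {
         Thread.sleep(15000)
       }
       group
     }
     .repartition(400)
     .count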
   ```
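
   The test case only exercises the change when speculation is active. A minimal sketch of the settings one might pass to `spark-shell` follows. The PR does not list the configuration that was used, so the values here are assumptions, and `SpeculationTestConf` is a hypothetical helper that only renders `--conf` arguments.

   ```scala
   // Hypothetical helper; the settings are assumed, not taken from the PR.
   object SpeculationTestConf {
     val settings: Seq[(String, String)] = Seq(
       // Let Spark launch speculative attempts for slow tasks.
       "spark.speculation" -> "true",
       // Use Celeborn as the shuffle service.
       "spark.shuffle.manager" -> "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
       // Report Celeborn fetch errors to Spark as fetch failures.
       "spark.celeborn.client.spark.fetch.throwsFetchFailure" -> "true"
     )

     // Renders the settings as command-line arguments for spark-shell or spark-submit.
     def toArgs: Seq[String] =
       settings.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }
   }
   ```

   Joining `SpeculationTestConf.toArgs` with spaces gives the argument string to append to the `spark-shell` command line.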
   
   <img width="1384" alt="Screenshot 2025-01-18 16 16 16" src="https://github.com/user-attachments/assets/adf64857-5773-4081-a7d0-fa3439e751eb" />
   <img width="1393" alt="Screenshot 2025-01-18 16 16 22" src="https://github.com/user-attachments/assets/ac9bf172-1ab4-4669-a930-872d009f2530" />
   <img width="1258" alt="Screenshot 2025-01-18 16 19 15" src="https://github.com/user-attachments/assets/6a8ff3e1-c1fb-4ef2-84d8-b1fc6eb56fa6" />
   <img width="892" alt="Screenshot 2025-01-18 16 17 27" src="https://github.com/user-attachments/assets/f9de3841-f7d4-4445-99a3-873235d4abd0" />
   
   Closes #3070 from FMX/branch-0.5-b1838.
   
   Authored-by: mingji <[email protected]>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

