turboFei opened a new pull request, #3080:
URL: https://github.com/apache/celeborn/pull/3080
What changes were proposed in this pull request?
Do not trigger a fetch failure if a Spark task attempt is
interrupted (speculation enabled). Do not trigger a fetch failure if the
getReducerFileGroup RPC times out. This PR is intended for the celeborn-0.5 branch.
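As a rough illustration of the rule described above, here is a minimal sketch, not the code in this patch. `FetchFailureDecision`, `ReadOutcome`, `decide`, and `causedBy` are hypothetical names, and mapping an RPC timeout to `java.util.concurrent.TimeoutException` is an assumption. The idea is that an interrupted attempt or a timed-out RPC does not show that shuffle data is lost, so the error can be rethrown as an ordinary task failure instead of a fetch failure that would make Spark re-run the upstream stage.

```scala
import java.util.concurrent.TimeoutException
import scala.annotation.tailrec

object FetchFailureDecision {
  // Hypothetical outcome type for illustration only; not a Celeborn class.
  sealed trait ReadOutcome
  case object ReportFetchFailure extends ReadOutcome
  case object RethrowWithoutFetchFailure extends ReadOutcome

  // Walk the cause chain (bounded depth) looking for an exception of the given class.
  @tailrec
  def causedBy(e: Throwable, cls: Class[_ <: Throwable], depth: Int = 0): Boolean =
    if (e == null || depth > 16) false
    else if (cls.isInstance(e)) true
    else causedBy(e.getCause, cls, depth + 1)

  // Decide how a shuffle read error should surface to Spark.
  def decide(error: Throwable, taskInterrupted: Boolean): ReadOutcome =
    if (taskInterrupted || causedBy(error, classOf[InterruptedException])) {
      // The attempt was killed, for example because a speculative copy finished first.
      RethrowWithoutFetchFailure
    } else if (causedBy(error, classOf[TimeoutException])) {
      // Assumption: the getReducerFileGroup RPC timeout surfaces as a TimeoutException.
      RethrowWithoutFetchFailure
    } else {
      ReportFetchFailure
    }
}
```

With this shape, only errors that are neither interruptions nor timeouts reach the fetch-failure path.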
Why are the changes needed?
Avoid unnecessary fetch failures and stage re-runs.
Does this PR introduce any user-facing change?
NO.
How was this patch tested?
1. GA (GitHub Actions).
2. Manually tested on a cluster with Spark speculation tasks.
Here is the test case:
```scala
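// 100 map tasks, each emitting the integers 1 to 150000
sc.parallelize(1 to 100, 100).flatMap(i => {
  (1 to 150000).iterator.map(num => num)
}).groupBy(i => i, 100)
  .map(i => {
    // Delay keys 1 to 4 so their tasks become stragglers
    if (i._1 < 5) {
      Thread.sleep(15000)
    }
    i
  })
  .repartition(400).count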
```
<img width="1384" alt="Screenshot 2025-01-18 16 16 16"
src="https://github.com/user-attachments/assets/adf64857-5773-4081-a7d0-fa3439e751eb"
/>
<img width="1393" alt="Screenshot 2025-01-18 16 16 22"
src="https://github.com/user-attachments/assets/ac9bf172-1ab4-4669-a930-872d009f2530"
/>
<img width="1258" alt="Screenshot 2025-01-18 16 19 15"
src="https://github.com/user-attachments/assets/6a8ff3e1-c1fb-4ef2-84d8-b1fc6eb56fa6"
/>
<img width="892" alt="Screenshot 2025-01-18 16 17 27"
src="https://github.com/user-attachments/assets/f9de3841-f7d4-4445-99a3-873235d4abd0"
/>
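The test case above only exercises the interrupted-attempt path when speculation is switched on. The PR does not list the settings used on the cluster, so the following is a sketch of settings that could reproduce it, under assumptions: the keys are existing Spark and Celeborn configuration names, the values are illustrative, and `SpeculationTestConf` is a hypothetical helper.

```scala
object SpeculationTestConf {
  // Assumed settings for illustration; the PR does not state the ones actually used.
  val settings: Seq[(String, String)] = Seq(
    // Use Celeborn as the shuffle service.
    "spark.shuffle.manager" -> "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
    // Let Celeborn read errors surface as Spark fetch failures.
    "spark.celeborn.client.spark.fetch.throwsFetchFailure" -> "true",
    // Launch speculative copies of slow tasks.
    "spark.speculation" -> "true",
    "spark.speculation.multiplier" -> "1.5",
    "spark.speculation.quantile" -> "0.75"
  )

  // Render the settings as spark-shell / spark-submit arguments.
  def toSubmitArgs: String =
    settings.map { case (k, v) => s"--conf $k=$v" }.mkString(" ")
}
```

Passing `SpeculationTestConf.toSubmitArgs` to `spark-shell` would start a session in which the 15-second sleeps make the slow tasks candidates for speculation.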
Closes #3070 from FMX/branch-0.5-b1838.
Authored-by: mingji <[email protected]>