pan3793 opened a new pull request, #2718: URL: https://github.com/apache/celeborn/pull/2718
Backport CELEBORN-1544 (https://github.com/apache/celeborn/pull/2661 and https://github.com/apache/celeborn/pull/2663) to branch-0.4

### What changes were proposed in this pull request?

This PR aims to fix a possible memory leak in ShuffleWriter.

### Why are the changes needed?

When `spark.speculation=true` is enabled, or when a running SQL query is killed, a task may be interrupted. In that case `ShuffleWriter` may never call `close`, so `DataPusher#idleQueue` keeps occupying memory (`celeborn.client.push.buffer.max.size` * `celeborn.client.push.queue.capacity`) and the instance is not released.

```java
Thread 537 (DataPusher-78931):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 16337
  IsDaemon: true
  Stack:
    java.lang.Thread.sleep(Native Method)
    org.apache.celeborn.client.write.DataPushQueue.takePushTasks(DataPushQueue.java:135)
    org.apache.celeborn.client.write.DataPusher$1.run(DataPusher.java:122)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Production testing

#### Current

<img width="547" alt="image" src="https://github.com/user-attachments/assets/d6f64257-144e-4139-96c6-518ca5f1bfd2">

#### PR

<img width="479" alt="image" src="https://github.com/user-attachments/assets/e4ff62ec-5b9d-47a4-a36c-1d13bf378cbc">
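#### Illustrative sketch

To make the leak described above concrete, here is a minimal, self-contained sketch. It is not the Celeborn code and not the actual patch: `PusherLeakSketch`, `SketchPusher`, and `runTask` are hypothetical names, and the buffer size and queue capacity are illustrative values rather than the real defaults. It only shows that a pusher which preallocates `bufferSize * queueCapacity` bytes keeps them reachable until `close` runs, and that a `finally` block is one way to guarantee `close` when the task body fails.

```java
import java.util.concurrent.LinkedBlockingQueue;

public class PusherLeakSketch {

    /** Hypothetical stand-in for a pusher that preallocates idle buffers. */
    static final class SketchPusher implements AutoCloseable {
        private final LinkedBlockingQueue<byte[]> idleQueue;
        private volatile boolean closed = false;

        SketchPusher(int bufferSize, int queueCapacity) {
            idleQueue = new LinkedBlockingQueue<>(queueCapacity);
            // Preallocate every buffer up front, so the queue pins
            // bufferSize * queueCapacity bytes for as long as it is reachable.
            for (int i = 0; i < queueCapacity; i++) {
                idleQueue.add(new byte[bufferSize]);
            }
        }

        /** Total bytes currently held by the idle queue. */
        long retainedBytes() {
            long total = 0;
            for (byte[] buffer : idleQueue) {
                total += buffer.length;
            }
            return total;
        }

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true;
            idleQueue.clear(); // drop the buffers so they can be collected
        }
    }

    /** Runs the task body and closes the pusher even if the body throws. */
    static void runTask(SketchPusher pusher, Runnable body) {
        try {
            body.run();
        } finally {
            pusher.close();
        }
    }

    public static void main(String[] args) {
        // Illustrative values only: 64 KiB buffers, 512 queue slots.
        SketchPusher pusher = new SketchPusher(64 * 1024, 512);
        System.out.println("retained before close: " + pusher.retainedBytes() + " bytes");

        try {
            // Simulate a task that is killed partway through.
            runTask(pusher, () -> {
                throw new IllegalStateException("task killed");
            });
        } catch (IllegalStateException expected) {
            // The failure still propagates; cleanup has already happened.
        }
        System.out.println("retained after close: " + pusher.retainedBytes() + " bytes");
    }
}
```

Without the `finally` block in `runTask`, the simulated failure would skip `close` and the buffers would stay pinned, which is the situation the thread dump above reflects.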
