pan3793 opened a new pull request, #2718:
URL: https://github.com/apache/celeborn/pull/2718

   Backport CELEBORN-1544 (https://github.com/apache/celeborn/pull/2661 and 
https://github.com/apache/celeborn/pull/2663) to branch-0.4 
   
   ### What changes were proposed in this pull request?
   This PR aims to fix a possible memory leak in ShuffleWriter.
   
   ### Why are the changes needed?
   When `spark.speculation=true` is enabled or a running SQL query is killed, the task may be interrupted, and `ShuffleWriter` may then never call `close`. 
   In that case, `DataPusher#idleQueue` keeps occupying memory (up to `celeborn.client.push.buffer.max.size` * `celeborn.client.push.queue.capacity`) and the instance is not released. The thread dump below shows such a `DataPusher` thread still alive and polling.
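
To give a sense of scale, the product above can be worked through in a few lines. This is only an illustrative sketch: `IdleQueueEstimate` and `retainedBytes` are hypothetical names, not Celeborn APIs, and the 64 KiB buffer size and queue capacity of 512 are assumed example values, so substitute whatever your deployment actually configures.

```java
/** Hypothetical helper for illustration only; not part of Celeborn. */
final class IdleQueueEstimate {

    /**
     * Upper bound on the bytes held by one idle queue:
     * buffer size multiplied by the number of queue slots.
     */
    static long retainedBytes(long bufferMaxSizeBytes, int queueCapacity) {
        // multiplyExact throws instead of silently overflowing.
        return Math.multiplyExact(bufferMaxSizeBytes, (long) queueCapacity);
    }

    public static void main(String[] args) {
        // Assumed example values: 64 KiB buffers, 512 queue slots.
        long bytes = retainedBytes(64L * 1024, 512);
        System.out.println(bytes / (1024 * 1024) + " MiB per unreleased pusher");
        // prints "32 MiB per unreleased pusher"
    }
}
```

Under these assumed values, every interrupted task whose writer is never closed would pin roughly 32 MiB until the executor exits.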
   
   ```java
   Thread 537 (DataPusher-78931):
     State: TIMED_WAITING
     Blocked count: 0
     Waited count: 16337
     IsDaemon: true
     Stack:
       java.lang.Thread.sleep(Native Method)
       
org.apache.celeborn.client.write.DataPushQueue.takePushTasks(DataPushQueue.java:135)
       org.apache.celeborn.client.write.DataPusher$1.run(DataPusher.java:122)
   ```
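
The general pattern behind this kind of leak can be sketched as follows. This is not the code changed by CELEBORN-1544: `SketchPusher` and `SketchTask` are hypothetical stand-ins that only assume a pusher owning pre-allocated buffers plus a polling daemon thread, and a task that releases them in a `finally` block so that an interruption still triggers cleanup.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical stand-in: owns pre-allocated buffers and a polling daemon thread. */
final class SketchPusher implements AutoCloseable {
    private final BlockingQueue<byte[]> idleQueue;
    private final Thread worker;
    private volatile boolean terminated = false;

    SketchPusher(int capacity, int bufferSize) {
        idleQueue = new ArrayBlockingQueue<>(capacity);
        for (int i = 0; i < capacity; i++) {
            idleQueue.add(new byte[bufferSize]);
        }
        worker = new Thread(() -> {
            // Polls until told to stop, like the sleeping thread in the dump above.
            while (!terminated) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    return;
                }
            }
        }, "SketchPusher");
        worker.setDaemon(true);
        worker.start();
    }

    int idleBuffers() { return idleQueue.size(); }

    boolean isRunning() { return worker.isAlive(); }

    @Override
    public void close() {
        terminated = true;
        worker.interrupt();
        try {
            worker.join(1000);
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status.
            Thread.currentThread().interrupt();
        }
        idleQueue.clear(); // drop buffer references so they can be collected
    }
}

/** Hypothetical task body: cleanup runs even when the write is interrupted. */
final class SketchTask {
    static void run(SketchPusher pusher, boolean simulateInterrupt) throws InterruptedException {
        try {
            if (simulateInterrupt) {
                throw new InterruptedException("task killed");
            }
        } finally {
            pusher.close();
        }
    }
}
```

Without the `finally` block, an interrupted task would leave the polling thread running and the idle buffers reachable, which is the symptom described above.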
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Production testing
   
   #### Current 
   <img width="547" alt="image" src="https://github.com/user-attachments/assets/d6f64257-144e-4139-96c6-518ca5f1bfd2">
   
   #### PR
   <img width="479" alt="image" src="https://github.com/user-attachments/assets/e4ff62ec-5b9d-47a4-a36c-1d13bf378cbc">
   
   

