FMX opened a new pull request, #3346:
URL: https://github.com/apache/celeborn/pull/3346

   ### What changes were proposed in this pull request?
   1. Limit the maximum size of `timerMetrics` in `AbstractSource`.
   2. Fix a concurrency issue that may leave elements behind in the timer metrics.
   3. Fix an issue that causes the worker to run out of memory if the metrics are not captured for a long time.
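
To illustrate the idea behind items 1 to 3, here is a minimal sketch of a size-capped timer metric queue. It is not the code from this PR: the class `BoundedTimerMetrics`, its methods, and the choice of `LinkedBlockingQueue` are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch; names and structure are assumptions, not Celeborn's actual code.
class BoundedTimerMetrics {
    // Bounded queue: once maxSize entries are queued, offer() returns false
    // instead of letting the queue grow without limit.
    private final LinkedBlockingQueue<String> timerMetrics;

    BoundedTimerMetrics(int maxSize) {
        this.timerMetrics = new LinkedBlockingQueue<>(maxSize);
    }

    /** Records one timer sample; returns false if it was dropped because the queue is full. */
    boolean record(String metricLine) {
        return timerMetrics.offer(metricLine);
    }

    /**
     * Moves the queued samples into a list for reporting. drainTo removes exactly
     * the elements it returns, so a sample added concurrently either lands in
     * this batch or stays queued for the next capture.
     */
    List<String> capture() {
        List<String> drained = new ArrayList<>();
        timerMetrics.drainTo(drained);
        return drained;
    }

    int size() {
        return timerMetrics.size();
    }
}
```

With a cap in place, a worker whose metrics are never scraped holds at most `maxSize` samples rather than accumulating them indefinitely. The trade-off in this sketch is that samples arriving while the queue is full are dropped.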
   
   
   ### Why are the changes needed?
   A long-running worker ran out of memory, and the heap dump showed that the metrics were taking up a huge share of the heap.
   As the screenshots below show, the biggest object is the timer metrics queue, which held 1.6 million records.
   <img width="1516" alt="Screenshot 2025-06-24 at 09 59 30" src="https://github.com/user-attachments/assets/691c7bc2-b974-4cc0-8d5a-bf626ab903c0" />
   <img width="1239" alt="Screenshot 2025-06-24 at 14 45 10" src="https://github.com/user-attachments/assets/ebdf5a4d-c941-4f1e-911f-647aa156b37a" />
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Tested on a cluster.
   


