[PR] [CELEBORN-1792] Congestion and MemoryManager should use pinnedDirectMemory instead of usedDirectMemory [celeborn]

via GitHub Fri, 20 Dec 2024 02:17:11 -0800


leixm opened a new pull request, #3018:
URL: https://github.com/apache/celeborn/pull/3018


   ### What changes were proposed in this pull request?
   Congestion and MemoryManager should use pinnedDirectMemory instead of 
usedDirectMemory
   
   
   ### Why are the changes needed?
   In our production environment, after a worker pauses, usedDirectMemory 
stays high and does not decrease. The worker node is permanently blacklisted 
and cannot be used.
   
   This problem has been bothering us for a long time. Even when the thread 
cache is turned off, **after ctx.channel().config().setAutoRead(false), the 
netty framework still holds some ByteBufs**. These ByteBufs are allocated from 
chunks in PoolArena, and the chunks that back them cannot be released.
   
   In netty, if a chunk is 16M and 8k of this chunk has been allocated, then 
the pinnedMemory is 8k and the activeMemory is 16M. The remaining (16M-8k) 
is available for allocation but has not been allocated yet. Because netty 
allocates and releases memory in chunk units, the 8k that has been allocated 
keeps the whole 16M from being returned to the operating system.
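
The difference between the two numbers can be sketched with a small toy model. It does not call netty; `ChunkAccounting`, `activeBytes` and `pinnedBytes` are hypothetical names used only for illustration. In netty itself, as far as we know, the corresponding values are reported by `PooledByteBufAllocatorMetric.usedDirectMemory()` and `PooledByteBufAllocator.pinnedDirectMemory()`.

```java
// Toy model of chunk-level accounting; an illustration only, not netty code.
final class ChunkAccounting {

    /** Memory reserved from the OS: every live chunk counts in full. */
    static long activeBytes(long chunkSize, int liveChunks) {
        return chunkSize * liveChunks;
    }

    /** Memory handed out to buffers that are still held, summed over chunks. */
    static long pinnedBytes(long[] allocatedPerChunk) {
        long sum = 0;
        for (long bytes : allocatedPerChunk) {
            sum += bytes;
        }
        return sum;
    }

    public static void main(String[] args) {
        long chunkSize = 16L * 1024 * 1024;   // the 16M chunk from the example above
        long[] allocated = {8L * 1024};       // one chunk with 8k still allocated

        System.out.println("active = " + activeBytes(chunkSize, allocated.length));
        System.out.println("pinned = " + pinnedBytes(allocated));
    }
}
```

In this model a single 8k buffer keeps a full 16M chunk counted as active, which is why a threshold based on active (used) memory can stay high while the memory that is actually pinned is small.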
   
   Here are some scenes from our production/test environment:
   
   We configured 10 GB of off-heap memory for the worker; the other configs are as below:
   ```
   celeborn.network.memory.allocator.allowCache                         false
   celeborn.worker.monitor.memory.check.interval                         100ms
   celeborn.worker.monitor.memory.report.interval                        10s
   celeborn.worker.directMemoryRatioToPauseReceive                       0.75
   celeborn.worker.directMemoryRatioToPauseReplicate                     0.85
   celeborn.worker.directMemoryRatioToResume                             0.5
   ```
   
   When receiving high traffic, the worker's usedDirectMemory increases. After 
trim and pause are triggered, usedDirectMemory still does not fall to the 
resume threshold, and the worker is excluded.
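
A simplified sketch of how these thresholds interact is shown below. It is not the worker's actual MemoryManager code: the class, the enum, the `>=` comparisons and the memory readings in `main` are all assumptions made for illustration. Only the three ratios and the 10 GB limit come from the configuration above.

```java
// Simplified pause/resume hysteresis; names and readings are hypothetical.
final class PauseResumeSketch {
    enum State { NORMAL, PAUSED_RECEIVE, PAUSED_REPLICATE }

    static final double RATIO_TO_PAUSE_RECEIVE = 0.75;
    static final double RATIO_TO_PAUSE_REPLICATE = 0.85;
    static final double RATIO_TO_RESUME = 0.5;

    /** Next state for a given memory reading, measured against the limit. */
    static State next(State current, long memoryBytes, long maxBytes) {
        double ratio = (double) memoryBytes / maxBytes;
        if (ratio >= RATIO_TO_PAUSE_REPLICATE) return State.PAUSED_REPLICATE;
        if (ratio >= RATIO_TO_PAUSE_RECEIVE) return State.PAUSED_RECEIVE;
        // Once paused, stay paused until the reading drops below the resume ratio.
        if (current != State.NORMAL && ratio >= RATIO_TO_RESUME) return current;
        return State.NORMAL;
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        long max = 10 * gib;                  // 10 GB off-heap limit

        // Hypothetical readings, chosen only to show the behaviour.
        long usedUnderLoad = 8 * gib;         // ratio 0.8: pause receive
        long usedAfterTrim = 6 * gib;         // ratio 0.6: still above resume
        long pinnedAfterTrim = 300L * 1024 * 1024;

        State s = next(State.NORMAL, usedUnderLoad, max);
        System.out.println("under load:        " + s);
        System.out.println("after trim (used): " + next(s, usedAfterTrim, max));
        System.out.println("after trim (pinned): " + next(s, pinnedAfterTrim, max));
    }
}
```

With a reading based on used memory the sketch stays paused after the trim, while the same state fed with pinned memory returns to normal, which is the behaviour this change is aiming for.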
   
   
![image](https://github.com/user-attachments/assets/40f5609e-fbf9-4841-84ec-69a69256edf4)
   
   So we checked a heap snapshot of the abnormal worker and found a large 
number of DirectByteBuffers in heap memory. These DirectByteBuffers are all 
4 MB in size, which is exactly the chunk size. According to the path to GC 
root, each DirectByteBuffer is held by a PoolChunk, and these 4 MB chunks have 
only 160k pinnedBytes.
   
   
![image](https://github.com/user-attachments/assets/3d755ef3-164c-4b5b-bec1-aaf039c0c0a5)
   
   
![image](https://github.com/user-attachments/assets/17907753-2f42-4617-a95e-1ee980553fb9)
   
   Many ByteBufs have not been released.
   
   
![image](https://github.com/user-attachments/assets/b87eb1a9-313f-4f42-baa8-227fd49c19b6)
   
   The stack shows that these ByteBufs are allocated by netty.
   
![image](https://github.com/user-attachments/assets/f8783f99-507a-44a8-9de5-7215a5eed1db)
   
   We tried to reproduce this situation in the test environment. When the same 
problem occurred, we used a RESTful API that we had added to the worker to 
force it to resume. After the resume, the worker returned to normal, and 
PushDataHandler handled many delayed requests.
   
   
![image](https://github.com/user-attachments/assets/be37039b-97b8-4ae8-a64f-a2003bea613e)
   
   
![image](https://github.com/user-attachments/assets/24b1c8ad-131c-4bd6-adcb-bad655cfbdbf)
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Existing UTs.
   


