dao-jun opened a new pull request, #4556:
URL: https://github.com/apache/bookkeeper/pull/4556

   Related to: 
   https://github.com/apache/pulsar/issues/12169
   https://github.com/apache/pulsar/issues/9562
   https://github.com/apache/pulsar/issues/10439
   https://github.com/apache/bookkeeper/pull/3139
   https://github.com/apache/pulsar/issues/14861
   and others.
   
   ### Background
   
   Our customer runs a cluster with 12 bookie nodes and 12 broker nodes.
   Pulsar version: 2.6.3
   Bookkeeper: 4.11.1
   
   They enabled the BookKeeper client's addEntryTimeout feature and set 
`addEntryTimeoutSec` to 30.
   
   At first, their EWA (ensemble size, write quorum, ack quorum) was 332, and 
they encountered broker OOM exceptions.
   Following https://github.com/apache/pulsar/issues/12169, we recommended 
that they set EWA to 222 and observe for a period of time.
   
   After a few days, they again encountered broker OOM exceptions.
   
   So we suspected that the broker might have a memory leak and asked them to 
enable the Netty ByteBuf leak detector (add 
`-Dpulsar.allocator.leak_detection=Paranoid` to their broker JVM args and 
restart).
   
   But when we searched for the `LEAK` keyword in their broker logs, there 
were no related entries, which suggests there was no memory leak in their 
brokers.
   
   We found log entries like `New ensemble: [aaa,bbb] is not adhering to 
Placement Policy. quarantinedBookies: [xxx]` in their logs, and 
`quarantinedBookies` was always the same.
   
   We checked the monitoring of this bookie and found that it had received no 
incoming traffic for a long time (weeks), so we tried to restart it. The 
bookie did not shut down for a long time, until we used `kill -9`, which 
suggests it had run into blocked threads or something similar and could not 
respond to requests.
   
   After we restarted the bookie, no more broker OOMs occurred and the brokers 
ran normally.
   
   When I analyzed the broker heap dump, I found that some Netty channels held 
a large amount of direct memory, and all of these channels were connected to 
that quarantined bookie:
   
![image](https://github.com/user-attachments/assets/656327ec-caf8-4736-a525-4472a3ea08e7)
   There are 6 channels, each retaining over 100 MB of direct memory.
   
   Because our customer enabled the `addEntryTimeout` feature, broker 
backpressure does not work in this case.
   Enabling `busywait` keeps the situation from escalating, but it does not 
solve the root cause.
   If we set EWA to 332 and one bookie is SLOW or HANGING, an OOM can still 
happen.
   If we set EWA to 222 and disable `addEntryTimeout`, and one bookie is SLOW 
or HANGING, the broker may be unable to serve requests.
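
   The difference between the 332 and 222 cases above follows from ack-quorum 
semantics: an add completes once A of the W bookies in the write set have 
acknowledged it. The snippet below is only an illustrative model of that rule; 
`QuorumSketch` and `addCanComplete` are hypothetical names and are not part of 
BookKeeper.

```java
// Illustrative model only; these names are not BookKeeper APIs.
final class QuorumSketch {

    /**
     * Returns true if an add can still complete when `unresponsive` bookies
     * in the write set never acknowledge it.
     */
    static boolean addCanComplete(int writeQuorum, int ackQuorum, int unresponsive) {
        return writeQuorum - unresponsive >= ackQuorum;
    }

    public static void main(String[] args) {
        // W=3, A=2, one hung bookie: adds keep completing, so producers keep
        // sending and writes to the hung bookie stay pending.
        System.out.println(addCanComplete(3, 2, 1)); // prints "true"

        // W=2, A=2, one hung bookie: no add can complete until the bookie
        // answers or is replaced.
        System.out.println(addCanComplete(2, 2, 1)); // prints "false"
    }
}
```

   With 332, traffic continues while one bookie is unresponsive, which is 
consistent with memory building up on the channels to that bookie. With 222, 
every add waits on the slow bookie.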
   
   
   ### Motivation
   
   Fix an OOM in the BookKeeper client that can occur when a bookie in the 
ensemble is SLOW or HANGING.
   
   ### Changes
   
   Close all channels connected to a quarantined bookie to release their 
memory.
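
   As a rough sketch of the idea (not the code in this PR), the model below 
tracks channels per bookie and closes every channel to a quarantined bookie so 
that its pending buffers are dropped. All names here (`QuarantineChannelSketch`, 
`FakeChannel`, `closeChannelsOf`) are hypothetical and do not correspond to 
BookKeeper or Netty classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model; it does not use BookKeeper or Netty types.
final class QuarantineChannelSketch {

    /** Stand-in for a network channel that retains pending outbound bytes. */
    static final class FakeChannel {
        long pendingBytes;
        boolean open = true;

        FakeChannel(long pendingBytes) {
            this.pendingBytes = pendingBytes;
        }

        void close() {
            open = false;
            pendingBytes = 0; // closing drops whatever was still queued
        }
    }

    private final Map<String, List<FakeChannel>> channelsByBookie = new HashMap<>();

    void register(String bookieId, FakeChannel channel) {
        channelsByBookie.computeIfAbsent(bookieId, k -> new ArrayList<>()).add(channel);
    }

    /** Closes every channel to the given bookies and returns the bytes released. */
    long closeChannelsOf(Set<String> quarantinedBookies) {
        long released = 0;
        for (String bookieId : quarantinedBookies) {
            List<FakeChannel> channels = channelsByBookie.remove(bookieId);
            if (channels == null) {
                continue; // no open channels to this bookie
            }
            for (FakeChannel channel : channels) {
                released += channel.pendingBytes;
                channel.close();
            }
        }
        return released;
    }

    /** Total bytes still retained by channels that remain registered. */
    long retainedBytes() {
        long total = 0;
        for (List<FakeChannel> channels : channelsByBookie.values()) {
            for (FakeChannel channel : channels) {
                total += channel.pendingBytes;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        QuarantineChannelSketch pool = new QuarantineChannelSketch();
        pool.register("bookie-x", new FakeChannel(100L * 1024 * 1024));
        pool.register("bookie-y", new FakeChannel(1024));
        long released = pool.closeChannelsOf(Collections.singleton("bookie-x"));
        System.out.println(released + " bytes released, " + pool.retainedBytes() + " retained");
    }
}
```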
   

