devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-848967270


   We discovered that certain non-default broker.conf settings seem to be key 
to reproducing this bug, but we haven't yet identified which of them are 
responsible. 
   ```
       maxUnackedMessagesPerConsumer=0
       maxUnackedMessagesPerSubscription=0
       bookkeeperEnableStickyReads=true
       managedLedgerMaxEntriesPerLedger=10
       managedLedgerMinLedgerRolloverTimeMinutes=0
       managedLedgerReadEntryTimeoutSeconds=120
       managedLedgerAddEntryTimeoutSeconds=120
   ```
   It's obvious that `managedLedgerMaxEntriesPerLedger=10` puts significant 
load on the bookies and results in a high volume of rollovers, but 
interestingly, we weren't able to reproduce the freeze with only these configs:
   
   ```
       bookkeeperEnableStickyReads=true
       managedLedgerMaxEntriesPerLedger=10
       managedLedgerMinLedgerRolloverTimeMinutes=0
       managedLedgerReadEntryTimeoutSeconds=120
       managedLedgerAddEntryTimeoutSeconds=120
   ```
   
   It was only after adding these that the bug re-appeared:
   ```
       maxUnackedMessagesPerConsumer=0
       maxUnackedMessagesPerSubscription=0
   ```
   
   We also noticed an ERROR message on the affected broker that consistently 
appears (in Pulsar 2.6.2) at the exact time it freezes:
   
   `2021-05-25T21:11:05,208 [bookkeeper-ml-scheduler-OrderedScheduler-6-0] 
ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Failed to add 
entry for ledger -1 in time-out 120 sec`
   
   While it's frozen, after waiting for a while, a series of similar errors 
will appear on that same broker but with an actual (not -1) ledgerId:
   
   ```
       2021-05-25T21:15:05,208 [bookkeeper-ml-scheduler-OrderedScheduler-6-0] 
ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Failed to add 
entry for ledger 478895 in time-out 120 sec
       2021-05-25T21:17:05,208 [bookkeeper-ml-scheduler-OrderedScheduler-6-0] 
ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Failed to add 
entry for ledger 478895 in time-out 120 sec
       2021-05-25T21:19:05,208 [bookkeeper-ml-scheduler-OrderedScheduler-6-0] 
ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Failed to add 
entry for ledger 478895 in time-out 120 sec
       2021-05-25T21:21:05,208 [bookkeeper-ml-scheduler-OrderedScheduler-6-0] 
ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Failed to add 
entry for ledger 478895 in time-out 120 sec
   ```
   
   As soon as that broker is stopped, data will resume flowing briefly before 
freezing again on a different broker, which will show the error `ERROR 
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Failed to add entry for 
ledger -1 in time-out 120 sec` when it freezes. 
   
   It looks like this issue is closely related to 
https://github.com/apache/bookkeeper/issues/2716
   
   We also noticed no clear JVM memory issues, though a few minutes _after_ the 
freeze occurs, we did notice a memory leak that showed up after we added 
`-Dio.netty.leakDetectionLevel=advanced 
-Dpulsar.allocator.leak_detection=Advanced` arguments to `PULSAR_MEM`
   At this point, it looks like the leak is a symptom, rather than a root 
cause. It also occurs after several of those `ERROR 
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Failed to add entry for 
ledger 478895 in time-out 120 sec` logs have appeared on that broker. The leak 
message was `ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() 
was not called before it's garbage-collected.`
   
   The WARN-level logs around the leak are 
[leak_log.txt](https://github.com/apache/pulsar/files/6548597/leak_log.txt).
   
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to