devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-848967270
We discovered that certain non-default broker.conf settings seem to be key
to reproducing this bug, but we haven't yet identified which of them are
responsible.
```
maxUnackedMessagesPerConsumer=0
maxUnackedMessagesPerSubscription=0
bookkeeperEnableStickyReads=true
managedLedgerMaxEntriesPerLedger=10
managedLedgerMinLedgerRolloverTimeMinutes=0
managedLedgerReadEntryTimeoutSeconds=120
managedLedgerAddEntryTimeoutSeconds=120
```
It's obvious that `managedLedgerMaxEntriesPerLedger=10` puts significant
load on the bookies and results in a high volume of rollovers, but
interestingly, we weren't able to reproduce the freeze with only these configs:
```
bookkeeperEnableStickyReads=true
managedLedgerMaxEntriesPerLedger=10
managedLedgerMinLedgerRolloverTimeMinutes=0
managedLedgerReadEntryTimeoutSeconds=120
managedLedgerAddEntryTimeoutSeconds=120
```
It was only after adding these that the bug re-appeared:
```
maxUnackedMessagesPerConsumer=0
maxUnackedMessagesPerSubscription=0
```
We also noticed an ERROR message on the affected broker that consistently
appears (in Pulsar 2.6.2) at the exact time it freezes:
`2021-05-25T21:11:05,208 [bookkeeper-ml-scheduler-OrderedScheduler-6-0]
ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Failed to add
entry for ledger -1 in time-out 120 sec`
While it's frozen, after waiting for a while, a series of similar errors
will appear on that same broker but with an actual (not -1) ledgerId:
```
2021-05-25T21:15:05,208 [bookkeeper-ml-scheduler-OrderedScheduler-6-0]
ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Failed to add
entry for ledger 478895 in time-out 120 sec
2021-05-25T21:17:05,208 [bookkeeper-ml-scheduler-OrderedScheduler-6-0]
ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Failed to add
entry for ledger 478895 in time-out 120 sec
2021-05-25T21:19:05,208 [bookkeeper-ml-scheduler-OrderedScheduler-6-0]
ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Failed to add
entry for ledger 478895 in time-out 120 sec
2021-05-25T21:21:05,208 [bookkeeper-ml-scheduler-OrderedScheduler-6-0]
ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Failed to add
entry for ledger 478895 in time-out 120 sec
```
As soon as that broker is stopped, data will resume flowing briefly before
freezing again on a different broker, which will show the error `ERROR
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Failed to add entry for
ledger -1 in time-out 120 sec` when it freezes.
It looks like this issue is closely related to
https://github.com/apache/bookkeeper/issues/2716
We also noticed no clear JVM memory issues, though a few minutes _after_ the
freeze occurs, we did notice a memory leak that showed up after we added
`-Dio.netty.leakDetectionLevel=advanced
-Dpulsar.allocator.leak_detection=Advanced` arguments to `PULSAR_MEM`
At this point, it looks like the leak is a symptom, rather than a root
cause. It also occurs after several of those `ERROR
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Failed to add entry for
ledger 478895 in time-out 120 sec` logs have appeared on that broker. The leak
message was `ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release()
was not called before it's garbage-collected.`
The WARN-level logs around the leak are
[leak_log.txt](https://github.com/apache/pulsar/files/6548597/leak_log.txt).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]