RaulGracia opened a new issue #2521:
URL: https://github.com/apache/bookkeeper/issues/2521
**BUG REPORT**
***Describe the bug***
We are executing a set of experiments with Bookkeeper 4.9.2 (via Pulsar).
The experiments consist of a set of 5-minute write/read workloads at 250MBps
(1KB events). In each 5-minute workload, we vary the number of
producers/readers (1, 10, 50, 100) and topic partitions (1, 10, 100, 500, 1000,
5000). We use 3 Bookies (jointly deployed with 3 Pulsar Brokers) and we write
to Bookkeeper using `writeQuorum=ensembleSize=3` and `ackQuorumSize=2` (more
details about the deployment/experiments are provided below).
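For concreteness, a minimal sketch of the arithmetic these settings imply (decimal units for MB/KB are an assumption):

```java
public class WorkloadMath {
    public static void main(String[] args) {
        // 250 MBps of 1KB events (decimal units assumed).
        long targetBytesPerSec = 250_000_000L;
        long eventSizeBytes = 1_000L;
        long eventsPerSec = targetBytesPerSec / eventSizeBytes; // 250,000 events/s

        // With writeQuorum=3 and ackQuorumSize=2, each add completes once
        // 2 of the 3 bookies acknowledge, so one slow bookie per entry is
        // masked from the client -- at least until resources run out.
        int writeQuorum = 3;
        int ackQuorum = 2;
        int slowBookiesMasked = writeQuorum - ackQuorum; // 1

        System.out.println(eventsPerSec);
        System.out.println(slowBookiesMasked);
    }
}
```

In other words, the cluster sustains roughly 250,000 adds per second while one lagging bookie can fall behind without the client noticing.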
After some time (30-45 mins), we observe one Bookie crashing due to
`OutOfMemory`:
```
Dec 10 12:38:46 ip-10-0-0-187.us-west-2.compute.internal pulsar[13379]:
12:38:46.808 [BookieWriteThreadPool-OrderedExecutor-4-0] INFO
org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage - Write cache is full,
triggering flush
Dec 10 12:38:48 ip-10-0-0-187.us-west-2.compute.internal pulsar[13379]:
12:38:48.829 [BookieWriteThreadPool-OrderedExecutor-5-0] INFO
org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage - Write cache is full,
triggering flush
Dec 10 12:38:50 ip-10-0-0-187.us-west-2.compute.internal pulsar[13379]:
12:38:50.172 [db-storage-12-1] INFO
org.apache.bookkeeper.bookie.EntryLogManagerBase - Creating a new entry log
file : createNewLog = false, reachEntryLogLimit = true
Dec 10 12:38:50 ip-10-0-0-187.us-west-2.compute.internal pulsar[13379]:
12:38:50.178 [pool-5-thread-1] INFO
org.apache.bookkeeper.bookie.EntryLoggerAllocator - Created new entry log file
/mnt/storage/current/1b8.log for logId 440.
Dec 10 12:38:50 ip-10-0-0-187.us-west-2.compute.internal pulsar[13379]:
12:38:50.175 [db-storage-12-1] INFO
org.apache.bookkeeper.bookie.EntryLogManagerBase - Flushing entry logger 438
back to filesystem, pending for syncing entry loggers :
[BufferedChannel{logId=438, logFile=/mnt/storage/current/1b6.log,
ledgerIdAssigned=-1}].
Dec 10 12:38:50 ip-10-0-0-187.us-west-2.compute.internal pulsar[13379]:
12:38:50.962 [db-storage-12-1] INFO
org.apache.bookkeeper.bookie.EntryLogManagerForSingleEntryLog - Synced entry
logger 438 to disk.
Dec 10 12:38:52 ip-10-0-0-187.us-west-2.compute.internal pulsar[13379]:
12:38:52.636 [bookie-io-1-6] ERROR org.apache.bookkeeper.proto.BookieServer -
Unable to allocate memory, exiting bookie
Dec 10 12:38:52 ip-10-0-0-187.us-west-2.compute.internal pulsar[13379]:
java.lang.OutOfMemoryError: Direct buffer memory
Dec 10 12:38:52 ip-10-0-0-187.us-west-2.compute.internal pulsar[13379]: at
java.nio.Bits.reserveMemory(Bits.java:175) ~[?:?]
Dec 10 12:38:52 ip-10-0-0-187.us-west-2.compute.internal pulsar[13379]: at
java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118) ~[?:?]
Dec 10 12:38:52 ip-10-0-0-187.us-west-2.compute.internal pulsar[13379]: at
java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317) ~[?:?]
```
By default, we use 6GB of `DirectMemory` in the JVM options. So, in a
subsequent experiment, we increased that number to 24GB, but we observed the
same problem again:
```
Dec 10 15:42:31 ip-10-0-0-47.us-west-2.compute.internal pulsar[13375]:
15:42:31.327 [db-storage-12-1] INFO org.apache.bookkeeper.bookie.Journal -
garbage collected journal 1764d280e03.txn
Dec 10 15:42:31 ip-10-0-0-47.us-west-2.compute.internal pulsar[13375]:
15:42:31.355 [BookieWriteThreadPool-OrderedExecutor-5-0] INFO
org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage - Write cache is full,
triggering flush
Dec 10 15:42:33 ip-10-0-0-47.us-west-2.compute.internal pulsar[13375]:
15:42:33.533 [db-storage-12-1] INFO org.apache.bookkeeper.bookie.Journal -
garbage collected journal 1764d280de6.txn
Dec 10 15:42:34 ip-10-0-0-47.us-west-2.compute.internal pulsar[13375]:
15:42:34.019 [db-storage-cleanup-13-1] INFO
org.apache.bookkeeper.bookie.storage.ldb.EntryLocationIndex - Deleted indexes
for 18879174 entries from 28 ledgers in 13.935 seconds
Dec 10 15:42:42 ip-10-0-0-47.us-west-2.compute.internal pulsar[13375]:
15:42:42.674 [bookie-io-1-17] ERROR org.apache.bookkeeper.proto.BookieServer -
Unable to allocate memory, exiting bookie
Dec 10 15:42:42 ip-10-0-0-47.us-west-2.compute.internal pulsar[13375]:
java.lang.OutOfMemoryError: Direct buffer memory
Dec 10 15:42:42 ip-10-0-0-47.us-west-2.compute.internal pulsar[13375]: at
java.nio.Bits.reserveMemory(Bits.java:175) ~[?:?]
Dec 10 15:42:42 ip-10-0-0-47.us-west-2.compute.internal pulsar[13375]: at
java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118) ~[?:?]
Dec 10 15:42:42 ip-10-0-0-47.us-west-2.compute.internal pulsar[13375]: at
java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317) ~[?:?]
```
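For reference, the direct-memory cap mentioned above is the JVM's `-XX:MaxDirectMemorySize` flag. A sketch of how we raise it in a Pulsar deployment (the file and variable names, `conf/pulsar_env.sh` and `PULSAR_MEM`, reflect our setup and may differ elsewhere):

```shell
# conf/pulsar_env.sh -- heap and direct-memory settings for the bookie JVM.
# Values are illustrative; 24g matches the second experiment above.
PULSAR_MEM="-Xms8g -Xmx8g -XX:MaxDirectMemorySize=24g"
```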
Notably, in the second reproduction, we can observe that the logs stall
from `15:42:34.019` to `15:42:42.674` (+8 seconds), when the
`OutOfMemoryError` appears. On the line just before, there is
activity related to garbage collection in `DbLedgerStorage`. Before
reaching this point, the benchmark suite has already executed multiple
experiments with hundreds or thousands of Pulsar partitions. As each topic is
deleted after its 5-minute workload, could there be some issue related to the
memory resources required by the deletion of data from `DbLedgerStorage` when
thousands of ledgers are eligible for collection? If this keeps the Bookie too
busy while the benchmark continues to write and read data, then direct memory
could be exhausted (e.g., by Netty allocating buffers for writes).
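One way to check this hypothesis would be to watch JDK-tracked direct buffer usage over time. The stack trace above goes through `java.nio.Bits.reserveMemory`, which is the counter behind the standard `BufferPoolMXBean`, so a small probe like the following sketch could be polled periodically (note that Netty arenas allocated outside the JDK counter would not show up here):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class DirectMemoryProbe {
    public static void main(String[] args) {
        // Enumerate the JVM's buffer pools ("direct" and "mapped") and
        // print their current usage against total reserved capacity.
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            System.out.printf("%s: count=%d used=%d bytes capacity=%d bytes%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
```

Running this (e.g., via a JMX exporter) alongside the benchmark would show whether the "direct" pool climbs steadily toward `MaxDirectMemorySize` during the stall window.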
***To Reproduce***
We deployed on AWS using 10 worker VMs (half for producers and half for
readers) running the OpenMessaging Benchmark (c5.4xlarge) and 3 VMs
(i3.16xlarge) jointly running Pulsar Brokers/Bookies. The instructions for
deployment can be found here: https://github.com/pravega/openmessaging-benchmark
The logs for the experiments can be found here:
[experiment-1-6GB-direct-memory.zip](https://github.com/apache/bookkeeper/files/5710571/experiment-1-6GB-direct-memory.zip)
[experiment-2-24GB-direct-memory.zip](https://github.com/apache/bookkeeper/files/5710570/experiment-2-24GB-direct-memory.zip)
***Expected behavior***
In our experiments, we have verified that Bookkeeper can write faster
than 250MBps in that environment (400MBps+), so we do not believe the
problem is due to the system being pushed too hard ([see this blog post as a
reference](https://blog.pravega.io/2020/10/01/when-speeding-makes-sense-fast-consistent-durable-and-scalable-streaming-data-with-pravega/)).
In fact, it works fine for a long time (30-45 mins). However, using
`ackQuorumSize=2` may make it visible (via an `OutOfMemory` error)
that the Bookie sometimes gets stuck, perhaps while garbage collecting data
from thousands of ledgers in `DbLedgerStorage`. If this hypothesis is
correct, the expected behavior is that garbage collection should not
induce this kind of problem under moderate/high loads.
***Screenshots***
n/a
***Additional context***
I will try the same experiment with `ackQuorumSize=3` and will update this
issue with the results.