rdhabalia opened a new pull request #7499: URL: https://github.com/apache/pulsar/pull/7499
### Motivation

We have seen several scenarios in which a broker sees a sudden, large spike in heap-memory usage, consumes all of its allocated heap, and eventually crashes with an OOM error. One such scenario is the broker failing to handle back-pressure caused by bookie add-entry timeouts. The broker limits the maximum number of pending messages per topic, but it does not limit the total number of pending messages across all topics. If the broker serves many topics with a high publish rate and, for some reason, starts seeing add-entry timeouts from the bk-client, it allocates a large number of non-recyclable objects, which causes heavy GC activity and eventually an OOM crash. We saw many brokers crash at the same time due to bookie network partitioning or high bookie add-entry latency.

The problem is easy to reproduce by simulating bookie behavior that causes the `Bookie operation timeout` error at the broker while publishing at a 30K-40K rate across 1K topics.

Therefore, we need a mechanism to handle bookie back-pressure at the broker by limiting the number of pending messages across all topics in the broker.
Broker error: add-entry timing out at the bk-client

```
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [prop/cluster/ns/persistent/t1] Created new ledger 123456
13:25:04.468 [BookKeeperClientWorker-OrderedExecutor-20-0] WARN org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (123456, 1): Bookie operation timeout
13:25:04.469 [BookKeeperClientWorker-OrderedExecutor-20-0] WARN org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (123456, 2): Bookie operation timeout
13:25:04.469 [BookKeeperClientWorker-OrderedExecutor-20-0] WARN org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (123456, 3): Bookie operation timeout
13:25:04.469 [BookKeeperClientWorker-OrderedExecutor-20-0] WARN org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (123456, 4): Bookie operation timeout
```

The broker then sees a sudden spike in heap-memory usage and crashes.

### Modification

- Add a configuration to restrict the total number of pending publish messages across all topics in a broker: `maxConcurrentPendingPublishMessages`.
- By default, this feature is disabled (value `0`) and does not change any existing behavior.
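To illustrate the idea of a broker-wide limit on pending publish messages, here is a minimal sketch. It is not the code from this PR: the class `PendingPublishLimiter` and its methods are hypothetical names, and the only detail taken from the description is that a limit of `0` disables the check. The sketch assumes a publish is admitted only while a shared counter is below the limit, and that the counter is decremented when the add-entry completes or fails.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical broker-wide limiter for pending publish messages.
 * Illustration only; this is not the implementation in the PR.
 */
final class PendingPublishLimiter {
    // Assumed to mirror maxConcurrentPendingPublishMessages; 0 disables the limit.
    private final long maxPending;
    // Pending publishes counted across all topics.
    private final AtomicLong pending = new AtomicLong();

    PendingPublishLimiter(long maxPending) {
        if (maxPending < 0) {
            throw new IllegalArgumentException("maxPending must be >= 0");
        }
        this.maxPending = maxPending;
    }

    /** Returns true if the publish may proceed, false if the broker-wide limit is reached. */
    boolean tryAcquire() {
        if (maxPending == 0) {
            // Limit disabled: always admit, but keep counting for visibility.
            pending.incrementAndGet();
            return true;
        }
        while (true) {
            long current = pending.get();
            if (current >= maxPending) {
                return false;
            }
            // CAS so that concurrent publishers cannot overshoot the limit.
            if (pending.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Call once for every successful tryAcquire, when the add-entry completes or fails. */
    void release() {
        pending.decrementAndGet();
    }

    long pendingCount() {
        return pending.get();
    }
}
```

In such a design, a caller that gets `false` from `tryAcquire()` would need to push back on the producer (for example, by pausing reads from its connection) rather than buffering the message. How the actual change applies back-pressure is not stated in this description.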
