lhotari opened a new issue #10738:
URL: https://github.com/apache/pulsar/issues/10738


   **Describe the bug**
   
   Broker (tested with 2.6.3, 2.7.2, and the master branch) runs out of direct memory under certain conditions.
   The Netty leak detector reports leaks (enabled with `-Dpulsar.allocator.leak_detection=Advanced -Dio.netty.leakDetectionLevel=advanced`).
   
   ```
   11:53:26.531 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR 
io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called 
before it's garbage-collected. See 
https://netty.io/wiki/reference-counted-objects.html for more information.
   11:53:26.538 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR 
io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called 
before it's garbage-collected. See 
https://netty.io/wiki/reference-counted-objects.html for more information.
   11:53:26.544 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR 
io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called 
before it's garbage-collected. See 
https://netty.io/wiki/reference-counted-objects.html for more information.
   11:53:51.894 [pulsar-io-24-1] ERROR io.netty.util.ResourceLeakDetector - 
LEAK: ByteBuf.release() was not called before it's garbage-collected. See 
https://netty.io/wiki/reference-counted-objects.html for more information.
   11:53:51.898 [pulsar-io-24-1] ERROR io.netty.util.ResourceLeakDetector - 
LEAK: ByteBuf.release() was not called before it's garbage-collected. See 
https://netty.io/wiki/reference-counted-objects.html for more information.
   11:53:51.901 [pulsar-io-24-1] ERROR io.netty.util.ResourceLeakDetector - 
LEAK: ByteBuf.release() was not called before it's garbage-collected. See 
https://netty.io/wiki/reference-counted-objects.html for more information.
   11:53:51.902 [pulsar-io-24-1] ERROR io.netty.util.ResourceLeakDetector - 
LEAK: ByteBuf.release() was not called before it's garbage-collected. See 
https://netty.io/wiki/reference-counted-objects.html for more information.
   11:53:51.903 [pulsar-io-24-1] ERROR io.netty.util.ResourceLeakDetector - 
LEAK: ByteBuf.release() was not called before it's garbage-collected. See 
https://netty.io/wiki/reference-counted-objects.html for more information.
   11:53:51.904 [pulsar-io-24-1] ERROR io.netty.util.ResourceLeakDetector - 
LEAK: ByteBuf.release() was not called before it's garbage-collected. See 
https://netty.io/wiki/reference-counted-objects.html for more information.
   11:53:51.905 [pulsar-io-24-1] ERROR io.netty.util.ResourceLeakDetector - 
LEAK: ByteBuf.release() was not called before it's garbage-collected. See 
https://netty.io/wiki/reference-counted-objects.html for more information.
   11:53:56.585 [pulsar-io-24-2] ERROR org.apache.pulsar.PulsarBrokerStarter - 
-- Shutting down - Received OOM exception: failed to allocate 16777216 byte(s) 
of direct memory (used: 1056964615, max: 1073741824)
   ```
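   For reference, a sketch of how the leak-detection flags above can be enabled when starting a broker directly; `PULSAR_EXTRA_OPTS` is the hook picked up by `conf/pulsar_env.sh` (in a Helm deployment the equivalent JVM options would go into the chart's broker configuration instead):
   ```
   # enable Netty/Pulsar leak detection on the broker JVM before starting it
   export PULSAR_EXTRA_OPTS="-Dpulsar.allocator.leak_detection=Advanced -Dio.netty.leakDetectionLevel=advanced"
   bin/pulsar broker
   ```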
   
   
![image](https://user-images.githubusercontent.com/66864/119982448-85da3780-bfc7-11eb-805e-64acba567d32.png)
   
   
   
   **To Reproduce**
   Steps to reproduce the behavior:
   
   In the first terminal window (leave the terminal open):
   ```
   # check out the repro repo
   git clone https://github.com/lhotari/pulsar-issue-6054-repro
   cd pulsar-issue-6054-repro
   
   # deploy to minikube (or a similar local environment) with enough RAM (8 GB?)
   helm repo add apache https://pulsar.apache.org/charts
   helm repo update
   ./helm-deployment/redeploy_1node_cluster.sh
   
   # wait until the deployment has completed, then forward ports 8080 and 6650
   # to the deployed environment by running
   ./helm-deployment/pulsar_port_forward.sh
   ```
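   For reference, the port-forward script is roughly equivalent to the following (a sketch; the namespace and pod name match the `kubectl` command further below, but may differ in other environments):
   ```
   # a sketch of what ./helm-deployment/pulsar_port_forward.sh roughly does
   kubectl -n pulsar-testenv port-forward pod/pulsar-testenv-deployment-broker-0 8080:8080 6650:6650
   ```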
   
   In each of the following terminal windows, configure pulsar-admin / pulsar-perf connectivity to the deployed environment:
   ```
   # set pulsar-client / pulsar-admin / pulsar-perf settings
   export PULSAR_CLIENT_CONF=$PWD/helm-deployment/client.conf
   
   # before continuing, ensure that pulsar-admin and pulsar-perf are on the PATH
   # for example
   export PATH="$HOME/apache-pulsar-2.7.1/bin:$PATH"
   ```
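   The referenced `client.conf` presumably points the clients at the forwarded ports, roughly:
   ```
   # a sketch of helm-deployment/client.conf, assuming the port-forwards above
   webServiceUrl=http://localhost:8080/
   brokerServiceUrl=pulsar://localhost:6650/
   ```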
   
   Deploy the functions used in the repro:
   ```
   ./deploy_function_chain.sh
   ```
   
   Open two terminal windows with `PULSAR_CLIENT_CONF` set and `pulsar-perf` on the `PATH`.
   
   Run the producer:
   ```
   ./produce.sh
   ```
   
   Run the consumer:
   ```
   ./consume.sh
   ```
   
   In one more terminal window, grep the broker logs:
   ```
   kubectl -n pulsar-testenv logs -f pod/pulsar-testenv-deployment-broker-0 | egrep 'LEAK|OOM'
   ```
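   Direct memory usage can also be polled from the broker's Prometheus endpoint (a sketch, assuming port 8080 is forwarded as above; the exact metric names can vary by version):
   ```
   # watch the broker's direct memory usage via the metrics endpoint
   watch -n 5 "curl -s http://localhost:8080/metrics/ | grep jvm_memory_direct"
   ```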
   
   **Expected behavior**
   
   Direct memory shouldn't leak, even when the system is under load. Netty buffers should be released before they are discarded.
   
   **Additional context**
   
   The broker intentionally uses a special configuration that makes the issue easier to reproduce:
   ```
   broker:
     configData:
       backlogQuotaDefaultRetentionPolicy: producer_exception
       bookkeeperEnableStickyReads: 'true'
       brokerDeduplicationEnabled: 'false'
       brokerDeleteInactiveTopicsEnabled: 'false'
       exposeConsumerLevelMetricsInPrometheus: 'false'
       exposeTopicLevelMetricsInPrometheus: 'true'
       managedLedgerAddEntryTimeoutSeconds: '120'
       managedLedgerMaxEntriesPerLedger: '10'
       managedLedgerMaxLedgerRolloverTimeMinutes: '2'
       managedLedgerMinLedgerRolloverTimeMinutes: '0'
       managedLedgerReadEntryTimeoutSeconds: '120'
       topicFencingTimeoutSeconds: '5'
       maxUnackedMessagesPerConsumer: '0'
       maxUnackedMessagesPerSubscription: '0'   
   ```
   The main configuration setting for triggering the problem is `managedLedgerMaxEntriesPerLedger=10`. This is not a value to use in production; the current default for `managedLedgerMaxEntriesPerLedger` is `50000`. However, such a low value stress-tests the ledger rollover functionality in the broker, and this reveals that there is some execution path where Netty buffers are discarded without being released. That shouldn't happen even when the broker is overloaded, and the expectation is to fix that issue first.
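   For rough scale: assuming the repro producer runs near pulsar-perf's default rate of 100 msg/s, a 10-entry ledger rolls over about ten times per second per partition. The rollovers can be watched in the broker log (a sketch; the grep pattern is an assumption, since the exact ledger-rollover log message text varies by version):
   ```
   # a sketch: watch ledger rollovers in the broker log; the grep pattern is an
   # assumption about the managed ledger's rollover log messages
   kubectl -n pulsar-testenv logs -f pod/pulsar-testenv-deployment-broker-0 \
     | grep -iE 'created new ledger|closing ledger'
   ```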
   
   This test scenario also exposes another problem. Sometimes the broker gets into a state where direct memory doesn't run out, and instead the random1-in topic "locks up": `stats-internal` shows a `LedgerClosing` state for some partitions, and producing is blocked while the topic is in this state. This issue might be related to #6054.
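   The stuck state can be inspected with `pulsar-admin` (a sketch; the tenant/namespace `public/default` and the partition suffix are assumptions and need to match what the repro scripts create):
   ```
   # inspect the internal stats of a stuck partition; adjust the topic name
   # to match the repro scripts
   pulsar-admin topics stats-internal persistent://public/default/random1-in-partition-0
   ```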
   
   The direct memory leak issue might be related to #9562 or #10136.
   

