lhotari opened a new issue #8138:
URL: https://github.com/apache/pulsar/issues/8138


   ## Description
   
   There's a memory leak in the Pulsar Java Client that occurs under high load. It happens when the Reader API is used with a large number of short-lived Reader instances (created, used and closed via the async API) and the Pulsar server side (brokers/bookies) is under heavy load and doesn't respond to all requests because it is overloaded.
   
   The symptom is that heap memory consumption grows until an out-of-memory error occurs.
   After running out of memory, the system is sometimes able to resume operations. After some time the memory gets freed, since some behavior closes the connection (perhaps related to maxNumberOfRejectedRequestPerConnection). Closing the connection releases all the memory tied to ClientCnx and the system resumes. However, GC uses about 50% of CPU before the system stalls completely.
   
   Analysing the heap dumps shows that a lot of CompletableFutures accumulate in pendingGetLastMessageIdRequests and never get removed.
   
   This is happening in an application that uses the Reader API extensively with short-lived Reader instances: a Reader is created, used and then closed, all through the asynchronous API.
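   A minimal sketch of that usage pattern (service URL and topic name are placeholders; error handling omitted). As far as I can tell, hasMessageAvailableAsync() is one of the calls that issues a get last message id request internally:

   ```java
   import java.util.concurrent.CompletableFuture;

   import org.apache.pulsar.client.api.MessageId;
   import org.apache.pulsar.client.api.PulsarClient;

   public class ShortLivedReaderExample {
       public static void main(String[] args) throws Exception {
           PulsarClient client = PulsarClient.builder()
                   .serviceUrl("pulsar://localhost:6650") // placeholder service URL
                   .build();

           // Created, used and closed, all through the async API. Under heavy broker load
           // the get last message id request triggered by hasMessageAvailableAsync() may
           // never be answered, leaving an entry in pendingGetLastMessageIdRequests.
           CompletableFuture<Void> done = client.newReader()
                   .topic("my-topic") // placeholder topic
                   .startMessageId(MessageId.earliest)
                   .createAsync()
                   .thenCompose(reader ->
                           reader.hasMessageAvailableAsync()
                                   .thenCompose(hasMessage -> hasMessage
                                           ? reader.readNextAsync().thenApply(msg -> (Void) null)
                                           : CompletableFuture.<Void>completedFuture(null))
                                   .whenComplete((v, ex) -> reader.closeAsync()));

           done.join();
           client.close();
       }
   }
   ```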
   
   The pending get last message id requests originate from the Reader API usage. Looking at the Pulsar Java client source code, it appears that closing the Reader doesn't remove its pending get last message id requests from the ClientCnx. The CompletableFutures held in ClientCnx's pendingGetLastMessageIdRequests therefore keep a strong reference to the Reader's underlying ConsumerImpl, which prevents it from being garbage collected.
   pendingGetLastMessageIdRequests also has no timeout handling in ClientCnx like there is for pendingLookupRequests or pendingRequests.
   Since each ConsumerImpl consumes a lot of memory (#7680), the heap fills up quickly and the JVM runs out of memory.
   When a ClientCnx is closed, the memory gets released, which is why the system is able to resume after an OOM.
   However, the client becomes almost completely unavailable, since about 50% of CPU is spent in constant Full GCs before the connection gets closed and the memory is released.
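   For illustration, the suspected pattern looks roughly like this (a simplified sketch; only the map name comes from the real code, the types and method are made up):

   ```java
   import java.util.concurrent.CompletableFuture;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.ConcurrentMap;

   // Simplified illustration of the suspected leak, not the actual Pulsar client code.
   // The pending map is keyed by request id and an entry is only removed when the broker
   // replies. The callbacks that the ConsumerImpl chains on the future capture the
   // ConsumerImpl, so an entry that is never completed pins the ConsumerImpl in memory
   // even after the Reader/Consumer has been closed.
   class PendingRequestLeakSketch {
       private final ConcurrentMap<Long, CompletableFuture<Object>> pendingGetLastMessageIdRequests =
               new ConcurrentHashMap<>();

       CompletableFuture<Object> newGetLastMessageIdRequest(long requestId) {
           CompletableFuture<Object> future = new CompletableFuture<>();
           pendingGetLastMessageIdRequests.put(requestId, future);
           // ... the GetLastMessageId command is written to the broker here ...
           // The only removal happens in the broker-response handler, roughly:
           //   pendingGetLastMessageIdRequests.remove(requestId).complete(result);
           // Nothing removes the entry when the consumer is closed or when the broker
           // never responds, which is the leak described above.
           return future;
       }
   }
   ```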
   
   ## Current behavior
   
   * Using the Reader API for a lot of operations under heavy load causes the client's memory consumption to grow until there is an OOM.
   
   ## Expected behavior
   
   * When a Consumer or Reader is closed, all related resources should be removed and cleaned up so that memory isn't leaked. No references should be held to the closed Consumer or Reader instance. Currently the pendingGetLastMessageIdRequests map holds references to the ConsumerImpl instances.
   * When the server doesn't reply to a get last message id request, there should be timeout handling that completes the future held in pendingGetLastMessageIdRequests (a sketch of such a cleanup follows this list).
   * When the system is under heavy load, there should be proper backpressure for the Reader API so that the system doesn't break down under that load. Backpressure can take the form of rejecting requests. Some type of backpressure is necessary so that an application using the Reader API can reject requests from its own clients and end-to-end backpressure is in place. I assume the design of the Pulsar Client already handles this in general; the expectation is that the Reader API also has backpressure in one form or another.
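   As an illustration of the timeout point above, a cleanup could be attached to each pending future so that the map entry is always removed eventually. This is only a sketch under the assumption that the map is a plain ConcurrentMap keyed by request id; orTimeout requires Java 9+, and a scheduled cleanup task like the ones used for pendingLookupRequests/pendingRequests would work just as well:

   ```java
   import java.util.concurrent.CompletableFuture;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.ConcurrentMap;
   import java.util.concurrent.TimeUnit;

   // Sketch only, not the actual ClientCnx implementation. The idea: every pending
   // get last message id future is (a) failed after the operation timeout if the broker
   // never replies and (b) removed from the pending map on any completion, so a closed
   // Reader/ConsumerImpl is no longer pinned by the map.
   class PendingRequestTimeoutSketch {
       private final ConcurrentMap<Long, CompletableFuture<Object>> pendingGetLastMessageIdRequests =
               new ConcurrentHashMap<>();
       private final long operationTimeoutMs = 30_000; // assumed operation timeout

       CompletableFuture<Object> newGetLastMessageIdRequest(long requestId) {
           CompletableFuture<Object> future = new CompletableFuture<>();
           pendingGetLastMessageIdRequests.put(requestId, future);

           // orTimeout (Java 9+) completes the future exceptionally with a TimeoutException
           // if the broker doesn't respond in time; the cleanup callback then drops the map
           // entry on success, failure or timeout alike.
           future.orTimeout(operationTimeoutMs, TimeUnit.MILLISECONDS)
                 .whenComplete((value, ex) -> pendingGetLastMessageIdRequests.remove(requestId, future));

           return future;
       }

       // The consumer close path could additionally fail and remove its own pending entries:
       void failPendingRequest(long requestId, Throwable reason) {
           CompletableFuture<Object> future = pendingGetLastMessageIdRequests.remove(requestId);
           if (future != null) {
               future.completeExceptionally(reason);
           }
       }
   }
   ```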
   
   Pulsar Client version: 2.6.1
   Java 11.0.7

