devinbost commented on issue #6054:
URL: https://github.com/apache/pulsar/issues/6054#issuecomment-875106224


   After spending many hours digging through the ack paths and not finding any 
issues, I took another look at the client (`ProducerImpl`), and I noticed 
something interesting in the heap dump.
   `ProducerImpl` is in a `Connecting` state, so it's waiting to get the 
connection.
   However, `[ProducerImpl].connectedSince` says it's been connected for about 
3 days. 
   If it's in a `Connecting` state now, that implies it was disconnected at 
some point after that connection was established. 
   
   If there's a connectivity issue, that would explain why this bug has been so 
hard to reproduce and why it can't be reproduced locally. It could be that some 
network hardware is doing something weird with the connection, and the client 
doesn't handle it correctly: it gets stuck in a `Connecting` state without 
trying to re-establish the connection, so it just waits forever for the 
connection to come up. 
   In the `Connecting` state, `ProducerImpl.isValidProducerState(..)` (checked 
in `ProducerImpl.sendAsync(..)`) returns true even though the connection hasn't 
been established yet. When that method returns true, the producer is allowed 
to enqueue messages.
   Sure enough, `[ProducerImpl].connectionHandler.state.pendingMessages` 
contains exactly 1000 `OpSendMsg` entries, which are the ones blocking the 
semaphore. (The semaphore blocks once 1000 un-acked messages accumulate.)
   It also has exactly 1000 null entries for `pendingCallbacks`, but I'm not 
sure if that means anything.
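
   To make the suspected failure mode concrete, here is a minimal toy model. It is not the real `ProducerImpl`: the class `StuckProducerSketch` and the methods `trySend` and `ackOne` are hypothetical names made up for illustration, and it uses `tryAcquire()` instead of a blocking acquire so the demo terminates instead of hanging. It only assumes what is described above: sends are accepted while the state is `Connecting`, each pending message holds one semaphore permit, and permits are only released on ack.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Semaphore;

/** Toy model only; NOT the real Pulsar ProducerImpl. All names are illustrative. */
class StuckProducerSketch {
    enum State { Connecting, Ready, Closed }

    private final Semaphore permits;                        // one permit per un-acked message
    private final Queue<String> pendingMessages = new ArrayDeque<>();
    private State state = State.Connecting;                 // assumed: stuck here forever

    StuckProducerSketch(int maxPendingMessages) {
        this.permits = new Semaphore(maxPendingMessages);
    }

    /** Mirrors the observation above: Connecting counts as a valid state for sending. */
    synchronized boolean isValidProducerState() {
        return state == State.Connecting || state == State.Ready;
    }

    /** Returns false where a blocking producer would wait on the semaphore. */
    synchronized boolean trySend(String msg) {
        if (!isValidProducerState()) {
            return false;
        }
        if (!permits.tryAcquire()) {
            return false;                                   // queue is full, nothing is acking
        }
        pendingMessages.add(msg);
        return true;
    }

    /** Only an ack frees a permit; with no connection, this is never called. */
    synchronized void ackOne() {
        if (pendingMessages.poll() != null) {
            permits.release();
        }
    }

    synchronized int pendingCount() {
        return pendingMessages.size();
    }

    public static void main(String[] args) {
        StuckProducerSketch producer = new StuckProducerSketch(1000);
        int accepted = 0;
        for (int i = 0; i < 1000; i++) {
            if (producer.trySend("msg-" + i)) {
                accepted++;
            }
        }
        System.out.println("accepted=" + accepted);                                // accepted=1000
        System.out.println("next send accepted=" + producer.trySend("msg-1000"));  // false
        System.out.println("pending=" + producer.pendingCount());                  // pending=1000
    }
}
```

   In this model, the 1001st send can never succeed unless something calls `ackOne()`, which matches a heap dump showing exactly 1000 pending entries and a producer still in `Connecting`.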

