sijie commented on issue #6173: Log compaction fails due to timeout
URL: https://github.com/apache/pulsar/issues/6173#issuecomment-582524154
 
 
   (Looping in @codelipenghui, since he has a lot of experience dealing with 
permit-related issues in production.)
   
   @fantapsody 
   
   Regarding question 3) that you posted in 
https://github.com/apache/pulsar/issues/6173#issuecomment-582480075, it does 
look like a flow-control problem.
   
   I think the main problem is here: 
https://github.com/apache/pulsar/blob/402ecec9c3731711fd1bf700c30d09678aa9b1e5/pulsar-broker/src/main/java/org/apache/pulsar/client/impl/RawReaderImpl.java#L144
   
   The "permits" in Pulsar's flow-control logic are counted in "messages", not 
"batches": one permit per message. The broker therefore decrements the permits 
by the number of messages it has dispatched. It is okay to see negative 
"permits" on the broker side; it only means that the broker dispatched more 
messages than the consumer requested. 
   
   However, it is NOT okay if the consumer counts permits in batches. For 
example, a consumer/reader requests 100 permits (the receiver queue size) and 
the broker dispatches 1000 "messages" (100 batches of 10 messages each). The 
"permits" on the broker side then drop to -900. Since RawReaderImpl counts one 
permit per batch, it requests only 50 more permits from the broker once the 
available permits on the reader side reach the threshold (which is half of the 
receiver queue size).
   
   The broker receives the 50 permits and raises its available permits to 
-850, which is still negative. So the broker will not read any more messages, 
and the raw reader gets stuck waiting for new messages.
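
   To make the arithmetic above concrete, here is a small toy model of the two 
accounting schemes. It is only a sketch, not Pulsar code: the class and method 
names are hypothetical, and it assumes the broker dispatches as many batches as 
the receiver queue size in one round and that the reader sends a refill as soon 
as its freed permits reach half of the receiver queue size.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the permit accounting described above; hypothetical, not Pulsar code. */
final class PermitAccountingSketch {

    /**
     * Returns the broker-side permit count after each refill the reader sends
     * while it drains a single dispatch round.
     */
    static List<Integer> brokerPermitsAfterEachRefill(
            int receiverQueueSize, int messagesPerBatch, boolean readerCountsMessages) {
        int refillThreshold = receiverQueueSize / 2;

        // Initial flow request, then the broker dispatches receiverQueueSize batches
        // (assumption taken from the example) and charges one permit per message.
        int brokerPermits = receiverQueueSize;
        int batchesDispatched = receiverQueueSize;
        brokerPermits -= batchesDispatched * messagesPerBatch;

        List<Integer> history = new ArrayList<>();
        int readerPending = 0; // permits the reader has freed but not yet sent back
        for (int batch = 0; batch < batchesDispatched; batch++) {
            // The bug: one permit per batch. The fix: one permit per message in the batch.
            readerPending += readerCountsMessages ? messagesPerBatch : 1;
            if (readerPending >= refillThreshold) {
                brokerPermits += readerPending; // reader sends a refill to the broker
                readerPending = 0;
                history.add(brokerPermits);
            }
        }
        return history;
    }

    public static void main(String[] args) {
        System.out.println("per batch:   " + brokerPermitsAfterEachRefill(100, 10, false));
        System.out.println("per message: " + brokerPermitsAfterEachRefill(100, 10, true));
    }
}
```

   With the numbers from the example (queue size 100, 10 messages per batch), 
the per-batch scheme reaches -850 after the first refill and only -800 after 
the reader has drained everything, so the broker never dispatches again. With 
per-message counting, the same round brings the broker back to 100 permits.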
   
   Problem 3) can be addressed by fixing 
https://github.com/apache/pulsar/blob/402ecec9c3731711fd1bf700c30d09678aa9b1e5/pulsar-broker/src/main/java/org/apache/pulsar/client/impl/RawReaderImpl.java#L144
 to use the number of messages as permits.
   
   ---
   
   Besides this problem, @codelipenghui has seen a lot of "consumer stuck" 
problems related to permits. We have been discussing how to introduce a 
self-healing mechanism into the current flow-control logic to guard against 
similar mistakes in client implementations. I think we should discuss a more 
generic approach to improving flow control after fixing this issue here.

