sijie commented on issue #6173: Log compaction fails due to timeout
URL: https://github.com/apache/pulsar/issues/6173#issuecomment-582524154

(looping in @codelipenghui on this thread, since he has good experience dealing with permit-related issues in production)

@fantapsody regarding question 3) that you posted in https://github.com/apache/pulsar/issues/6173#issuecomment-582480075, it does look like a flow-control problem. I think the main problem is here:

https://github.com/apache/pulsar/blob/402ecec9c3731711fd1bf700c30d09678aa9b1e5/pulsar-broker/src/main/java/org/apache/pulsar/client/impl/RawReaderImpl.java#L144

The "permits" in Pulsar's flow-control logic are designed to count "messages", not "batches". That is, one permit per message. The broker decrements the permits after it dispatches messages. It is okay to see negative "permits" on the broker side, because that only means the broker dispatched more messages than the consumer requested. However, it is NOT okay if the consumer counts permits by batches.

For example, a consumer/reader requests 100 permits (the receiver queue size) and the broker dispatches 1000 "messages" (10 messages per batch). Hence the "permits" on the broker side become -900. Since RawReaderImpl treats one permit as one batch, it requests 50 more permits from the broker once the available permits on the reader side reach the threshold (which is half of the receiver queue size). The broker receives the 50 permits and increases its available permits to -850, which is still negative. So the broker will not read any more messages, and the raw reader gets stuck waiting for new messages.

Problem 3) can be addressed by fixing https://github.com/apache/pulsar/blob/402ecec9c3731711fd1bf700c30d09678aa9b1e5/pulsar-broker/src/main/java/org/apache/pulsar/client/impl/RawReaderImpl.java#L144 to use the number of messages as permits.

---

Besides this problem, @codelipenghui has seen a lot of "consumer stuck" problems related to permits. We have been discussing how to introduce a self-healing mechanism into the current flow-control logic to avoid similar mistakes in client implementations. I think we should discuss a more generic approach to improving flow control after fixing this issue.
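
To make the permit arithmetic from the example above concrete, here is a minimal sketch that only replays the numbers (receiver queue size 100, 10 messages per batch, 100 batches dispatched). It is not Pulsar code: the class `PermitSketch` and its methods are hypothetical names made up for illustration, and the model assumes the broker subtracts one permit per dispatched message and simply adds back whatever the reader sends.

```java
// Hypothetical model of the permit accounting; none of these names exist in Pulsar.
final class PermitSketch {

    /** Broker-side permits after dispatching: one permit is spent per message. */
    static int brokerPermitsAfterDispatch(int grantedPermits, int batches, int messagesPerBatch) {
        return grantedPermits - batches * messagesPerBatch;
    }

    /** Permits the reader sends back after consuming the given number of batches. */
    static int refill(int batchesConsumed, int messagesPerBatch, boolean countByMessage) {
        return countByMessage ? batchesConsumed * messagesPerBatch : batchesConsumed;
    }

    public static void main(String[] args) {
        int receiverQueueSize = 100;   // initial permits requested by the reader
        int messagesPerBatch = 10;
        int dispatchedBatches = 100;   // 1000 messages in total

        int broker = brokerPermitsAfterDispatch(receiverQueueSize, dispatchedBatches, messagesPerBatch);
        System.out.println("after dispatch: " + broker);                                   // -900

        // First refill at the threshold (half the queue size), counted per batch.
        int firstRefill = refill(receiverQueueSize / 2, messagesPerBatch, false);
        System.out.println("after first per-batch refill: " + (broker + firstRefill));     // -850

        // Even after the reader drains every dispatched batch, per-batch counting stays negative.
        int perBatch = broker + refill(dispatchedBatches, messagesPerBatch, false);
        System.out.println("drained, per-batch counting: " + perBatch);                    // -800

        // Counting one permit per message brings the broker back to a positive balance.
        int perMessage = broker + refill(dispatchedBatches, messagesPerBatch, true);
        System.out.println("drained, per-message counting: " + perMessage);                // 100
    }
}
```

Under these assumptions, per-batch counting leaves the broker at -800 even once the reader has consumed everything it was sent, which is why the reader never receives another message. Per-message counting returns the balance to 100.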
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
