lhotari opened a new pull request, #23226: URL: https://github.com/apache/pulsar/pull/23226
Main Issue: #23200

### Motivation

Key_Shared currently has a clear problem: in normal operation it causes a lot of "ack holes", which result in several problems. One of them is the latency issue explained in #23200. Another is that the large number of "ack holes" exceeds `managedLedgerMaxUnackedRangesToPersist` (10000) in ordinary cases such as the demonstration in #23200. A large number of "ack holes" has also shown up in multiple other issues where Pulsar users have experienced problems. One of the previous mitigations is [PIP-299: Stop dispatch messages if the individual acks will be lost in the persistent storage.](https://github.com/apache/pulsar/blob/master/pip/pip-299.md). The need for PIP-299 proves that a large number of "ack holes" is a fairly common problem.

### Modifications

While experimenting on #23200, I determined that the changes in #7105 were related to the cause of the issue. I also noticed that #18315 contained some impactful changes (https://github.com/apache/pulsar/pull/18315/files#diff-c48d5c94842ac8c9a0c9314b207298069f38c8dcfeda4a9886fb3bb1f77843f2). Based on this information, I decided to implement a solution that backs off when no messages are dispatched.

This PR reschedules the call to `readMoreEntries` with a delay that increases exponentially as long as no entries are dispatched. The backoff delay starts at 100ms and is capped at 5000ms. These values are currently static, but they could be made configurable.

### Additional context

While testing this change, I happened to notice that it mitigates the problem in the reproducer of #23200. With the changes in this PR, these are the results:

```
2024-08-26T16:09:42,328+0300 [main] INFO playground.TestScenarioIssueKeyShared - Done receiving. Remaining: 0 duplicates: 0 unique: 1000000 max latency difference of subsequent messages: 974 ms max ack holes: 668
2024-08-26T16:09:42,329+0300 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer1 received 259642 unique messages 0 duplicates in 456 s, max latency difference of subsequent messages 763 ms
2024-08-26T16:09:42,329+0300 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer2 received 233963 unique messages 0 duplicates in 456 s, max latency difference of subsequent messages 974 ms
2024-08-26T16:09:42,329+0300 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer3 received 244279 unique messages 0 duplicates in 457 s, max latency difference of subsequent messages 898 ms
2024-08-26T16:09:42,329+0300 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer4 received 262116 unique messages 0 duplicates in 456 s, max latency difference of subsequent messages 657 ms
```

### Documentation

<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->

- [ ] `doc` <!-- Your PR contains doc changes. -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [x] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->
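
### Illustrative sketch of the backoff

The backoff described under Modifications can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the class name `DispatchBackoffSketch`, its methods, and the doubling factor are hypothetical. Only the 100ms starting delay and the 5000ms cap come from the description above.

```java
/** Hypothetical helper for illustration; not a class from the Pulsar code base. */
final class DispatchBackoffSketch {
    private final long initialDelayMs;
    private final long maxDelayMs;
    private long nextDelayMs;

    DispatchBackoffSketch(long initialDelayMs, long maxDelayMs) {
        if (initialDelayMs <= 0 || maxDelayMs < initialDelayMs) {
            throw new IllegalArgumentException("expected 0 < initialDelayMs <= maxDelayMs");
        }
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.nextDelayMs = initialDelayMs;
    }

    /** Returns the delay to use for the next rescheduled read, then grows the delay. */
    long next() {
        long current = nextDelayMs;
        // Assumption: the delay doubles each time. The PR description only says
        // that it increases exponentially and is limited to a maximum.
        nextDelayMs = Math.min(maxDelayMs, nextDelayMs * 2);
        return current;
    }

    /** Call once entries are dispatched again so the next idle period starts small. */
    void reset() {
        nextDelayMs = initialDelayMs;
    }

    public static void main(String[] args) {
        DispatchBackoffSketch backoff = new DispatchBackoffSketch(100, 5000);
        // Simulate eight consecutive reads that dispatched nothing.
        for (int i = 0; i < 8; i++) {
            System.out.println(backoff.next() + " ms");
        }
    }
}
```

In a dispatcher, the idea would be to call `next()` whenever a read dispatched no entries and use the result as the delay before rescheduling `readMoreEntries`, and to call `reset()` as soon as entries are dispatched again. With the doubling assumption, the delays are 100, 200, 400, 800, 1600 and 3200 ms, after which they stay at the 5000 ms cap.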
