lhotari opened a new issue, #23200: URL: https://github.com/apache/pulsar/issues/23200
### Search before asking

- [X] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.

### Read release policy

- [X] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

### Version

This issue is present in master, branch-3.0 and branch-3.3. It seems to be present in all released Pulsar versions since #7105.

### Minimal reproduce step

Start Pulsar 3.3.1 in Docker:
```
docker run --name pulsar-standalone --rm -it -e PULSAR_STANDALONE_USE_ZOOKEEPER=1 -p 8080:8080 -p 6650:6650 apachepulsar/pulsar:3.3.1 /pulsar/bin/pulsar standalone -nss -nfw
```

Clone, compile & run the test app ([source code](https://github.com/lhotari/pulsar-playground/blob/key_shared_issue-2024-08-19/src/main/java/com/github/lhotari/pulsar/playground/TestScenarioIssueKeyShared.java)) that reproduces the issue:
```
git clone -b key_shared_issue-2024-08-19 --depth=1 https://github.com/lhotari/pulsar-playground
cd pulsar-playground
./gradlew build
java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioIssueKeyShared
```

The test takes over 7 minutes to run and the final log lines show the results:
```
2024-08-19T15:31:35,983+0300 [main] INFO playground.TestScenarioIssueKeyShared - Done receiving. Remaining: 0 duplicates: 0 unique: 1000000 max latency difference of subsequent messages: 53071 ms max ack holes: 27103
2024-08-19T15:31:35,983+0300 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer1 received 259642 unique messages 0 duplicates in 445 s, max latency difference of subsequent messages 52959 ms
2024-08-19T15:31:35,983+0300 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer2 received 233963 unique messages 0 duplicates in 417 s, max latency difference of subsequent messages 50930 ms
2024-08-19T15:31:35,983+0300 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer3 received 244279 unique messages 0 duplicates in 437 s, max latency difference of subsequent messages 53071 ms
2024-08-19T15:31:35,983+0300 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer4 received 262116 unique messages 0 duplicates in 443 s, max latency difference of subsequent messages 52651 ms
```

The test tracked a 53071 ms latency spike between two subsequent messages. The maximum value of `totalNonContiguousDeletedMessagesRange` observed during the test was 27103, which is more than the default value of `managedLedgerMaxUnackedRangesToPersist`.

The test uses a random namespace name in the format `test_ns1723788349510`. Use this command to find out the namespace name:
```
pulsar-admin namespaces list public
```

After that, you can check the stats and stats-internal:
```
pulsar-admin topics stats-internal persistent://public/test_ns1723788349510/test
pulsar-admin topics stats persistent://public/test_ns1723788349510/test
```

There's a large amount of "ack holes" seen in the `totalNonContiguousDeletedMessagesRange` stats metric of the subscription `sub`.
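The same value can also be polled programmatically while the test runs. Below is a minimal sketch using the Java admin client; the service URL, namespace and topic are assumptions matching the standalone setup above, the field name matches the `stats-internal` output, and this is not part of the repro app itself:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.PersistentTopicInternalStats;

public class AckHoleWatcher {
    public static void main(String[] args) throws Exception {
        // Assumed values matching the standalone setup above; replace the namespace
        // with the generated one (e.g. test_ns1723788349510).
        String topic = "persistent://public/test_ns1723788349510/test";
        String subscription = "sub";

        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            // Poll the internal stats periodically while the test is running
            for (int i = 0; i < 60; i++) {
                PersistentTopicInternalStats stats = admin.topics().getInternalStats(topic);
                PersistentTopicInternalStats.CursorStats cursor = stats.cursors.get(subscription);
                if (cursor != null) {
                    // The number of "ack holes" tracked by the managed cursor
                    System.out.println("totalNonContiguousDeletedMessagesRange = "
                            + cursor.totalNonContiguousDeletedMessagesRange);
                }
                Thread.sleep(5_000);
            }
        }
    }
}
```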
### What did you expect to see?

In an application that uses a Key_Shared subscription:
* messages should be delivered without unnecessary delays.
* there shouldn't be a large amount of "ack holes" (`totalNonContiguousDeletedMessagesRange`) in the subscription, and certainly not so many that it exceeds the default `managedLedgerMaxUnackedRangesToPersist` of 10000.

The test application, https://github.com/lhotari/pulsar-playground/blob/key_shared_issue-2024-08-19/src/main/java/com/github/lhotari/pulsar/playground/TestScenarioIssueKeyShared.java (a simplified sketch of its consume loop is shown further below):
* acknowledges all messages in the order they are received
* adds a random 100 ms processing delay to each message with 3% probability (this seems to be necessary to amplify the problem)

### What did you see instead?

In an application using a Key_Shared subscription:
* messages got delivered late without a proper reason. In the repro app, a latency spike of over 50 seconds shows up when comparing the end-to-end latency (from producing to consuming) of subsequent messages.
* there were a lot of "ack holes": `totalNonContiguousDeletedMessagesRange` could exceed `managedLedgerMaxUnackedRangesToPersist`, which means that the subscription state would be lost on a broker restart or a topic load balancing event.

### Anything else?

This problem seems to reproduce only in backlogged cases, where there's already an existing backlog or the consumers aren't able to keep up with the producer. The problem resolves after the consumers catch up, but in the intermediate state messages get delivered late and `totalNonContiguousDeletedMessagesRange` can exceed `managedLedgerMaxUnackedRangesToPersist` during the catch-up period. This seems to be completely unnecessary.

I made an experiment where I reverted some of the #7105 changes in this commit: https://github.com/lhotari/pulsar/commit/5665b112e3d477c2afd49cd1b21f0fb35defc1e3. Here's how to build and run pulsar-standalone for this experiment (instead of running pulsar-standalone in Docker):
```
git clone -b lh-key-shared-experiment-2024-08-19 --depth=2 --single-branch https://github.com/lhotari/pulsar
cd pulsar
mvn -Pcore-modules,-main -T 1C clean install -DskipTests -Dspotbugs.skip=true -Dcheckstyle.skip=true -Dlicense.skip=true -DnarPluginPhase=none
rm -rf data
PULSAR_STANDALONE_USE_ZOOKEEPER=1 bin/pulsar standalone -nss -nfw 2>&1 | tee standalone.log
```

After this, it's possible to run `java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioIssueKeyShared` as explained before.

The results for this experiment:
```
2024-08-19T15:48:49,718+0300 [main] INFO playground.TestScenarioIssueKeyShared - Done receiving. Remaining: 0 duplicates: 0 unique: 1000000 max latency difference of subsequent messages: 922 ms max ack holes: 711
2024-08-19T15:48:49,718+0300 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer1 received 259642 unique messages 0 duplicates in 449 s, max latency difference of subsequent messages 606 ms
2024-08-19T15:48:49,718+0300 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer2 received 233963 unique messages 0 duplicates in 449 s, max latency difference of subsequent messages 922 ms
2024-08-19T15:48:49,718+0300 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer3 received 244279 unique messages 0 duplicates in 449 s, max latency difference of subsequent messages 835 ms
2024-08-19T15:48:49,718+0300 [main] INFO playground.TestScenarioIssueKeyShared - Consumer consumer4 received 262116 unique messages 0 duplicates in 450 s, max latency difference of subsequent messages 716 ms
```

The observed problem is resolved in the experiment.
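Both runs use the same consumer logic; only the broker-side code differs. For reference, the consume loop in the test application has roughly the following shape. This is a simplified, hypothetical sketch under the assumptions of the repro setup above (topic, subscription name `sub`, local standalone broker), not the exact code; the real application linked above runs four consumers and records the statistics shown in the logs:

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class KeySharedConsumerSketch {
    public static void main(String[] args) throws Exception {
        // Assumed values matching the repro setup above
        String topic = "persistent://public/test_ns1723788349510/test";

        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
             Consumer<byte[]> consumer = client.newConsumer()
                     .topic(topic)
                     .subscriptionName("sub")
                     .subscriptionType(SubscriptionType.Key_Shared)
                     .subscribe()) {
            while (true) {
                Message<byte[]> msg = consumer.receive();
                // Simulated processing: roughly 3% of messages take an extra 100 ms
                if (ThreadLocalRandom.current().nextInt(100) < 3) {
                    Thread.sleep(100);
                }
                // Messages are acknowledged in the order they were received
                consumer.acknowledge(msg);
            }
        }
    }
}
```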
In the experiment, the maximum latency spike is 922 ms compared to 53071 ms. There is no ack hole problem either, since the maximum number of observed ack holes (`totalNonContiguousDeletedMessagesRange`) is 711 compared to 27103. However, it's likely that the correct fix is a broader change in how replays are handled in Key_Shared subscriptions.

It's possible that the issue observed in this report is related to #21199 and #21657 / #21656. The PR #21656 is related to replays with Key_Shared subscriptions.

### Are you willing to submit a PR?

- [ ] I'm willing to submit a PR!
