lhotari opened a new issue, #23200:
URL: https://github.com/apache/pulsar/issues/23200

   ### Search before asking
   
   - [X] I searched in the [issues](https://github.com/apache/pulsar/issues) 
and found nothing similar.
   
   
   ### Read release policy
   
   - [X] I understand that unsupported versions don't get bug fixes. I will 
attempt to reproduce the issue on a supported version of Pulsar client and 
Pulsar broker.
   
   
   ### Version
   
   This issue is present in the master, branch-3.0, and branch-3.3 branches.
   It seems to be present in all released Pulsar versions since #7105.
   
   ### Minimal reproduce step
   
   Start Pulsar 3.3.1 in Docker:
   ```
   docker run --name pulsar-standalone --rm -it -e PULSAR_STANDALONE_USE_ZOOKEEPER=1 -p 8080:8080 -p 6650:6650 apachepulsar/pulsar:3.3.1 /pulsar/bin/pulsar standalone -nss -nfw
   ```
   
   Clone, compile & run the test app ([source code](https://github.com/lhotari/pulsar-playground/blob/key_shared_issue-2024-08-19/src/main/java/com/github/lhotari/pulsar/playground/TestScenarioIssueKeyShared.java)) that reproduces the issue:
   ```
   git clone -b key_shared_issue-2024-08-19 --depth=1 https://github.com/lhotari/pulsar-playground
   cd pulsar-playground
   ./gradlew build
   java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioIssueKeyShared
   ```
   
   The test takes over 7 minutes to run and the final log lines show the 
results:
   ```
   2024-08-19T15:31:35,983+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Done receiving. Remaining: 0 duplicates: 0 unique: 1000000
   max latency difference of subsequent messages: 53071 ms
   max ack holes: 27103
   2024-08-19T15:31:35,983+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer1 received 259642 unique messages 0 duplicates in 445 s, max latency difference of subsequent messages 52959 ms
   2024-08-19T15:31:35,983+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer2 received 233963 unique messages 0 duplicates in 417 s, max latency difference of subsequent messages 50930 ms
   2024-08-19T15:31:35,983+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer3 received 244279 unique messages 0 duplicates in 437 s, max latency difference of subsequent messages 53071 ms
   2024-08-19T15:31:35,983+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer4 received 262116 unique messages 0 duplicates in 443 s, max latency difference of subsequent messages 52651 ms
   ```
   
   The test tracked a 53071 ms latency spike between two subsequent messages.
   The maximum value of `totalNonContiguousDeletedMessagesRange` observed during the test was 27103, which exceeds the default value of `managedLedgerMaxUnackedRangesToPersist` (10000).
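   For clarity, the "max latency difference of subsequent messages" metric is the largest jump in end-to-end latency (receive time minus publish time) between two consecutively received messages. A minimal sketch of such a measurement with the Pulsar Java client, assuming the metric is computed this way (the class, method, and variable names are illustrative, not taken from the test app):
   ```java
   import org.apache.pulsar.client.api.Consumer;
   import org.apache.pulsar.client.api.Message;

   class LatencyJumpTracker {
       // Sketch: returns the largest jump in end-to-end latency between
       // consecutive messages. Message#getPublishTime() returns the publish
       // timestamp in epoch milliseconds.
       static long maxLatencyDifferenceMs(Consumer<byte[]> consumer, int messagesToReceive)
               throws Exception {
           long previousLatencyMs = -1;
           long maxDifferenceMs = 0;
           for (int i = 0; i < messagesToReceive; i++) {
               Message<byte[]> msg = consumer.receive();
               long latencyMs = System.currentTimeMillis() - msg.getPublishTime();
               if (previousLatencyMs >= 0) {
                   maxDifferenceMs = Math.max(maxDifferenceMs,
                           Math.abs(latencyMs - previousLatencyMs));
               }
               previousLatencyMs = latencyMs;
               consumer.acknowledge(msg);
           }
           return maxDifferenceMs;
       }
   }
   ```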
   
   
   The test uses a random namespace name in the format `test_ns1723788349510`. Use this command to find out the namespace name:
   ```
   pulsar-admin namespaces list public
   ```
   
   After that, you can check the stats and stats-internal:
   ```
   pulsar-admin topics stats-internal persistent://public/test_ns1723788349510/test
   pulsar-admin topics stats persistent://public/test_ns1723788349510/test
   ```
   
   There's a large number of "ack holes" visible in the `totalNonContiguousDeletedMessagesRange` stats metric of the subscription `sub`.
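   The same count can also be read programmatically with the Java admin client; a minimal sketch, assuming the `cursors` map in the internal stats is keyed by the subscription name (`sub`) and that the namespace matches the one listed above (the class name is illustrative):
   ```java
   import org.apache.pulsar.client.admin.PulsarAdmin;
   import org.apache.pulsar.common.policies.data.PersistentTopicInternalStats;

   public class CheckAckHoles {
       public static void main(String[] args) throws Exception {
           try (PulsarAdmin admin = PulsarAdmin.builder()
                   .serviceHttpUrl("http://localhost:8080")
                   .build()) {
               // replace the namespace with the one printed by `pulsar-admin namespaces list public`
               PersistentTopicInternalStats stats = admin.topics()
                       .getInternalStats("persistent://public/test_ns1723788349510/test");
               // totalNonContiguousDeletedMessagesRange is the "ack holes" count of the cursor
               System.out.println("ack holes: "
                       + stats.cursors.get("sub").totalNonContiguousDeletedMessagesRange);
           }
       }
   }
   ```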
   
   ### What did you expect to see?
   
   In an application that uses a Key_Shared subscription:
   * messages should be delivered without unnecessary delays.
   * there shouldn't be a large number of "ack holes" (`totalNonContiguousDeletedMessagesRange`) in the subscription, let alone enough of them to exceed the default `managedLedgerMaxUnackedRangesToPersist` of 10000
   
   The test application, https://github.com/lhotari/pulsar-playground/blob/key_shared_issue-2024-08-19/src/main/java/com/github/lhotari/pulsar/playground/TestScenarioIssueKeyShared.java
   * acknowledges all messages in the order they were received
   * adds a random 100 ms processing delay to each message with 3% probability (this seems to be necessary to amplify the problem); see the sketch below
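   In essence, each consumer in the test app follows roughly the pattern in this minimal sketch (this is not the actual test code; only the subscription type, the ordered acknowledgement, and the probabilistic delay are taken from the description above):
   ```java
   import java.util.concurrent.ThreadLocalRandom;
   import org.apache.pulsar.client.api.Consumer;
   import org.apache.pulsar.client.api.Message;
   import org.apache.pulsar.client.api.PulsarClient;
   import org.apache.pulsar.client.api.SubscriptionType;

   public class KeySharedConsumerSketch {
       public static void main(String[] args) throws Exception {
           try (PulsarClient client = PulsarClient.builder()
                   .serviceUrl("pulsar://localhost:6650")
                   .build();
                Consumer<byte[]> consumer = client.newConsumer()
                        .topic("persistent://public/test_ns1723788349510/test")
                        .subscriptionName("sub")
                        .subscriptionType(SubscriptionType.Key_Shared)
                        .subscribe()) {
               while (true) { // runs until interrupted
                   Message<byte[]> msg = consumer.receive();
                   // ~3% of messages get an extra 100 ms of processing time,
                   // which amplifies the problem
                   if (ThreadLocalRandom.current().nextInt(100) < 3) {
                       Thread.sleep(100);
                   }
                   // acknowledge in the order the messages were received
                   consumer.acknowledge(msg);
               }
           }
       }
   }
   ```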
   
   
   
   
   ### What did you see instead?
   
   In an application using a Key_Shared subscription:
   * messages got delivered late without a proper reason. In the repro app, a latency spike of over 50 seconds was observed when comparing the end-to-end latency (from producing to consuming) of subsequent messages.
   * there were a lot of "ack holes": `totalNonContiguousDeletedMessagesRange` could exceed `managedLedgerMaxUnackedRangesToPersist`, which means that the subscription state would be lost on a broker restart or a topic load balancing event
   
   ### Anything else?
   
   This problem seems to reproduce only when there's already an existing backlog or when the consumers aren't able to keep up with the producer. The problem resolves after the consumers catch up, but in the intermediate state messages get delivered late and `totalNonContiguousDeletedMessagesRange` can exceed `managedLedgerMaxUnackedRangesToPersist` during the catch-up period. This seems to be completely unnecessary.
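   For reference, the backlog precondition can be created by producing the keyed messages before the consumers start draining them; a rough sketch (the message count and key cardinality here are arbitrary, not taken from the test app):
   ```java
   import java.nio.charset.StandardCharsets;
   import org.apache.pulsar.client.api.Producer;
   import org.apache.pulsar.client.api.PulsarClient;

   public class BacklogProducerSketch {
       public static void main(String[] args) throws Exception {
           try (PulsarClient client = PulsarClient.builder()
                   .serviceUrl("pulsar://localhost:6650")
                   .build();
                Producer<byte[]> producer = client.newProducer()
                        .topic("persistent://public/test_ns1723788349510/test")
                        // apply backpressure instead of failing when the pending queue is full
                        .blockIfQueueFull(true)
                        .create()) {
               for (int i = 0; i < 1_000_000; i++) {
                   producer.newMessage()
                           .key(String.valueOf(i % 1000)) // spread messages over many keys
                           .value(String.valueOf(i).getBytes(StandardCharsets.UTF_8))
                           .sendAsync();
               }
               producer.flush(); // wait until pending messages have been sent
           }
       }
   }
   ```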
   
   I ran an experiment where I reverted some of the #7105 changes in this commit: https://github.com/lhotari/pulsar/commit/5665b112e3d477c2afd49cd1b21f0fb35defc1e3.
   
   Here's how to build and run pulsar-standalone for this experiment (instead of running pulsar-standalone in Docker):
   ```
   git clone -b lh-key-shared-experiment-2024-08-19 --depth=2 --single-branch https://github.com/lhotari/pulsar
   cd pulsar
   mvn -Pcore-modules,-main -T 1C clean install -DskipTests -Dspotbugs.skip=true -Dcheckstyle.skip=true -Dlicense.skip=true -DnarPluginPhase=none
   rm -rf data
   PULSAR_STANDALONE_USE_ZOOKEEPER=1 bin/pulsar standalone -nss -nfw 2>&1 | tee standalone.log
   ```
   
   After this, it's possible to run `java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioIssueKeyShared` as explained before.
   
   The results for this experiment:
   ```
   2024-08-19T15:48:49,718+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Done receiving. Remaining: 0 duplicates: 0 unique: 1000000
   max latency difference of subsequent messages: 922 ms
   max ack holes: 711
   2024-08-19T15:48:49,718+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer1 received 259642 unique messages 0 duplicates in 449 s, max latency difference of subsequent messages 606 ms
   2024-08-19T15:48:49,718+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer2 received 233963 unique messages 0 duplicates in 449 s, max latency difference of subsequent messages 922 ms
   2024-08-19T15:48:49,718+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer3 received 244279 unique messages 0 duplicates in 449 s, max latency difference of subsequent messages 835 ms
   2024-08-19T15:48:49,718+0300 [main] INFO  playground.TestScenarioIssueKeyShared - Consumer consumer4 received 262116 unique messages 0 duplicates in 450 s, max latency difference of subsequent messages 716 ms
   ```
   
   The observed problem is resolved in the experiment. The maximum latency spike is 922 ms, compared to 53071 ms before.
   There is no ack hole problem either, since the maximum number of observed ack holes (`totalNonContiguousDeletedMessagesRange`) is 711, compared to 27103 before.
   However, it's likely that the correct fix is a broader change in how replays are handled in Key_Shared subscriptions.
   
   --
   
   It's possible that the issue observed in this report is related to #21199 and #21657 / #21656.
   PR #21656 is related to replays with Key_Shared subscriptions.
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!

