YanshuoH opened a new issue, #25028: URL: https://github.com/apache/pulsar/issues/25028
### Search before reporting

- [x] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.

### Read release policy

- [x] I understand that [unsupported versions](https://pulsar.apache.org/contribute/release-policy/#supported-versions) don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

### User environment

- broker version: 4.0.8
- broker os: Linux pulsar-broker-1a-0 6.12.40-64.114.amzn2023.aarch64 #1 SMP Tue Aug 26 05:25:54 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
- java: openjdk version "17.0.12" 2024-07-16
- client: golang
- client version: 0.17.0
- client os: same as broker
- client java version: NaN

### Issue Description

One of our scenarios is checking a user's payment after a variable delay, anywhere from 10s to 1h.

My observation is that when `individuallyDeletedMessages` grows large (100,000+, with `managedLedgerMaxUnackedRangesToPersist` also set to 100,000), message dispatching starts to misbehave: dispatch is very slow and most messages are never dispatched.

Checking the internal stats, I see something like this:

```
"numberOfEntriesSinceFirstNotAckedMessage": 751170,
"totalNonContiguousDeletedMessagesRange": 105911,
```

There are no other error messages on either the client or the server side.

I see there is a similar issue, https://github.com/apache/pulsar/issues/23200, but we are using the Shared subscription type.

### Error messages

```text
The only suspicious messages I got are from the client side, which tries to reconnect to the broker:

INFO[0960] Connecting to broker remote_addr="pulsar://pulsar-broker.pulsar1.svc.cluster.local:6650"
INFO[0960] TCP connection established local_addr="10.120.147.140:56018" remote_addr="pulsar://pulsar-broker.pulsar1.svc.cluster.local:6650"
INFO[0960] Connection is ready local_addr="10.120.147.140:56018" remote_addr="pulsar://pulsar-broker.pulsar1.svc.cluster.local:6650"

And the server performed a shedding.

Since it is very costly to keep DEBUG level logging turned on, I didn't have the chance to capture debug level messages.
```

### Reproducing the issue

I've written two parts that can reproduce the issue:

- A producer that delivers messages with a variable delay (from 10s to 1h).
- A consumer that receives the messages.

Wait for the messages to accumulate up to the expected number; the consumer then hangs and receives very few messages.

### Additional information

It might be related to the `managedLedgerMaxUnackedRangesToPersist` setting, but for our type of usage it is not possible to keep increasing this setting indefinitely, because the message volume keeps growing.

I've also noticed that when `individuallyDeletedMessages` is large, every consumer reconnection to the broker causes a CPU usage peak on both the broker and ZooKeeper. I assume this is because Pulsar is trying to compute which messages should actually be dispatched.

Is there a way to optimize or tune this? Or is this not the correct way of using Pulsar?

### Are you willing to submit a PR?

- [ ] I'm willing to submit a PR!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
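
To illustrate why a workload with widely varying delays can produce internal stats like the ones quoted in the issue description, here is a rough, self-contained Go model. It is not Pulsar code and does not use the Pulsar client. The function names `simulate` and `ackedRanges`, the publish rate of one message per millisecond, and the assumption that every message is acked exactly at its delivery time are all hypothetical choices made for this sketch. It only counts how many separate runs of acked entries sit behind the first unacked entry, which is the kind of quantity that `totalNonContiguousDeletedMessagesRange` reports.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// ackedRanges returns the index of the first unacked entry and the number of
// maximal runs of consecutive acked entries located after that index.
// Each such run corresponds to one non-contiguous acked range in this model.
func ackedRanges(acked []bool) (firstUnacked, ranges int) {
	firstUnacked = len(acked)
	for i, a := range acked {
		if !a {
			firstUnacked = i
			break
		}
	}
	inRun := false
	for i := firstUnacked; i < len(acked); i++ {
		if acked[i] && !inRun {
			ranges++ // a new run of acked entries starts here
		}
		inRun = acked[i]
	}
	return firstUnacked, ranges
}

// simulate models n messages published one every publishInterval, each with a
// uniformly random delay in [minDelay, maxDelay]. It reports which entries
// would already be acked at time `at`, assuming (hypothetically) that the
// consumer acks every message at the moment it becomes deliverable.
func simulate(n int, publishInterval, minDelay, maxDelay, at time.Duration, rng *rand.Rand) []bool {
	acked := make([]bool, n)
	for i := range acked {
		delay := minDelay + time.Duration(rng.Int63n(int64(maxDelay-minDelay)+1))
		deliverAt := time.Duration(i)*publishInterval + delay
		acked[i] = deliverAt <= at
	}
	return acked
}

func main() {
	rng := rand.New(rand.NewSource(1))
	// Assumed workload: 750,000 messages, delays between 10s and 1h,
	// observed 15 minutes after the first message was published.
	acked := simulate(750000, time.Millisecond, 10*time.Second, time.Hour, 15*time.Minute, rng)
	first, ranges := ackedRanges(acked)
	fmt.Printf("entries since first unacked: %d, non-contiguous acked ranges: %d\n",
		len(acked)-first, ranges)
}
```

Because the delays are random, acked and unacked entries interleave almost everywhere in this model, so the number of separate ranges grows with the backlog instead of collapsing into a few large ranges. Whether the real broker behaves the same way under this load depends on details the model leaves out.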
