utahkay opened a new issue #14736: URL: https://github.com/apache/pulsar/issues/14736
**Describe the bug** Pulsar 2.10 introduces the zookeeperSessionExpiredPolicy=reconnect option. If the zookeeper session is disconnected, the brokers will continue running with as much functionality as possible, and reconnect to zookeeper when it is available again. The problem happens when the brokers reconnect to zookeeper. They seem to be having locks invalidated continuously, so the broker revalidates the lock, then it's invalidated, then the broker again revalidates, then it's invalidated, and so on. The CPU and memory usage gradually increase over the next 10-15 minutes, slowing down message delivery. Eventually, the brokers run out of heap space and die due to OOM. At which point Kubernetes recreates them, and they work fine at that point, without experiencing the lock invalidation problem. Cycling the brokers at any time prior to them dying also fixes the problem. **To Reproduce** Steps to reproduce the behavior: 1. I'm running Pulsar 2.10.0.0-rc12 (StreamNative) in StreamNative Cloud Hosted environment. 2. I have a couple clients putting light load on the Pulsar cluster, publishing and consuming. 3. I kill the zookeeper cluster entirely. 4. I wait for 5 minutes. 5. I bring the zookeeper cluster back up. **Expected behavior** 1. Expect the brokers to reconnect to zookeeper - This succeeds. 2. Expect the brokers to work normally - Instead, they spend much of their time revalidating locks that are immediately invalidated. **Screenshots** A portion of the logs: ``` "Mar 17, 2022 @ 14:32:27.642","test-2-10-broker-0","2022-03-17T20:32:27,642+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully revalidated the lock on /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" "Mar 17, 2022 @ 14:32:27.642","test-2-10-broker-0","2022-03-17T20:32:27,641+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully re-acquired lock at /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" "Mar 17, 2022 @ 14:32:27.642","test-2-10-broker-0","2022-03-17T20:32:27,641+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Acquired resource lock on /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" "Mar 17, 2022 @ 14:32:27.642","test-2-10-broker-0","2022-03-17T20:32:27,642+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully re-acquired lock at /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" "Mar 17, 2022 @ 14:32:27.642","test-2-10-broker-0","2022-03-17T20:32:27,641+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully revalidated the lock on /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" "Mar 17, 2022 @ 14:32:27.652","test-2-10-broker-0","2022-03-17T20:32:27,652+0000 [metadata-store-9-1] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated" "Mar 17, 2022 @ 14:32:27.652","test-2-10-broker-0","2022-03-17T20:32:27,651+0000 [metadata-store-9-1] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated" "Mar 17, 2022 @ 14:32:27.652","test-2-10-broker-0","2022-03-17T20:32:27,652+0000 [metadata-store-9-1] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated" "Mar 17, 2022 @ 14:32:27.652","test-2-10-broker-0","2022-03-17T20:32:27,652+0000 [metadata-store-9-1] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated" "Mar 17, 2022 @ 14:32:27.959","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully revalidated the lock on /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" "Mar 17, 2022 @ 14:32:27.959","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Acquired resource lock on /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" "Mar 17, 2022 @ 14:32:27.959","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully re-acquired lock at /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" "Mar 17, 2022 @ 14:32:27.959","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully re-acquired lock at /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" "Mar 17, 2022 @ 14:32:27.959","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Acquired resource lock on /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" "Mar 17, 2022 @ 14:32:27.960","test-2-10-broker-0","2022-03-17T20:32:27,959+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully revalidated the lock on /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" "Mar 17, 2022 @ 14:32:27.967","test-2-10-broker-0","2022-03-17T20:32:27,967+0000 [metadata-store-9-1] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated" "Mar 17, 2022 @ 14:32:27.967","test-2-10-broker-0","2022-03-17T20:32:27,967+0000 [metadata-store-9-1] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated" "Mar 17, 2022 @ 14:32:27.968","test-2-10-broker-0","2022-03-17T20:32:27,967+0000 [metadata-store-9-1] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated" "Mar 17, 2022 @ 14:32:27.968","test-2-10-broker-0","2022-03-17T20:32:27,967+0000 [metadata-store-9-1] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Lock on resource /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff was invalidated" "Mar 17, 2022 @ 14:32:28.294","test-2-10-broker-0","2022-03-17T20:32:28,294+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully re-acquired lock at /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" "Mar 17, 2022 @ 14:32:28.294","test-2-10-broker-0","2022-03-17T20:32:28,294+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully revalidated the lock on /namespace/pulsar/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" "Mar 17, 2022 @ 14:32:28.294","test-2-10-broker-0","2022-03-17T20:32:28,293+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Acquired resource lock on /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" "Mar 17, 2022 @ 14:32:28.294","test-2-10-broker-0","2022-03-17T20:32:28,294+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully revalidated the lock on /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" "Mar 17, 2022 @ 14:32:28.294","test-2-10-broker-0","2022-03-17T20:32:28,293+0000 [main-EventThread] INFO org.apache.pulsar.metadata.coordination.impl.ResourceLockImpl - Successfully re-acquired lock at /namespace/pulsar/test-2-10/test-2-10-broker-0.test-2-10-broker-headless.cv-pulsar.svc.cluster.local:8080/0x00000000_0xffffffff" .... "Mar 17, 2022 @ 14:36:31.082","test-2-10-broker-0","Terminating due to java.lang.OutOfMemoryError: Java heap space" ```   **Desktop (please complete the following information):** - StreamNative Cloud Hosted (Ubuntu) **Additional context** I notice in the logs, after a few minutes, there start to be some zookeeper and MetadataStore disconnects / reconnects. More logs: [test-2-10-broker-0.log.csv](https://github.com/apache/pulsar/files/8292566/test-2-10-broker-0.log.csv) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
