snowcrumble opened a new issue #6305: [Proxy]If zk leader was down, proxy was not work until restart it URL: https://github.com/apache/pulsar/issues/6305 **Describe the bug** My pulsar is running on k8s and our service is using pulsar proxy, I have did some failure testing. When I force crash the node which serve the zk leader, all my pulsar broker crash several times because zk connection problem, they were started final, but I must restart pulsar proxy after brokers were availabled, mentioned by #6158 . **To Reproduce** Steps to reproduce the behavior: 1. Deploy pulsar on k8s 2. use pulsar-perf tool to connect pulsar proxy and continuous pub/sub on a topic 3. "force" crash a k8s node which has zk leader 4. The reconnect mechanism doesn't work until restart pulsar proxys. **Expected behavior** Proxy can recover without restart. Or broker keep available without restart. **Screenshots** Broker error log ``` 04:55:22.078 [main] ERROR org.apache.pulsar.zookeeper.GlobalZooKeeperCache - Failed to establish global zookeeper session: null java.util.concurrent.TimeoutException: null at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) ~[?:1.8.0_232] at org.apache.pulsar.zookeeper.GlobalZooKeeperCache.start(GlobalZooKeeperCache.java:69) [org.apache.pulsar-pulsar-zookeeper-utils-2.4.2.jar:2.4.2] at org.apache.pulsar.broker.PulsarService.startZkCacheService(PulsarService.java:572) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:366) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:273) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:332) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] 04:55:22.082 [main] ERROR org.apache.pulsar.broker.PulsarService - java.io.IOException: java.util.concurrent.TimeoutException org.apache.pulsar.broker.PulsarServerException: java.io.IOException: java.util.concurrent.TimeoutException at org.apache.pulsar.broker.PulsarService.startZkCacheService(PulsarService.java:574) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:366) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:273) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:332) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] Caused by: java.io.IOException: java.util.concurrent.TimeoutException at org.apache.pulsar.zookeeper.GlobalZooKeeperCache.start(GlobalZooKeeperCache.java:76) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.2.jar:2.4.2] at org.apache.pulsar.broker.PulsarService.startZkCacheService(PulsarService.java:572) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] ... 3 more Caused by: java.util.concurrent.TimeoutException at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) ~[?:1.8.0_232] at org.apache.pulsar.zookeeper.GlobalZooKeeperCache.start(GlobalZooKeeperCache.java:69) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.2.jar:2.4.2] at org.apache.pulsar.broker.PulsarService.startZkCacheService(PulsarService.java:572) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] ... 3 more 04:55:22.083 [main] ERROR org.apache.pulsar.PulsarBrokerStarter - Failed to start pulsar service. org.apache.pulsar.broker.PulsarServerException: org.apache.pulsar.broker.PulsarServerException: java.io.IOException: java.util.concurrent.TimeoutException at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:472) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] at org.apache.pulsar.PulsarBrokerStarter$BrokerStarter.start(PulsarBrokerStarter.java:273) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] at org.apache.pulsar.PulsarBrokerStarter.main(PulsarBrokerStarter.java:332) [org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] Caused by: org.apache.pulsar.broker.PulsarServerException: java.io.IOException: java.util.concurrent.TimeoutException at org.apache.pulsar.broker.PulsarService.startZkCacheService(PulsarService.java:574) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:366) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] ... 2 more Caused by: java.io.IOException: java.util.concurrent.TimeoutException at org.apache.pulsar.zookeeper.GlobalZooKeeperCache.start(GlobalZooKeeperCache.java:76) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.2.jar:2.4.2] at org.apache.pulsar.broker.PulsarService.startZkCacheService(PulsarService.java:572) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:366) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] ... 2 more Caused by: java.util.concurrent.TimeoutException at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) ~[?:1.8.0_232] at org.apache.pulsar.zookeeper.GlobalZooKeeperCache.start(GlobalZooKeeperCache.java:69) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.2.jar:2.4.2] at org.apache.pulsar.broker.PulsarService.startZkCacheService(PulsarService.java:572) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] at org.apache.pulsar.broker.PulsarService.start(PulsarService.java:366) ~[org.apache.pulsar-pulsar-broker-2.4.2.jar:2.4.2] ... 2 more 04:55:22.084 [pulsar-ordered-OrderedExecutor-3-0-SendThread(zookeeper:2181)] WARN org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 30005ms for sessionid 0x0 ``` zk-2.log (new leader) ``` 04:51:47.727 [QuorumPeer[myid=3]/0.0.0.0:2181] INFO org.apache.zookeeper.server.quorum.QuorumPeer - LEADING 04:51:47.727 [QuorumPeer[myid=3]/0.0.0.0:2181] INFO org.apache.zookeeper.server.ZooKeeperServer - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir data/zookeeper/version-2 snapdir data/zookeeper/version-2 04:51:47.727 [QuorumPeer[myid=3]/0.0.0.0:2181] INFO org.apache.zookeeper.server.quorum.Leader - LEADING - LEADER ELECTION TOOK - 10421 04:51:47.734 [LearnerHandler-/10.30.2.141:43982] INFO org.apache.zookeeper.server.quorum.LearnerHandler - Follower sid: 1 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@6fe50a68 ``` perf-tool output ``` 06:14:00.049 [pulsar-timer-5-1] INFO org.apache.pulsar.client.impl.ConnectionHandler - [persistent://public/default/topic-perf] [cn-beijing-105-1] Reconnecting after connection was closed 06:14:00.111 [pulsar-client-io-2-2] INFO org.apache.pulsar.client.impl.ConnectionPool - [[id: 0xfa71e541, L:/172.17.0.2:47758 - R:/172.16.11.200:6650]] Connected to server 06:14:00.114 [main] INFO org.apache.pulsar.testclient.PerformanceProducer - Throughput produced: 0.0 msg/s --- 0.0 Mbit/s --- failure 0.0 msg/s --- Latency: mean: 0.000 ms - med: 0.000 - 95pct: 0.000 - 99pct: 0.000 - 99.9pct: 0.000 - 99.99pct: 0.000 - Max: 0.000 06:14:00.345 [pulsar-client-io-2-2] ERROR org.apache.pulsar.client.impl.ClientCnx - [id: 0xfa71e541, L:/172.17.0.2:47758 - R:/172.16.11.200:6650] Close connection because received internal-server error No active broker is available 06:14:00.346 [pulsar-client-io-2-2] WARN org.apache.pulsar.client.impl.BinaryProtoLookupService - [persistent://public/default/topic-perf] failed to send lookup request : org.apache.pulsar.client.api.PulsarClientException$LookupException: No active broker is available 06:14:00.347 [pulsar-client-io-2-2] WARN org.apache.pulsar.client.impl.ConnectionHandler - [persistent://public/default/topic-perf] [cn-beijing-105-1] Error connecting to broker: org.apache.pulsar.client.api.PulsarClientException$LookupException: No active broker is available 06:14:00.348 [pulsar-client-io-2-2] WARN org.apache.pulsar.client.impl.ConnectionHandler - [persistent://public/default/topic-perf] [cn-beijing-105-1] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException$LookupException: No active broker is available -- Will try again in 57.763 s 06:14:00.348 [pulsar-client-io-2-2] INFO org.apache.pulsar.client.impl.ClientCnx - [id: 0xfa71e541, L:/172.17.0.2:47758 ! R:/172.16.11.200:6650] Disconnected ```
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
