4onni opened a new issue #8093: URL: https://github.com/apache/pulsar/issues/8093
**Describe the bug** Client hangs forever when all brokers stop and then restart. - After all brokers get shutdown and restarted, some running clients might get stuck in the `lookup -> timeout -> lookup` loop: ``` 2020-09-21 16:37:15.616 +0800 [pulsar-client-io-1-1] WARN [org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:77] - [persistent://monitor/metrics/metrics-push-service-partition-1] [push-service_B_sub] Error connecting to broker: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202369 lookup request timedout after ms 30000 2020-09-21 16:37:15.616 +0800 [pulsar-client-io-1-1] WARN [org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:101] - [persistent://monitor/metrics/metrics-push-service-partition-1] [push-service_B_sub] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202369 lookup request timedout after ms 30000 -- Will try again in 46.51 s 2020-09-21 16:37:15.629 +0800 [pulsar-external-listener-3-1] WARN [org.apache.pulsar.client.impl.PulsarClientImpl] [PulsarClientImpl.java:695] - [topic: persistent://monitor/metrics/metrics-bdms] Could not get connection while getPartitionedTopicMetadata -- Will try again in 198 ms 2020-09-21 16:37:15.757 +0800 [pulsar-client-io-1-1] WARN [org.apache.pulsar.client.impl.BinaryProtoLookupService] [BinaryProtoLookupService.java:164] - [persistent://monitor/metrics/metrics-rabbitmq-partition-2] failed to send lookup request : org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202370 lookup request timedout after ms 30000 2020-09-21 16:37:15.757 +0800 [pulsar-client-io-1-1] WARN [org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:77] - [persistent://monitor/metrics/metrics-rabbitmq-partition-2] [rabbitmq_B_sub] Error connecting to broker: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202370 lookup request timedout after ms 30000 2020-09-21 16:37:15.757 +0800 [pulsar-client-io-1-1] WARN [org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:101] - [persistent://monitor/metrics/metrics-rabbitmq-partition-2] [rabbitmq_B_sub] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202370 lookup request timedout after ms 30000 -- Will try again in 46.895 s 2020-09-21 16:37:15.951 +0800 [pulsar-client-io-1-1] WARN [org.apache.pulsar.client.impl.BinaryProtoLookupService] [BinaryProtoLookupService.java:164] - [persistent://monitor/metrics/metrics-transporter-op-api-partition-2] failed to send lookup request : org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202371 lookup request timedout after ms 30000 2020-09-21 16:37:15.952 +0800 [pulsar-client-io-1-1] WARN [org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:77] - [persistent://monitor/metrics/metrics-transporter-op-api-partition-2] [transporter-op-api_B_sub] Error connecting to broker: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202371 lookup request timedout after ms 30000 2020-09-21 16:37:15.952 +0800 [pulsar-client-io-1-1] WARN [org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:101] - [persistent://monitor/metrics/metrics-transporter-op-api-partition-2] [transporter-op-api_B_sub] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202371 lookup request timedout after ms 30000 -- Will try again in 46.886 s 2020-09-21 16:37:16.295 +0800 [pulsar-client-io-1-1] WARN [org.apache.pulsar.client.impl.BinaryProtoLookupService] [BinaryProtoLookupService.java:164] - [persistent://monitor/metrics/metrics-management-supplier-partition-0] failed to send lookup request : org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202372 lookup request timedout after ms 30000 2020-09-21 16:37:16.295 +0800 [pulsar-client-io-1-1] WARN [org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:77] - [persistent://monitor/metrics/metrics-management-supplier-partition-0] [management-supplier_B_sub] Error connecting to broker: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202372 lookup request timedout after ms 30000 2020-09-21 16:37:16.295 +0800 [pulsar-client-io-1-1] WARN [org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:101] - [persistent://monitor/metrics/metrics-management-supplier-partition-0] [management-supplier_B_sub] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202372 lookup request timedout after ms 30000 -- Will try again in 47.718 s 2020-09-21 16:37:17.486 +0800 [pulsar-client-io-1-1] WARN [org.apache.pulsar.client.impl.ConnectionPool] [ConnectionPool.java:212] - Failed to open connection to 10.42.99.65:6650 : org.apache.pulsar.shade.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.42.99.65:6650 2020-09-21 16:37:17.586 +0800 [pulsar-external-listener-3-1] WARN [org.apache.pulsar.client.impl.PulsarClientImpl] [PulsarClientImpl.java:695] - [topic: persistent://monitor/metrics/metrics-pmml-cloud] Could not get connection while getPartitionedTopicMetadata -- Will try again in 100 ms 2020-09-21 16:37:17.587 +0800 [pulsar-client-io-1-1] WARN [org.apache.pulsar.client.impl.ConnectionPool] [ConnectionPool.java:212] - Failed to open connection to 10.42.97.86:6650 : org.apache.pulsar.shade.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.42.97.86:6650 2020-09-21 16:37:17.771 +0800 [pulsar-external-listener-3-1] WARN [org.apache.pulsar.client.impl.PulsarClientImpl] [PulsarClientImpl.java:695] - [topic: persistent://monitor/metrics/metrics-pmml-cloud] Could not get connection while getPartitionedTopicMetadata -- Will try again in 184 ms ``` - lookup request hangs on broker as well: ``` [app@pulsar-ndev42-001 apache-pulsar-2.6.0]$ curl http://localhost:8080/lookup/v2/topic/persistent/monitor/metrics/2 -vL | jq . % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* About to connect() to localhost port 8080 (#0) * Trying 127.0.0.1... * Connected to localhost (127.0.0.1) port 8080 (#0) > GET /lookup/v2/topic/persistent/monitor/metrics/2 HTTP/1.1 > User-Agent: curl/7.29.0 > Host: localhost:8080 > Accept: */* > 0 0 0 0 0 0 0 0 --:--:-- 0:00:31 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:36 --:--:-- 0 ``` **To Reproduce** Steps to reproduce the behavior: 1. When running, shutdown all the brokers 2. Start them 3. A set of clients hangs **Expected behavior** Client should auto-recover after broker restarts **Screenshots** See logs above **Desktop (please complete the following information):** Pulsar Version: v2.6.0 **Additional context** There are several steps need to be finished before the broker can be fully started, as illustrated in the pseudo code below: ``` PulsarService#start(): broker.start(); // Step 1 webService.start(); // Step 2 leaderElectionService.start(); //Step 3 ``` If a lookup request gets in between Step 2 and Step 3, a NPE would be thrown, which will block all other coming requests from getting processed properly. We will submit a PR shortly to fix this bug. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
