4onni opened a new issue #8093:
URL: https://github.com/apache/pulsar/issues/8093


   **Describe the bug**
   Client hangs forever when all brokers stop and then restart.
   
   - After all brokers get shutdown and restarted, some running clients might 
get stuck in the `lookup -> timeout -> lookup` loop: 
   ```
   2020-09-21 16:37:15.616 +0800 [pulsar-client-io-1-1] WARN  
[org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:77] - 
[persistent://monitor/metrics/metrics-push-service-partition-1] 
[push-service_B_sub] Error connecting to broker: 
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202369 
lookup request timedout after ms 30000
   2020-09-21 16:37:15.616 +0800 [pulsar-client-io-1-1] WARN  
[org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:101] 
- [persistent://monitor/metrics/metrics-push-service-partition-1] 
[push-service_B_sub] Could not get connection to broker: 
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202369 
lookup request timedout after ms 30000 -- Will try again in 46.51 s
   2020-09-21 16:37:15.629 +0800 [pulsar-external-listener-3-1] WARN  
[org.apache.pulsar.client.impl.PulsarClientImpl] [PulsarClientImpl.java:695] - 
[topic: persistent://monitor/metrics/metrics-bdms] Could not get connection 
while getPartitionedTopicMetadata -- Will try again in 198 ms
   2020-09-21 16:37:15.757 +0800 [pulsar-client-io-1-1] WARN  
[org.apache.pulsar.client.impl.BinaryProtoLookupService] 
[BinaryProtoLookupService.java:164] - 
[persistent://monitor/metrics/metrics-rabbitmq-partition-2] failed to send 
lookup request : 
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202370 
lookup request timedout after ms 30000
   2020-09-21 16:37:15.757 +0800 [pulsar-client-io-1-1] WARN  
[org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:77] - 
[persistent://monitor/metrics/metrics-rabbitmq-partition-2] [rabbitmq_B_sub] 
Error connecting to broker: 
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202370 
lookup request timedout after ms 30000
   2020-09-21 16:37:15.757 +0800 [pulsar-client-io-1-1] WARN  
[org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:101] 
- [persistent://monitor/metrics/metrics-rabbitmq-partition-2] [rabbitmq_B_sub] 
Could not get connection to broker: 
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202370 
lookup request timedout after ms 30000 -- Will try again in 46.895 s
   2020-09-21 16:37:15.951 +0800 [pulsar-client-io-1-1] WARN  
[org.apache.pulsar.client.impl.BinaryProtoLookupService] 
[BinaryProtoLookupService.java:164] - 
[persistent://monitor/metrics/metrics-transporter-op-api-partition-2] failed to 
send lookup request : 
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202371 
lookup request timedout after ms 30000
   2020-09-21 16:37:15.952 +0800 [pulsar-client-io-1-1] WARN  
[org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:77] - 
[persistent://monitor/metrics/metrics-transporter-op-api-partition-2] 
[transporter-op-api_B_sub] Error connecting to broker: 
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202371 
lookup request timedout after ms 30000
   2020-09-21 16:37:15.952 +0800 [pulsar-client-io-1-1] WARN  
[org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:101] 
- [persistent://monitor/metrics/metrics-transporter-op-api-partition-2] 
[transporter-op-api_B_sub] Could not get connection to broker: 
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202371 
lookup request timedout after ms 30000 -- Will try again in 46.886 s
   2020-09-21 16:37:16.295 +0800 [pulsar-client-io-1-1] WARN  
[org.apache.pulsar.client.impl.BinaryProtoLookupService] 
[BinaryProtoLookupService.java:164] - 
[persistent://monitor/metrics/metrics-management-supplier-partition-0] failed 
to send lookup request : 
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202372 
lookup request timedout after ms 30000
   2020-09-21 16:37:16.295 +0800 [pulsar-client-io-1-1] WARN  
[org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:77] - 
[persistent://monitor/metrics/metrics-management-supplier-partition-0] 
[management-supplier_B_sub] Error connecting to broker: 
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202372 
lookup request timedout after ms 30000
   2020-09-21 16:37:16.295 +0800 [pulsar-client-io-1-1] WARN  
[org.apache.pulsar.client.impl.ConnectionHandler] [ConnectionHandler.java:101] 
- [persistent://monitor/metrics/metrics-management-supplier-partition-0] 
[management-supplier_B_sub] Could not get connection to broker: 
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 202372 
lookup request timedout after ms 30000 -- Will try again in 47.718 s
   2020-09-21 16:37:17.486 +0800 [pulsar-client-io-1-1] WARN  
[org.apache.pulsar.client.impl.ConnectionPool] [ConnectionPool.java:212] - 
Failed to open connection to 10.42.99.65:6650 : 
org.apache.pulsar.shade.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 finishConnect(..) failed: Connection refused: /10.42.99.65:6650
   2020-09-21 16:37:17.586 +0800 [pulsar-external-listener-3-1] WARN  
[org.apache.pulsar.client.impl.PulsarClientImpl] [PulsarClientImpl.java:695] - 
[topic: persistent://monitor/metrics/metrics-pmml-cloud] Could not get 
connection while getPartitionedTopicMetadata -- Will try again in 100 ms
   2020-09-21 16:37:17.587 +0800 [pulsar-client-io-1-1] WARN  
[org.apache.pulsar.client.impl.ConnectionPool] [ConnectionPool.java:212] - 
Failed to open connection to 10.42.97.86:6650 : 
org.apache.pulsar.shade.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 finishConnect(..) failed: Connection refused: /10.42.97.86:6650
   2020-09-21 16:37:17.771 +0800 [pulsar-external-listener-3-1] WARN  
[org.apache.pulsar.client.impl.PulsarClientImpl] [PulsarClientImpl.java:695] - 
[topic: persistent://monitor/metrics/metrics-pmml-cloud] Could not get 
connection while getPartitionedTopicMetadata -- Will try again in 184 ms
   ```
   
   - lookup request hangs on broker as well:
   ```
   [app@pulsar-ndev42-001 apache-pulsar-2.6.0]$ curl 
http://localhost:8080/lookup/v2/topic/persistent/monitor/metrics/2 -vL | jq .
     % Total    % Received % Xferd  Average Speed   Time    Time     Time  
Current
                                    Dload  Upload   Total   Spent    Left  Speed
     0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     
0* About to connect() to localhost port 8080 (#0)
   *   Trying 127.0.0.1...
   * Connected to localhost (127.0.0.1) port 8080 (#0)
   > GET /lookup/v2/topic/persistent/monitor/metrics/2 HTTP/1.1
   > User-Agent: curl/7.29.0
   > Host: localhost:8080
   > Accept: */*
   >
     0     0    0     0    0     0      0      0 --:--:--  0:00:31 --:--:--     0
    
     0     0    0     0    0     0      0      0 --:--:--  0:00:36 --:--:--     0
   ```
   
   **To Reproduce**
   Steps to reproduce the behavior:
   1. When running, shutdown all the brokers
   2. Start them
   3. A set of clients hangs
   
   **Expected behavior**
   Client should auto-recover after broker restarts
   
   **Screenshots**
   See logs above
   
   **Desktop (please complete the following information):**
   Pulsar Version: v2.6.0
   
   **Additional context**
   There are several steps need to be finished before the broker can be fully 
started, as illustrated in the pseudo code below:
   ```
   PulsarService#start():
       broker.start(); // Step 1
       webService.start(); // Step 2
       leaderElectionService.start(); //Step 3
   ```
   If a lookup request gets in between Step 2 and Step 3, a NPE would be 
thrown, which will block all other coming requests from getting processed 
properly.
   We will submit a PR shortly to fix this bug.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to