rauluka opened a new issue #8008:
URL: https://github.com/apache/pulsar/issues/8008
**Issue description**
Our setup:
Pulsar deployed on an AKS cluster (version 1.17.7).
Pulsar components:
- 5 bookie PODs
- 3 broker PODs
- 3 ZooKeeper PODs
- 3 Pulsar proxy PODs
When one of the AKS nodes is drained (all PODs are moved off it), we cannot send messages to Pulsar due to a `PulsarClientException$TimeoutException`. In fact, whenever one ZooKeeper, one bookie, and one broker POD are killed at the same time, it is impossible to send a message to Pulsar, even though all components still have the required quorums and redundancy (we still have 2 ZooKeepers, 4 bookies, and 2 brokers in that case).
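For context, the timeout in the trace below is the client-side producer send timeout (30 seconds by default): if no broker acknowledges a message within that window, all pending messages fail with `PulsarClientException$TimeoutException`. A minimal sketch of the producer configuration involved; the service URL and topic name are illustrative, not from our setup:

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class SendTimeoutExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical service URL; in our deployment the client goes
        // through the Pulsar proxies.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // sendTimeout is the window in which the client must receive an
        // ack from a broker; when it expires, failPendingMessages() fails
        // every queued message with the TimeoutException seen in the trace.
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/my-topic")
                .sendTimeout(30, TimeUnit.SECONDS)
                .create();

        producer.send("hello".getBytes());

        producer.close();
        client.close();
    }
}
```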
Stack trace:
```
"stack_trace": "org.apache.pulsar.client.api.PulsarClientException$TimeoutException: The producer pulsar-49-52 can not send message to the topic persistent://public/XXXXXXX within given timeout
\tat org.apache.pulsar.client.impl.ProducerImpl.run(ProducerImpl.java:1431)
\t... 5 common frames omitted
Wrapped by: java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: The producer pulsar-49-52 can not send message to the topic persistent://publicXXX within given timeout
\tat java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
\tat java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
\tat java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
\tat java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
\tat java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
\tat org.apache.pulsar.client.impl.ProducerImpl$1.sendComplete(ProducerImpl.java:305)
\tat org.apache.pulsar.client.impl.ProducerImpl.lambda$failPendingMessages$18(ProducerImpl.java:1465)
\tat java.util.concurrent.ArrayBlockingQueue.forEach(ArrayBlockingQueue.java:1456)
\tat org.apache.pulsar.client.impl.ProducerImpl.failPendingMessages(ProducerImpl.java:1455)
\tat org.apache.pulsar.client.impl.ProducerImpl.run(ProducerImpl.java:1433)
\tat org.apache.pulsar.shade.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
\tat org.apache.pulsar.shade.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
\tat org.apache.pulsar.shade.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
\tat org.apache.pulsar.shade.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
\tat java.lang.Thread.run(Thread.java:834)
```
```
{
"logLevel": "ERROR",
"logThread": "BookKeeperClientWorker-OrderedExecutor-0-0",
"logger": "org.apache.bookkeeper.client.ReadLastConfirmedOp",
"message": "While readLastConfirmed ledger: 9166 did not hear success responses from all quorums",
"stack_trace": null
}
```
```
{
"logLevel": "ERROR",
"logThread": "ReplicationWorker",
"logger": "org.apache.bookkeeper.replication.ReplicationWorker",
"message": "UnavailableException while replicating fragments",
"stack_trace": "org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper
\tat org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.isLedgerReplicationEnabled(ZkLedgerUnderreplicationManager.java:728)
\tat org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.waitIfLedgerReplicationDisabled(ZkLedgerUnderreplicationManager.java:619)
\tat org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:600)
\tat org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:272)
\tat org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:238)
\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
\tat java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ledgers/underreplication/disable
\tat org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
\tat org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
\tat org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2021)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2301(ZooKeeperClient.java:70)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:830)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:824)
\tat org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:824)
\tat org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2049)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2401(ZooKeeperClient.java:70)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:851)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:845)
\tat org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:845)
\tat org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.isLedgerReplicationEnabled(ZkLedgerUnderreplicationManager.java:723)
\t... 6 more
"}
```
**To Reproduce**
Set up a Pulsar cluster on AKS with a redundant configuration:
- 5 bookie PODs
- 3 broker PODs
- 3 ZooKeeper PODs
- 3 Pulsar proxy PODs
Kill one ZooKeeper, one broker, and one bookie POD at the same time.
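The simultaneous-kill step can be scripted; a sketch, assuming a `pulsar` namespace and illustrative StatefulSet pod names (adjust to your deployment's naming):

```shell
# Delete one pod of each kind in a single command so they go down together.
# Pod names here are hypothetical; list your actual pods with
# `kubectl -n pulsar get pods` first.
kubectl -n pulsar delete pod \
  pulsar-zookeeper-0 pulsar-broker-0 pulsar-bookie-0 --wait=false

# Watch the pods reschedule while a test producer runs in another shell.
kubectl -n pulsar get pods -w
```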
**Expected behavior**
Since we have redundancy at each level (ZooKeepers, brokers, bookies, proxies), we expect Pulsar to remain fully operational when only one POD of each kind is down.