rauluka opened a new issue #8008:
URL: https://github.com/apache/pulsar/issues/8008
**Issue description**
Our setup:
Pulsar deployed on an AKS cluster (version 1.17.7).
Pulsar components:
- 5 bookie PODs
- 3 broker PODs
- 3 ZooKeeper PODs
- 3 Pulsar proxy PODs
When one of the AKS nodes is drained (all PODs are moved off it), we cannot send messages to Pulsar due to a `PulsarClientException$TimeoutException`. In fact, whenever one ZooKeeper, one bookie, and one broker POD are killed at the same time, it is impossible to send a message to Pulsar, even though all components still have the required quorums and redundancy (we still have 2 ZooKeepers, 4 bookies, and 2 brokers in that case).
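For context, the timeout in the trace below is the client-side producer send timeout (30 seconds by default): if no broker acknowledges a message within that window, all pending messages fail with `PulsarClientException$TimeoutException`. A minimal sketch of the producer configuration involved; the service URL and topic name are illustrative, not from our setup:

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class SendTimeoutExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical service URL; in our deployment the client goes
        // through the Pulsar proxies.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // sendTimeout is the window in which the client must receive an
        // ack from a broker; when it expires, failPendingMessages() fails
        // every queued message with the TimeoutException seen in the trace.
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/my-topic")
                .sendTimeout(30, TimeUnit.SECONDS)
                .create();

        producer.send("hello".getBytes());

        producer.close();
        client.close();
    }
}
```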
Stack trace:
```
"stack_trace": "org.apache.pulsar.client.api.PulsarClientException$TimeoutException: The producer pulsar-49-52 can not send message to the topic persistent://public/XXXXXXX within given timeout
\tat org.apache.pulsar.client.impl.ProducerImpl.run(ProducerImpl.java:1431)
\t... 5 common frames omitted
Wrapped by: java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: The producer pulsar-49-52 can not send message to the topic persistent://publicXXX within given timeout
\tat java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
\tat java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
\tat java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
\tat java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
\tat java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
\tat org.apache.pulsar.client.impl.ProducerImpl$1.sendComplete(ProducerImpl.java:305)
\tat org.apache.pulsar.client.impl.ProducerImpl.lambda$failPendingMessages$18(ProducerImpl.java:1465)
\tat java.util.concurrent.ArrayBlockingQueue.forEach(ArrayBlockingQueue.java:1456)
\tat org.apache.pulsar.client.impl.ProducerImpl.failPendingMessages(ProducerImpl.java:1455)
\tat org.apache.pulsar.client.impl.ProducerImpl.run(ProducerImpl.java:1433)
\tat org.apache.pulsar.shade.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
\tat org.apache.pulsar.shade.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
\tat org.apache.pulsar.shade.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
\tat org.apache.pulsar.shade.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
\tat java.lang.Thread.run(Thread.java:834)
```
```
{
"logLevel": "ERROR",
"logThread": "BookKeeperClientWorker-OrderedExecutor-0-0",
"logger": "org.apache.bookkeeper.client.ReadLastConfirmedOp",
"message": "While readLastConfirmed ledger: 9166 did not hear success responses from all quorums",
"stack_trace": null
}
```
```
{
"logLevel": "ERROR",
"logThread": "ReplicationWorker",
"logger": "org.apache.bookkeeper.replication.ReplicationWorker",
"message": "UnavailableException while replicating fragments",
"stack_trace": "org.apache.bookkeeper.replication.ReplicationException$UnavailableException: Error contacting zookeeper
\tat org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.isLedgerReplicationEnabled(ZkLedgerUnderreplicationManager.java:728)
\tat org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.waitIfLedgerReplicationDisabled(ZkLedgerUnderreplicationManager.java:619)
\tat org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.getLedgerToRereplicate(ZkLedgerUnderreplicationManager.java:600)
\tat org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:272)
\tat org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:238)
\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
\tat java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ledgers/underreplication/disable
\tat org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
\tat org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
\tat org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2021)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2301(ZooKeeperClient.java:70)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:830)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient$13.call(ZooKeeperClient.java:824)
\tat org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:824)
\tat org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2049)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient.access$2401(ZooKeeperClient.java:70)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:851)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient$14.call(ZooKeeperClient.java:845)
\tat org.apache.bookkeeper.zookeeper.ZooWorker.syncCallWithRetries(ZooWorker.java:140)
\tat org.apache.bookkeeper.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:845)
\tat org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager.isLedgerReplicationEnabled(ZkLedgerUnderreplicationManager.java:723)
\t... 6 more
"}
```
**To Reproduce**
Set up a Pulsar cluster on AKS with a redundant configuration:
- 5 bookie PODs
- 3 broker PODs
- 3 ZooKeeper PODs
- 3 Pulsar proxy PODs
Kill one ZooKeeper, one broker, and one bookie POD at the same time.
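The simultaneous-kill step can be scripted; a sketch, assuming a `pulsar` namespace and illustrative StatefulSet pod names (adjust to your deployment's naming):

```shell
# Delete one pod of each kind in a single command so they go down together.
# Pod names here are hypothetical; list your actual pods with
# `kubectl -n pulsar get pods` first.
kubectl -n pulsar delete pod \
  pulsar-zookeeper-0 pulsar-broker-0 pulsar-bookie-0 --wait=false

# Watch the pods reschedule while a test producer runs in another shell.
kubectl -n pulsar get pods -w
```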
**Expected behavior**
Since we have redundancy at each level (ZooKeepers, brokers, bookies, proxies), we expect Pulsar to remain fully operational when only one POD of each kind is down.