lhotari commented on issue #6259:
URL: https://github.com/apache/pulsar/issues/6259#issuecomment-719460964
I was running into issues where the Pulsar client application logs showed timeouts while the Pulsar brokers were restarting. This happened during a load test, and the broker restarts seemed to make the problem worse.
To mitigate the issue, the brokers were recently configured with the `zookeeperSessionExpiredPolicy=reconnect` setting. The broker's ZooKeeper timeout settings are at their defaults (`zooKeeperSessionTimeoutMillis=30000`, `zooKeeperOperationTimeoutSeconds=30`), since it seemed unlikely that the ZooKeeper interactions were genuinely slow: CPU consumption on the ZooKeeper nodes was very low during the load test.
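For reference, the configuration described above would look roughly like this in `broker.conf` (the `reconnect` value is the change under discussion; the two timeouts are the stock defaults):

```
# broker.conf (sketch; timeout values shown are the defaults)
zookeeperSessionExpiredPolicy=reconnect
zooKeeperSessionTimeoutMillis=30000
zooKeeperOperationTimeoutSeconds=30
```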
We then reran the load test and the brokers became unavailable. The logs were filled with errors like these:
```
07:37:09.740 [pulsar-zk-session-watcher-5-1] ERROR org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - timeout expired for reconnecting, invoking shutdown service
07:37:11.740 [pulsar-zk-session-watcher-5-1] ERROR org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - timeout expired for reconnecting, invoking shutdown service
07:37:13.741 [pulsar-zk-session-watcher-5-1] ERROR org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - timeout expired for reconnecting, invoking shutdown service
07:37:15.741 [pulsar-zk-session-watcher-5-1] ERROR org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - timeout expired for reconnecting, invoking shutdown service
```
This is with `zookeeperSessionExpiredPolicy=reconnect`. Previously, with the default setting (`zookeeperSessionExpiredPolicy=shutdown`), the broker would resume operations after a restart. Now all brokers in the 3-node cluster become unavailable.
One benefit of `zookeeperSessionExpiredPolicy=reconnect` is that it was possible to take a thread dump of the system while it was unavailable, although the broker never resumed operations.
The thread dump contains quite a few hanging threads. Some stack traces are similar to what #8406 fixes.
```
"ForkJoinPool.commonPool-worker-7" #1258 daemon prio=5 os_prio=0 tid=0x00007f67a09f9000 nid=0x50a waiting on condition [0x00007f676c258000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x000000044ca60338> (a java.util.concurrent.CompletableFuture$Signaller)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
    at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
    at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
    at com.github.benmanes.caffeine.cache.LocalAsyncLoadingCache$LoadingCacheView.get(LocalAsyncLoadingCache.java:400)
    at org.apache.pulsar.common.naming.NamespaceBundleFactory.getBundles(NamespaceBundleFactory.java:160)
    at org.apache.pulsar.broker.namespace.NamespaceService.getBundle(NamespaceService.java:192)
    at org.apache.pulsar.broker.namespace.NamespaceService.isTopicOwned(NamespaceService.java:913)
    at org.apache.pulsar.broker.namespace.NamespaceService.isServiceUnitOwned(NamespaceService.java:877)
    at org.apache.pulsar.broker.service.BrokerService.checkTopicNsOwnership(BrokerService.java:1359)
    at org.apache.pulsar.broker.service.persistent.PersistentTopic.subscribe(PersistentTopic.java:521)
```
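The hang pattern visible in the trace is an unbounded `CompletableFuture.get()` parked on a `ForkJoinPool` worker: if the future never completes (for example, because the async cache load itself is stuck), the thread parks forever. A minimal standalone sketch (not Pulsar code) showing why a bounded wait fails fast instead of hanging:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BlockingGetDemo {
    public static void main(String[] args) throws Exception {
        // A future that is never completed, standing in for a stuck async cache load.
        CompletableFuture<String> never = new CompletableFuture<>();
        try {
            // never.get() with no timeout would park this thread forever,
            // exactly like the WAITING (parking) thread in the dump above.
            // A bounded wait surfaces the problem as an exception instead.
            never.get(100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("timed out instead of hanging");
        }
    }
}
```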
Full thread dump of broker:
https://gist.github.com/lhotari/ba9c45cf4111a66f5e8960df4955bbf8
Online thread dump report:
https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMjAvMTAvMzAvLS1icm9rZXJfdGhyZWFkZHVtcC50eHQtLTEwLTAtMzc=&
The reason I'm bringing this up in this issue is that a ZooKeeper timeout exception in the broker logs can actually be caused by threads hanging inside the broker. Hopefully the fix for #8406 also makes it into the upcoming Pulsar 2.6.2 release.