lhotari commented on issue #6259:
URL: https://github.com/apache/pulsar/issues/6259#issuecomment-719460964


   I was running into some issues where there were timeouts in the Pulsar 
client application logs. The Pulsar brokers were restarting. This was happening 
in a load test and it seemed that the broker restarts made the problem worse.
   
   To mitigate the issue, the brokers were recently configured to use 
zookeeperSessionExpiredPolicy=reconnect setting. The broker zookeeper timeout 
related settings are at the default values 
(zooKeeperSessionTimeoutMillis=30000, zooKeeperOperationTimeoutSeconds=30), 
since it seems odd that the Zookeeper interactions would take longer since the 
CPU consumption looks very low on zk in the load test. 
   
   Now, we reran the load test and the brokers became unavailable. The logs 
were filled with this type of errors:
   ```
   07:37:09.740 [pulsar-zk-session-watcher-5-1] ERROR 
org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - timeout expired for 
reconnecting, invoking shutdown service
   07:37:11.740 [pulsar-zk-session-watcher-5-1] ERROR 
org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - timeout expired for 
reconnecting, invoking shutdown service
   07:37:13.741 [pulsar-zk-session-watcher-5-1] ERROR 
org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - timeout expired for 
reconnecting, invoking shutdown service
   07:37:15.741 [pulsar-zk-session-watcher-5-1] ERROR 
org.apache.pulsar.zookeeper.ZooKeeperSessionWatcher - timeout expired for 
reconnecting, invoking shutdown service
   ```
   This is with `zookeeperSessionExpiredPolicy=reconnect`. Previously with the 
default setting (`zookeeperSessionExpiredPolicy=shutdown`), the broker would 
resume operations after a restart. Now all brokers in a 3 node cluster get 
unavailable.
   
   The benefit of using `zookeeperSessionExpiredPolicy=reconnect` is that now 
it was possible to get a thread dump from the system when it got unavailable, 
although the broker didn't ever resume operations.
   
   In the thread dump, there are quite a few hanging threads. Some stack traces 
are similar to what #8406 fixes. 
   ```
   "ForkJoinPool.commonPool-worker-7" #1258 daemon prio=5 os_prio=0 
tid=0x00007f67a09f9000 nid=0x50a waiting on condition [0x00007f676c258000]
      java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000044ca60338> (a 
java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
        at 
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3313)
        at 
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
        at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
        at 
com.github.benmanes.caffeine.cache.LocalAsyncLoadingCache$LoadingCacheView.get(LocalAsyncLoadingCache.java:400)
        at 
org.apache.pulsar.common.naming.NamespaceBundleFactory.getBundles(NamespaceBundleFactory.java:160)
        at 
org.apache.pulsar.broker.namespace.NamespaceService.getBundle(NamespaceService.java:192)
        at 
org.apache.pulsar.broker.namespace.NamespaceService.isTopicOwned(NamespaceService.java:913)
        at 
org.apache.pulsar.broker.namespace.NamespaceService.isServiceUnitOwned(NamespaceService.java:877)
        at 
org.apache.pulsar.broker.service.BrokerService.checkTopicNsOwnership(BrokerService.java:1359)
        at 
org.apache.pulsar.broker.service.persistent.PersistentTopic.subscribe(PersistentTopic.java:521)
   ```
   Full thread dump of broker:
   https://gist.github.com/lhotari/ba9c45cf4111a66f5e8960df4955bbf8
   Online thread dump report: 
https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMjAvMTAvMzAvLS1icm9rZXJfdGhyZWFkZHVtcC50eHQtLTEwLTAtMzc=&;
   
   The reason why I'm bringing this up in this issue is that a Zookeeper 
timeout exception in the broker logs can actually be caused by threads hanging 
in the broker. Hopefully #8406 gets also fixed in the upcoming Pulsar 2.6.2 
version


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to