devinbost opened a new issue #6151: Intermittent 500's from Admin CLI and Zookeeper exceptions in broker logs URL: https://github.com/apache/pulsar/issues/6151 **Describe the bug** Periodically, running various Admin CLI commands results in a 500 error. Usually, when we run the command again several seconds later, the error does not reoccur. However, sometimes the problem seems to get worse and result in consistent 500 errors for a period of time. Sometimes, restarting brokers seems to resolve the issue. **To Reproduce** Not sure what usually causes this. In the most recent case (where the 500 errors became more consistent), there was a ZK cluster change that seemed to trigger the issue. **Logs** We found errors in broker logs: ``` 22 1/27/20 6:46:37.000 PM Jan 27 18:46:37 host05 185f08be9b1f: 01:46:37.153 [pulsar-web-33-68] INFO org.eclipse.jetty.server.RequestLog - 10.20.17.42 - - [28/Jan/2020:01:46:22 +0000] "GET /admin/v2/namespaces/production123/ingest/permissions HTTP/1.1" 500 7110 "-" "Jersey/2.27 (HttpUrlConnection 1.8.0_181)" 15009 host = host05.example.domain.comsource = /var/log/messagessourcetype = syslog 23 1/27/20 6:46:37.000 PM Jan 27 18:46:37 host05 185f08be9b1f: 01:46:37.147 [pulsar-web-33-68] ERROR org.apache.pulsar.broker.web.PulsarWebResource - Failed to get tenant admin data for tenant production123 eventtype = err0r errorhost = host05.example.domain.comsource = /var/log/messagessourcetype = syslogtag = error ``` Here's what happens when we execute CLI commands: ``` [(prod) [email protected] ~]# pulsar-admin --admin-url http://load-balanced-url.example.domain.com:8080 namespaces permissions production123/ingest { } [(prod) [email protected] ~]# pulsar-admin --admin-url http://load-balanced-url.example.domain.com:8080 namespaces permissions production123/ingest HTTP 500 Internal Server Error Reason: HTTP 500 Internal Server Error ``` We also get it with clusters: ``` [(prod) [email protected] ~]# pulsar-admin --admin-url http://load-balanced-url.example.domain.com:8080 clusters list HTTP 500 Internal Server Error Reason: HTTP 500 Internal Server Error [(prod) [email protected] ~]# pulsar-admin --admin-url http://load-balanced-url.example.domain.com:8080 clusters list pulsar-proddatacenter1 pulsar-proddatacenter2 ``` We also found these in the broker logs: ``` 02:35:44.383 [prometheus-stats-42-1] ERROR org.apache.pulsar.broker.service.BacklogQuotaManager - Failed to read policies data, will apply the default backlog quota: namespace=production456/ingest java.util.concurrent.ExecutionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) ~[?:1.8.0_212] at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:95) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.service.BacklogQuotaManager.getBacklogQuota(BacklogQuotaManager.java:63) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.service.persistent.PersistentTopic.getBacklogQuota(PersistentTopic.java:1671) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.stats.prometheus.NamespaceStatsAggregator.getTopicStats(NamespaceStatsAggregator.java:95) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.stats.prometheus.NamespaceStatsAggregator.lambda$null$0(NamespaceStatsAggregator.java:63) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:386) ~[org.apache.pulsar-pulsar-common-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:160) ~[org.apache.pulsar-pulsar-common-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.stats.prometheus.NamespaceStatsAggregator.lambda$null$1(NamespaceStatsAggregator.java:62) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:386) ~[org.apache.pulsar-pulsar-common-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:160) ~[org.apache.pulsar-pulsar-common-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.stats.prometheus.NamespaceStatsAggregator.lambda$generate$2(NamespaceStatsAggregator.java:61) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:386) ~[org.apache.pulsar-pulsar-common-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:160) ~[org.apache.pulsar-pulsar-common-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.stats.prometheus.NamespaceStatsAggregator.generate(NamespaceStatsAggregator.java:58) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsGenerator.generate(PrometheusMetricsGenerator.java:73) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsServlet.lambda$doGet$0(PrometheusMetricsServlet.java:70) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) [org.apache.pulsar-managed-ledger-original-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.9.1-streamlio-5.jar:4.9.1-streamlio-5] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_212] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_212] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212] Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.pulsar-pulsar-zookeeper-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.zookeeper.KeeperException.create(KeeperException.java:76) ~[org.apache.pulsar-pulsar-zookeeper-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$13(ZooKeeperCache.java:341) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402) ~[?:1.8.0_212] at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[?:1.8.0_212] at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) ~[?:1.8.0_212] at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[?:1.8.0_212] at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) ~[?:1.8.0_212] 02:35:44.384 [ForkJoinPool.commonPool-worker-52] WARN org.apache.pulsar.broker.service.ServerCnx - Failed to get Partitioned Metadata [/127.0.0.1:48432] persistent://prodTenant1/prodNamespace1/prodTopic1: HTTP 500 Internal Server Error org.apache.pulsar.broker.web.RestException: HTTP 500 Internal Server Error at org.apache.pulsar.broker.web.PulsarWebResource.lambda$checkLocalOrGetPeerReplicationCluster$5(PulsarWebResource.java:686) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_212] at org.apache.pulsar.zookeeper.ZooKeeperDataCache.lambda$2(ZooKeeperDataCache.java:70) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_212] at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$15(ZooKeeperCache.java:357) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_212] at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$13(ZooKeeperCache.java:341) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402) [?:1.8.0_212] at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:1.8.0_212] at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:1.8.0_212] at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:1.8.0_212] at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) [?:1.8.0_212] 02:35:44.384 [ForkJoinPool.commonPool-worker-22] WARN org.apache.pulsar.broker.service.ServerCnx - Failed to get Partitioned Metadata [/127.0.0.1:48428] persistent://prodTenant5/taps/prodTopic125-backfill: HTTP 500 Internal Server Error org.apache.pulsar.broker.web.RestException: HTTP 500 Internal Server Error at org.apache.pulsar.broker.web.PulsarWebResource.lambda$checkLocalOrGetPeerReplicationCluster$5(PulsarWebResource.java:686) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_212] at org.apache.pulsar.zookeeper.ZooKeeperDataCache.lambda$2(ZooKeeperDataCache.java:70) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_212] at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$15(ZooKeeperCache.java:357) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_212] at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$13(ZooKeeperCache.java:341) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402) [?:1.8.0_212] at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:1.8.0_212] at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:1.8.0_212] at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:1.8.0_212] at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) [?:1.8.0_212] 02:35:44.388 [pulsar-backlog-quota-checker-29-1] ERROR org.apache.pulsar.broker.service.BacklogQuotaManager - Failed to read policies data, will apply the default backlog quota: namespace=ops/deployment java.util.concurrent.ExecutionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_212] at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) ~[?:1.8.0_212] at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:95) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.service.BacklogQuotaManager.getBacklogQuota(BacklogQuotaManager.java:63) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.service.BacklogQuotaManager.getBacklogQuotaLimit(BacklogQuotaManager.java:75) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.service.BrokerService.isBacklogExceeded(BrokerService.java:926) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.service.BrokerService.lambda$monitorBacklogQuota$30(BrokerService.java:944) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.service.BrokerService.lambda$forEachTopic$29(BrokerService.java:909) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.forEach(ConcurrentOpenHashMap.java:386) ~[org.apache.pulsar-pulsar-common-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.forEach(ConcurrentOpenHashMap.java:160) ~[org.apache.pulsar-pulsar-common-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.service.BrokerService.forEachTopic(BrokerService.java:906) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.broker.service.BrokerService.monitorBacklogQuota(BrokerService.java:941) ~[org.apache.pulsar-pulsar-broker-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) [org.apache.pulsar-managed-ledger-original-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.9.1-streamlio-5.jar:4.9.1-streamlio-5] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212] at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_212] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_212] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_212] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212] Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[org.apache.pulsar-pulsar-zookeeper-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.zookeeper.KeeperException.create(KeeperException.java:76) ~[org.apache.pulsar-pulsar-zookeeper-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$13(ZooKeeperCache.java:341) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.4.0-streamlio-24.jar:2.4.0-streamlio-24] at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402) ~[?:1.8.0_212] at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[?:1.8.0_212] at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) ~[?:1.8.0_212] at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[?:1.8.0_212] ``` We're not seeing errors in Zookeeper logs. **Environment:** Hosts are all bare metal `CentOS Linux 7 (Core)` running docker instances: Brokers are all running: `streamlio/pulsar-all:2.4.0-streamlio-24` Bookkeepers are all running: `streamlio/pulsar-all:2.4.0-streamlio-24` except for two, which are running: `apachepulsar/pulsar-all:2.3.0` Zookeepers are all running: `streamlio/pulsar-all:2.4.0-streamlio-24` except for two, which are running: `apachepulsar/pulsar-all:2.3.0` **Additional context** TLDR. We're seeing 500 errors, org.apache.pulsar.broker.web.PulsarWebResource errors, org.apache.pulsar.broker.service.ServerCnx metadata warnings, and we're also seeing the Zookeeper exception `org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss` showing up in the broker logs.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
