ncmans opened a new issue #6603: Brokers fail when updating bookkeepr URL: https://github.com/apache/pulsar/issues/6603 **Describe the bug** Pulsar brokers start to fail with following error when bookkeepers are updating (ie in rolling manner, one at a time): ``` 03:29:25.395 [BookKeeperClientScheduler-OrderedScheduler-0-0] ERROR org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy - Unexpected exception while handling joining bookie pulsar-bookkeeper-0.pulsar-bookkeeper.pulsar.svc.cluster.local:3181 java.lang.NullPointerException: null at org.apache.bookkeeper.net.NetUtils.resolveNetworkLocation(NetUtils.java:77) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0] at org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.resolveNetworkLocation(TopologyAwareEnsemblePlacementPolicy.java:779) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0] at org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.createBookieNode(TopologyAwareEnsemblePlacementPolicy.java:775) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0] at org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.handleBookiesThatJoined(TopologyAwareEnsemblePlacementPolicy.java:707) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0] at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.handleBookiesThatJoined(RackawareEnsemblePlacementPolicyImpl.java:79) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0] at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.handleBookiesThatJoined(RackawareEnsemblePlacementPolicy.java:246) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0] at org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.onClusterChanged(TopologyAwareEnsemblePlacementPolicy.java:654) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0] at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.onClusterChanged(RackawareEnsemblePlacementPolicyImpl.java:79) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0] at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.onClusterChanged(RackawareEnsemblePlacementPolicy.java:89) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0] at org.apache.bookkeeper.client.BookieWatcherImpl.processWritableBookiesChanged(BookieWatcherImpl.java:171) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0] at org.apache.bookkeeper.client.BookieWatcherImpl.lambda$initialBlockingBookieRead$1(BookieWatcherImpl.java:206) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0] at org.apache.bookkeeper.discover.ZKRegistrationClient$WatchTask.accept(ZKRegistrationClient.java:139) [org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0] at org.apache.bookkeeper.discover.ZKRegistrationClient$WatchTask.accept(ZKRegistrationClient.java:62) [org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0] at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) [?:1.8.0_232] at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) [?:1.8.0_232] at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) [?:1.8.0_232] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_232] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_232] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_232] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_232] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_232] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232] 03:29:28.764 [BookKeeperClientScheduler-OrderedScheduler-0-0] INFO org.apache.bookkeeper.proto.PerChannelBookieClient - Timed-out 1 operations to channel [id: 0x8f82b194, L:/10.1.27.4:35010 - R:pulsar-bookkeeper-3.pulsar-bookkeeper.pulsar.svc.cluster.local/10.1.30.30:3181] for pulsar-bookkeeper-3.pulsar-bookkeeper.pulsar.svc.cluster.local:3181 03:29:28.764 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO org.apache.bookkeeper.client.PendingReadOp - Error: Bookie operation timeout while reading L1707 E0 from bookie: pulsar-bookkeeper-3.pulsar-bookkeeper.pulsar.svc.cluster.local:3181 03:29:29.426 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck][reader-62bafd6e50] Failed to subscribe to topic on /10.1.27.4:6650 03:29:29.426 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.PulsarClientImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] Failed to get create topic reader java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 696 lookup request timedout after ms 30000 at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:714) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_232] at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1023) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0] at org.apache.pulsar.client.impl.ClientCnx.lambda$channelActive$0(ClientCnx.java:187) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0] at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final] at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:156) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [io.netty-netty-transport-4.1.43.Final.jar:4.1.43.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232] Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 696 lookup request timedout after ms 30000 at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1022) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0] ... 10 more 03:29:29.427 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - R:/10.1.27.4:6650] request 696 timed out after 30000 ms 03:29:29.427 [pulsar-client-io-41-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] [null] Failed to create producer: 697 lookup request timedout after ms 30000 03:29:29.430 [pulsar-client-io-41-1] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [24/Mar/2020:03:28:34 +0000] "GET /admin/v2/brokers/health HTTP/1.1" 500 2442 "-" "Pulsar-Java-v2.5.0" 55407 03:29:29.430 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - R:/10.1.27.4:6650] request 697 timed out after 30000 ms 03:29:29.431 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck][reader-d9165e25c4] Failed to subscribe to topic on /10.1.27.4:6650 03:29:29.431 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.PulsarClientImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] Failed to get create topic reader java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 702 lookup request timedout after ms 30000 at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:714) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_232] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_232] at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1023) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0] at org.apache.pulsar.client.impl.ClientCnx.lambda$channelActive$0(ClientCnx.java:187) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0] at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final] at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:156) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final] at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [io.netty-netty-transport-4.1.43.Final.jar:4.1.43.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232] Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 702 lookup request timedout after ms 30000 at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1022) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0] ... 10 more 03:29:29.433 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - R:/10.1.27.4:6650] request 702 timed out after 30000 ms 03:29:29.433 [pulsar-client-io-41-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] [null] Failed to create producer: 703 lookup request timedout after ms 30000 03:29:29.447 [pulsar-client-io-41-1] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [24/Mar/2020:03:28:34 +0000] "GET /admin/v2/brokers/health HTTP/1.1" 500 2442 "-" "Pulsar-Java-v2.5.0" 54952 03:29:29.448 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - R:/10.1.27.4:6650] request 703 timed out after 30000 ms ``` The healthcheck mentioned is the following command: ``` $ /pulsar/bin/pulsar-admin brokers healthcheck ``` NOTE: this bug was [first reported](https://github.com/kafkaesque-io/pulsar-helm-chart/issues/23) to the relevant kubernetes chart maintainers with no resolutions as of this issue. **To Reproduce** Steps to reproduce the behavior: 1. Install the pulsar chart from https://github.com/kafkaesque-io/pulsar-helm-chart/tree/master/helm-chart-sources/pulsar 2. Make a change to the bookkeeper statefulset to trigger a rolling update: `kubectl edit sts pulsar-bookkeeper` and add an annotation to `spec.template.metadata.annotations`. 3. Watch the pods: `watch `kubectl get pod` **Expected behavior** Bookies should rolling update one by one with nothing else failing. **Actual behavior** Brokers start to fail and get restarted by the scheduler. **Additional context** Environment: Kubernetes cluster on GKE ``` $ kubectl version Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"archive", BuildDate:"2020-02-29T16:37:45Z", GoVersion:"go1.14", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.17", GitCommit:"bdceba0734835c6cb1acbd1c447caf17d8613b44", GitTreeState:"clean", BuildDate:"2020-01-17T23:10:13Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"} ```
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services