ncmans opened a new issue #6603: Brokers fail when updating bookkeepr
URL: https://github.com/apache/pulsar/issues/6603
 
 
   **Describe the bug**
   
   Pulsar brokers start to fail with following error when bookkeepers are 
updating (ie in rolling manner, one at a time):
   
   ```
   03:29:25.395 [BookKeeperClientScheduler-OrderedScheduler-0-0] ERROR 
org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy - Unexpected 
exception while handling joining bookie 
pulsar-bookkeeper-0.pulsar-bookkeeper.pulsar.svc.cluster.local:3181
   java.lang.NullPointerException: null
        at 
org.apache.bookkeeper.net.NetUtils.resolveNetworkLocation(NetUtils.java:77) 
~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
        at 
org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.resolveNetworkLocation(TopologyAwareEnsemblePlacementPolicy.java:779)
 ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
        at 
org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.createBookieNode(TopologyAwareEnsemblePlacementPolicy.java:775)
 ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
        at 
org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.handleBookiesThatJoined(TopologyAwareEnsemblePlacementPolicy.java:707)
 ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
        at 
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.handleBookiesThatJoined(RackawareEnsemblePlacementPolicyImpl.java:79)
 ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
        at 
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.handleBookiesThatJoined(RackawareEnsemblePlacementPolicy.java:246)
 ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
        at 
org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.onClusterChanged(TopologyAwareEnsemblePlacementPolicy.java:654)
 ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
        at 
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.onClusterChanged(RackawareEnsemblePlacementPolicyImpl.java:79)
 ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
        at 
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.onClusterChanged(RackawareEnsemblePlacementPolicy.java:89)
 ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
        at 
org.apache.bookkeeper.client.BookieWatcherImpl.processWritableBookiesChanged(BookieWatcherImpl.java:171)
 ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
        at 
org.apache.bookkeeper.client.BookieWatcherImpl.lambda$initialBlockingBookieRead$1(BookieWatcherImpl.java:206)
 ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
        at 
org.apache.bookkeeper.discover.ZKRegistrationClient$WatchTask.accept(ZKRegistrationClient.java:139)
 [org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
        at 
org.apache.bookkeeper.discover.ZKRegistrationClient$WatchTask.accept(ZKRegistrationClient.java:62)
 [org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
        at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
 [?:1.8.0_232]
        at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
 [?:1.8.0_232]
        at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
 [?:1.8.0_232]
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
[?:1.8.0_232]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
[?:1.8.0_232]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
 [?:1.8.0_232]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 [?:1.8.0_232]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_232]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_232]
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
   03:29:28.764 [BookKeeperClientScheduler-OrderedScheduler-0-0] INFO  
org.apache.bookkeeper.proto.PerChannelBookieClient - Timed-out 1 operations to 
channel [id: 0x8f82b194, L:/10.1.27.4:35010 - 
R:pulsar-bookkeeper-3.pulsar-bookkeeper.pulsar.svc.cluster.local/10.1.30.30:3181]
 for pulsar-bookkeeper-3.pulsar-bookkeeper.pulsar.svc.cluster.local:3181
   03:29:28.764 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  
org.apache.bookkeeper.client.PendingReadOp - Error: Bookie operation timeout 
while reading L1707 E0 from bookie: 
pulsar-bookkeeper-3.pulsar-bookkeeper.pulsar.svc.cluster.local:3181
   03:29:29.426 [pulsar-client-io-41-1] WARN  
org.apache.pulsar.client.impl.ConsumerImpl - 
[persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck][reader-62bafd6e50] 
Failed to subscribe to topic on /10.1.27.4:6650
   03:29:29.426 [pulsar-client-io-41-1] WARN  
org.apache.pulsar.client.impl.PulsarClientImpl - 
[persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] Failed to get create 
topic reader
   java.util.concurrent.CompletionException: 
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 696 lookup 
request timedout after ms 30000
        at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 ~[?:1.8.0_232]
        at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 ~[?:1.8.0_232]
        at 
java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:714) 
~[?:1.8.0_232]
        at 
java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701)
 ~[?:1.8.0_232]
        at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
~[?:1.8.0_232]
        at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 ~[?:1.8.0_232]
        at 
org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1023)
 ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
        at 
org.apache.pulsar.client.impl.ClientCnx.lambda$channelActive$0(ClientCnx.java:187)
 ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) 
[io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
        at 
io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:156) 
[io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
        at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
 [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
        at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510)
 [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) 
[io.netty-netty-transport-4.1.43.Final.jar:4.1.43.Final]
        at 
io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050)
 [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
        at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
[io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
   Caused by: 
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 696 lookup 
request timedout after ms 30000
        at 
org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1022)
 ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
        ... 10 more
   03:29:29.427 [pulsar-client-io-41-1] WARN  
org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - 
R:/10.1.27.4:6650] request 696 timed out after 30000 ms
   03:29:29.427 [pulsar-client-io-41-1] ERROR 
org.apache.pulsar.client.impl.ProducerImpl - 
[persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] [null] Failed to create 
producer: 697 lookup request timedout after ms 30000
   03:29:29.430 [pulsar-client-io-41-1] INFO  
org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [24/Mar/2020:03:28:34 
+0000] "GET /admin/v2/brokers/health HTTP/1.1" 500 2442 "-" 
"Pulsar-Java-v2.5.0" 55407
   03:29:29.430 [pulsar-client-io-41-1] WARN  
org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - 
R:/10.1.27.4:6650] request 697 timed out after 30000 ms
   03:29:29.431 [pulsar-client-io-41-1] WARN  
org.apache.pulsar.client.impl.ConsumerImpl - 
[persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck][reader-d9165e25c4] 
Failed to subscribe to topic on /10.1.27.4:6650
   03:29:29.431 [pulsar-client-io-41-1] WARN  
org.apache.pulsar.client.impl.PulsarClientImpl - 
[persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] Failed to get create 
topic reader
   java.util.concurrent.CompletionException: 
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 702 lookup 
request timedout after ms 30000
        at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 ~[?:1.8.0_232]
        at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 ~[?:1.8.0_232]
        at 
java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:714) 
~[?:1.8.0_232]
        at 
java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701)
 ~[?:1.8.0_232]
        at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
~[?:1.8.0_232]
        at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 ~[?:1.8.0_232]
        at 
org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1023)
 ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
        at 
org.apache.pulsar.client.impl.ClientCnx.lambda$channelActive$0(ClientCnx.java:187)
 ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) 
[io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
        at 
io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:156) 
[io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
        at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
 [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
        at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510)
 [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) 
[io.netty-netty-transport-4.1.43.Final.jar:4.1.43.Final]
        at 
io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050)
 [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
        at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
[io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
   Caused by: 
org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 702 lookup 
request timedout after ms 30000
        at 
org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1022)
 ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
        ... 10 more
   03:29:29.433 [pulsar-client-io-41-1] WARN  
org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - 
R:/10.1.27.4:6650] request 702 timed out after 30000 ms
   03:29:29.433 [pulsar-client-io-41-1] ERROR 
org.apache.pulsar.client.impl.ProducerImpl - 
[persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] [null] Failed to create 
producer: 703 lookup request timedout after ms 30000
   03:29:29.447 [pulsar-client-io-41-1] INFO  
org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [24/Mar/2020:03:28:34 
+0000] "GET /admin/v2/brokers/health HTTP/1.1" 500 2442 "-" 
"Pulsar-Java-v2.5.0" 54952
   03:29:29.448 [pulsar-client-io-41-1] WARN  
org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - 
R:/10.1.27.4:6650] request 703 timed out after 30000 ms
   ```
   
   The healthcheck mentioned is the following command:
   
   ```
   $ /pulsar/bin/pulsar-admin brokers healthcheck
   ```
   
   NOTE: this bug was [first 
reported](https://github.com/kafkaesque-io/pulsar-helm-chart/issues/23) to the 
relevant kubernetes chart maintainers with no resolutions as of this issue.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Install the pulsar chart from 
https://github.com/kafkaesque-io/pulsar-helm-chart/tree/master/helm-chart-sources/pulsar
   2. Make a change to the bookkeeper statefulset to trigger a rolling update: 
`kubectl edit sts pulsar-bookkeeper` and add an annotation to 
`spec.template.metadata.annotations`.
   3. Watch the pods: `watch `kubectl get pod`
   
   **Expected behavior**
   
   Bookies should rolling update one by one with nothing else failing.
   
   **Actual behavior**
   
   Brokers start to fail and get restarted by the scheduler.
   
   **Additional context**
   
   Environment: Kubernetes cluster on GKE
   
   ```
   $ kubectl version
   Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", 
GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"archive", 
BuildDate:"2020-02-29T16:37:45Z", GoVersion:"go1.14", Compiler:"gc", 
Platform:"linux/amd64"}
   Server Version: version.Info{Major:"1", Minor:"14+", 
GitVersion:"v1.14.10-gke.17", 
GitCommit:"bdceba0734835c6cb1acbd1c447caf17d8613b44", GitTreeState:"clean", 
BuildDate:"2020-01-17T23:10:13Z", GoVersion:"go1.12.12b4", Compiler:"gc", 
Platform:"linux/amd64"}
   ```
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

Reply via email to