hepyu opened a new issue, #16327: URL: https://github.com/apache/pulsar/issues/16327
**Directory** **(1).pulsar cluster info** **(2).accident site statement** **(3).I tracked** **(4). I guess** **(1).pulsar cluster info** Pulsar cluster is deployed in aws ec2, six ec2 totally. 3 ec2: ec2 is 4c8g, only deployed broker per ec2. 3 ec2: ec2 is 8c16g, deployed 1 broker and 1 bookie per ec2. pulsar version: 2.8.0 **(2).accident site statement** One broker of cluster occur bundle unload error exception stack info at 00:12:31 as this: ``` 00:12:31.855 [AsyncHttpClient-48-1] WARN org.apache.pulsar.client.admin.internal.BaseResource - [http://10.0.44.174:8080/admin/v2/namespaces/xxx/public/0xb0000000_0xc0000000/unload] Failed to perform http put request: javax.ws.rs.InternalServerErrorException: HTTP 500 Internal Server Error 00:12:31.858 [pulsar-load-manager-1-1] WARN org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Error when trying to perform load shedding on xxx/public/0xb0000000_0xc0000000 for broker 10.0.37.216:8080 org.apache.pulsar.client.admin.PulsarAdminException$ServerSideErrorException: HTTP 500 Internal Server Error at org.apache.pulsar.client.admin.internal.BaseResource.getApiException(BaseResource.java:219) ~[org.apache.pulsar-pulsar-client-admin-original-2.8.0.jar:2.8.0] at org.apache.pulsar.client.admin.internal.BaseResource$1.failed(BaseResource.java:130) ~[org.apache.pulsar-pulsar-client-admin-original-2.8.0.jar:2.8.0] at org.glassfish.jersey.client.JerseyInvocation$1.failed(JerseyInvocation.java:882) ~[org.glassfish.jersey.core-jersey-client-2.34.jar:?] at org.glassfish.jersey.client.JerseyInvocation$1.completed(JerseyInvocation.java:863) ~[org.glassfish.jersey.core-jersey-client-2.34.jar:?] at org.glassfish.jersey.client.ClientRuntime.processResponse(ClientRuntime.java:229) ~[org.glassfish.jersey.core-jersey-client-2.34.jar:?] at org.glassfish.jersey.client.ClientRuntime.access$200(ClientRuntime.java:62) ~[org.glassfish.jersey.core-jersey-client-2.34.jar:?] at org.glassfish.jersey.client.ClientRuntime$2.lambda$response$0(ClientRuntime.java:173) ~[org.glassfish.jersey.core-jersey-client-2.34.jar:?] at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) ~[org.glassfish.jersey.core-jersey-common-2.34.jar:?] at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) ~[org.glassfish.jersey.core-jersey-common-2.34.jar:?] at org.glassfish.jersey.internal.Errors.process(Errors.java:292) ~[org.glassfish.jersey.core-jersey-common-2.34.jar:?] at org.glassfish.jersey.internal.Errors.process(Errors.java:274) ~[org.glassfish.jersey.core-jersey-common-2.34.jar:?] at org.glassfish.jersey.internal.Errors.process(Errors.java:244) ~[org.glassfish.jersey.core-jersey-common-2.34.jar:?] at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:288) ~[org.glassfish.jersey.core-jersey-common-2.34.jar:?] at org.glassfish.jersey.client.ClientRuntime$2.response(ClientRuntime.java:173) ~[org.glassfish.jersey.core-jersey-client-2.34.jar:?] at org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector.lambda$apply$1(AsyncHttpConnector.java:212) ~[org.apache.pulsar-pulsar-client-admin-original-2.8.0.jar:2.8.0] at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?] at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?] at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) ~[?:?] at org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector.lambda$retryOperation$4(AsyncHttpConnector.java:254) ~[org.apache.pulsar-pulsar-client-admin-original-2.8.0.jar:2.8.0] at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?] at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?] at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073) ~[?:?] at org.asynchttpclient.netty.NettyResponseFuture.loadContent(NettyResponseFuture.java:222) ~[org.asynchttpclient-async-http-client-2.12.1.jar:?] at org.asynchttpclient.netty.NettyResponseFuture.done(NettyResponseFuture.java:257) ~[org.asynchttpclient-async-http-client-2.12.1.jar:?] at org.asynchttpclient.netty.handler.AsyncHttpClientHandler.finishUpdate(AsyncHttpClientHandler.java:241) ~[org.asynchttpclient-async-http-client-2.12.1.jar:?] at org.asynchttpclient.netty.handler.HttpHandler.handleChunk(HttpHandler.java:114) ~[org.asynchttpclient-async-http-client-2.12.1.jar:?] at org.asynchttpclient.netty.handler.HttpHandler.handleRead(HttpHandler.java:143) ~[org.asynchttpclient-async-http-client-2.12.1.jar:?] at org.asynchttpclient.netty.handler.AsyncHttpClientHandler.channelRead(AsyncHttpClientHandler.java:78) ~[org.asynchttpclient-async-http-client-2.12.1.jar:?] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[io.netty-netty-codec-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324) ~[io.netty-netty-codec-4.1.63.Final.jar:4.1.63.Final] at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296) ~[io.netty-netty-codec-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final] at java.lang.Thread.run(Thread.java:829) [?:?] Caused by: javax.ws.rs.InternalServerErrorException: HTTP 500 Internal Server Error at org.glassfish.jersey.client.JerseyInvocation.convertToException(JerseyInvocation.java:960) ~[org.glassfish.jersey.core-jersey-client-2.34.jar:?] at org.glassfish.jersey.client.JerseyInvocation.access$700(JerseyInvocation.java:82) ~[org.glassfish.jersey.core-jersey-client-2.34.jar:?] ... 54 more ``` And then, another broker occur error into "error Namespace bundle xxx/public/0xb0000000_0xc0000000 is being unloaded", its would be continuous printing, never stop. ``` 00:12:31.837 [bookkeeper-ml-scheduler-OrderedScheduler-0-0] ERROR org.apache.pulsar.broker.admin.impl.NamespacesBase - [null] Failed to unload namespace bundle xxx/public/0xb0000000_0xc0000000 00:12:31.915 [pulsar-io-4-16] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup null for topic persistent://xxx/public/xxx-4 with error Namespace bundle xxx/public/0xb0000000_0xc0000000 is being unloaded 00:12:31.916 [pulsar-io-4-16] WARN org.apache.pulsar.broker.service.ServerCnx - [/10.0.39.109:57794] Got exception io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer 00:12:31.917 [pulsar-io-4-15] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup null for topic persistent://xxx/public/xxx-partition-2 with error Namespace bundle xxx/public/0xb0000000_0xc0000000 is being unloaded 00:12:31.917 [pulsar-io-4-16] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup null for topic persistent://xxx/public/xxx_1001044 with error Namespace bundle xxx/public/0xb0000000_0xc0000000 is being unloaded 00:12:31.917 [pulsar-io-4-15] WARN org.apache.pulsar.broker.service.ServerCnx - [/10.0.32.49:34574] Got exception io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer 00:12:31.920 [pulsar-io-4-7] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup null for topic persistent://xxx/public/xxx-4 with error Namespace bundle xxx/public/0xb0000000_0xc0000000 is being unloaded 00:12:31.920 [pulsar-io-4-4] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup null for topic persistent://xxx/public/xxx-partition-4 with error Namespace bundle xxx/public/0xb0000000_0xc0000000 is being unloaded 00:12:31.921 [pulsar-io-4-9] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup null for topic persistent://xxx/public/xxx_message-partition-2 with error Namespace bundle xxx/public/0xb0000000_0xc0000000 is being unloaded 00:12:31.922 [pulsar-io-4-16] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup null for topic persistent://xxx/public/xxx_message-partition-2 with error Namespace bundle xxx/public/0xb0000000_0xc0000000 is being unloaded ``` And then 3 topics backlog occur massive accumulation! **(3).I tracked** Trigger point that bundle unload failed: <img width="1315" alt="image" src="https://user-images.githubusercontent.com/3917749/176862879-ae200fbe-7dca-4c64-b463-b55816f48942.png"> Result point that namespace bundle xxx is being unloaded: <img width="1169" alt="image" src="https://user-images.githubusercontent.com/3917749/176863064-4ac9572d-f51c-433b-a479-d7bf65e548c1.png"> **(4). I guess** Why occur trigger point, I don't know the exact cause, code is so large, dizzy. But I can guess something according to historical experiences, Topic accumulation has happened occasionally before, I think the broker is seriously short of resources and that's fact, so I double my resources for brokers, and never occur topic accumulation until this morning. I think this phenomenon can be avoided as much as possible by greatly increasing broker resources, because there are enough resources, unload will not occur. But this problem cannot be completely eliminated, because the production environment is complex,for example, the severe network fluctuation of AWS, or the sudden isolation of EC2 where a broker node is located, these emergencies may still trigger this problem. So, this bug shoud be solved still, otherwise stability is a big problem. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
