dang-stripe opened a new issue, #17870: URL: https://github.com/apache/pinot/issues/17870
## Problem

This is related to https://github.com/apache/pinot/issues/17465, but for server <> server connections. We recently hit this when an instance became unresponsive for ~20 minutes due to hardware issues. After a reboot it came back up healthy, but we saw a 1-2 minute window during which queries failed.

## Timeline

- 9:45 - Server1 goes unresponsive
- 10:06 - Server1 gets rebooted
- 10:08 - Server1 finishes starting up and the broker starts routing queries to it. Query failures begin as other servers get `UNAVAILABLE` exceptions for messages destined for Server1.
- 10:11 - Queries stop failing, which roughly lines up with the 2-minute cap on gRPC exponential connection backoff

## Example logs

Impacted server starting up cleanly again:

```
2026-03-12 10:07:28.494 INFO [HelixInstanceDataManager] [Start a Pinot [SERVER]:119] Helix instance data manager started
2026-03-12 10:08:47.483 INFO [ServerQueryExecutorV1Impl] [Start a Pinot [SERVER]:119] Query executor started
2026-03-12 10:08:47.669 INFO [ServerQueryExecutorV1Impl] [Start a Pinot [SERVER]:119] Query executor started
```

Logs from other servers:

```
[2026-03-12 10:08:47.834] WARN [MailboxStatusObserver] [grpc-default-executor-24673:45621] Sending mailbox received an error from receiving side
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
	at io.grpc.Status.asRuntimeException(Status.java:532)
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:581)
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:566)
	at io.grpc.internal.ClientCallImpl.access$100(ClientCallImpl.java:72)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:734)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:715)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: server1/10.20.30.40
Caused by: java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.Net.pollConnect(Native Method)
	at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:694)
	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:973)
	at io.grpc.netty.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:336)
	at io.grpc.netty.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339)
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:784)
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:732)
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:658)
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
	at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:1583)
```

## Potential fix

https://github.com/apache/pinot/pull/17466 resets the connection backoff for broker <> server gRPC connections when a server is killed uncleanly and comes back up, but it did not handle server <> server gRPC connections. Current thinking is to implement a ClusterChangeHandler on servers that listens for other servers coming online and then resets the connection backoff on the corresponding gRPC channel.

cc @suvodeep-pyne @Jackie-Jiang
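For discussion, a minimal sketch of the proposed direction. All class and method names below are illustrative, not Pinot's actual APIs: in the real implementation the change notification would come from Helix (the ClusterChangeHandler firing on live-instance changes), and the reset call would be gRPC's `ManagedChannel#resetConnectBackoff()` on the channel to the recovered server. The core logic is just diffing the new live-instance set against the previous one and resetting backoff for servers that newly appear:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class ServerChannelBackoffResetter {

  /** Stand-in for io.grpc.ManagedChannel in this sketch. */
  interface Channel {
    void resetConnectBackoff();
  }

  // Channels this server holds to its peers, keyed by instance id.
  private final Map<String, Channel> _channelsByServer = new ConcurrentHashMap<>();
  // Live-instance set observed on the previous change notification.
  private Set<String> _lastLiveServers = new HashSet<>();

  public void registerChannel(String serverId, Channel channel) {
    _channelsByServer.put(serverId, channel);
  }

  /**
   * Called on each live-instance change. Diffs the new live set against the
   * previous one and resets connect backoff on channels to servers that just
   * came (back) online, so pending RPCs reconnect immediately instead of
   * waiting out gRPC's exponential backoff (capped at ~2 minutes by default).
   * Returns the set of newly-online servers, for observability.
   */
  public synchronized Set<String> onLiveInstanceChange(Set<String> liveServers) {
    Set<String> newlyOnline = new HashSet<>(liveServers);
    newlyOnline.removeAll(_lastLiveServers);
    for (String serverId : newlyOnline) {
      Channel channel = _channelsByServer.get(serverId);
      if (channel != null) {
        channel.resetConnectBackoff();
      }
    }
    _lastLiveServers = new HashSet<>(liveServers);
    return newlyOnline;
  }
}
```

One open question with this approach is ordering: the backoff reset only helps if it fires after the recovered server's gRPC port is actually accepting connections, which the 10:08 logs suggest happens before Helix marks it live, so keying off the live-instance change should be safe in the common case.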
