dang-stripe opened a new issue, #17870:
URL: https://github.com/apache/pinot/issues/17870

   ## Problem
   
   This is related to https://github.com/apache/pinot/issues/17465, but for 
server <> server connections. We recently hit this when an instance became 
unresponsive for ~20 minutes due to hardware issues. When it came back up 
healthy after a reboot, we saw a 1-2 minute window during which queries 
failed.
   
   ## Timeline
   
   9:45 - Server1 goes unresponsive
   10:06 - Server1 gets rebooted
   10:08 - Server1 finishes startup and the broker starts routing queries to it. 
Query failures start as other servers get UNAVAILABLE exceptions for messages 
sent to Server1.
   10:11 - Queries stop failing, which roughly lines up with the 2-minute cap 
for GRPC exponential backoff
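
   For reference, gRPC's default connection backoff (per the gRPC connection-backoff spec) starts at 1s, grows by a factor of 1.6, and is capped at 120s, so a channel that kept failing while Server1 was down can already be waiting up to ~2 minutes before its next reconnect attempt. A small sketch of that schedule (jitter omitted for determinism):

   ```java
   import java.util.ArrayList;
   import java.util.List;

   // Computes the per-attempt reconnect delays of gRPC's default
   // connection backoff until the delay hits the 120s cap.
   public class GrpcBackoffSchedule {
       static final double INITIAL_BACKOFF_S = 1.0;
       static final double MULTIPLIER = 1.6;
       static final double MAX_BACKOFF_S = 120.0;

       static List<Double> scheduleUntilCap() {
           List<Double> delays = new ArrayList<>();
           double d = INITIAL_BACKOFF_S;
           while (d < MAX_BACKOFF_S) {
               delays.add(d);
               d = Math.min(d * MULTIPLIER, MAX_BACKOFF_S);
           }
           delays.add(MAX_BACKOFF_S); // first attempt at the cap
           return delays;
       }

       public static void main(String[] args) {
           // Delays grow 1s, 1.6s, 2.56s, ... until capped at 120s, so once a
           // peer has been down for a while, the gap between retries is ~2 min.
           System.out.println(scheduleUntilCap());
       }
   }
   ```

   This matches the ~2-3 minutes of failures seen after Server1 came back: channels deep into backoff do not notice the server is healthy until their next scheduled attempt.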
   
   ## Example logs
   Impacted server starting up cleanly again:
   ```
   2026-03-12 10:07:28.494 INFO [HelixInstanceDataManager] [Start a Pinot 
[SERVER]:119] Helix instance data manager started
   2026-03-12 10:08:47.483 INFO [ServerQueryExecutorV1Impl] [Start a Pinot 
[SERVER]:119] Query executor started
   2026-03-12 10:08:47.669 INFO [ServerQueryExecutorV1Impl] [Start a Pinot 
[SERVER]:119] Query executor started
   ```
   
   Logs from other servers:
   ```
   [2026-03-12 10:08:47.834] WARN [MailboxStatusObserver] 
[grpc-default-executor-24673:45621] Sending mailbox received an error from 
receiving side
   [2026-03-12 10:08:47.834703] io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
   [2026-03-12 10:08:47.834708]         at 
io.grpc.Status.asRuntimeException(Status.java:532)
   [2026-03-12 10:08:47.834716]         at 
io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:581)
   [2026-03-12 10:08:47.834723]         at 
io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:566)
   [2026-03-12 10:08:47.834730]         at 
io.grpc.internal.ClientCallImpl.access$100(ClientCallImpl.java:72)
   [2026-03-12 10:08:47.834738]         at 
io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:734)
   [2026-03-12 10:08:47.834746]         at 
io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:715)
   [2026-03-12 10:08:47.834752]         at 
io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
   [2026-03-12 10:08:47.834762]         at 
io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
   [2026-03-12 10:08:47.834769]         at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
   [2026-03-12 10:08:47.834777]         at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
   [2026-03-12 10:08:47.834783]         at 
java.base/java.lang.Thread.run(Thread.java:1583)
   [2026-03-12 10:08:47.834795] Caused by: 
io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 Connection refused: server1/10.20.30.40
   [2026-03-12 10:08:47.834801] Caused by: java.net.ConnectException: 
Connection refused
   [2026-03-12 10:08:47.834812]         at 
java.base/sun.nio.ch.Net.pollConnect(Native Method)
   [2026-03-12 10:08:47.834818]         at 
java.base/sun.nio.ch.Net.pollConnectNow(Net.java:694)
   [2026-03-12 10:08:47.834826]         at 
java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:973)
   [2026-03-12 10:08:47.834834]         at 
io.grpc.netty.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:336)
   [2026-03-12 10:08:47.834843]         at 
io.grpc.netty.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339)
   [2026-03-12 10:08:47.834872]         at 
io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:784)
   [2026-03-12 10:08:47.834881]         at 
io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:732)
   [2026-03-12 10:08:47.834888]         at 
io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:658)
   [2026-03-12 10:08:47.834896]         at 
io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
   [2026-03-12 10:08:47.834904]         at 
io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998)
   [2026-03-12 10:08:47.834911]         at 
io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
   [2026-03-12 10:08:47.834919]         at 
io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
   [2026-03-12 10:08:47.834927]         at 
java.base/java.lang.Thread.run(Thread.java:1583)
   ```
   
   ## Potential fix
   
   https://github.com/apache/pinot/pull/17466 resets the connection backoff for 
broker <> server GRPC connections when a server is killed uncleanly and comes 
back up, but it does not handle server <> server GRPC connections. The current 
thinking is to implement a ClusterChangeHandler on servers that listens for 
other servers coming online and resets the connection backoff for the 
corresponding GRPC channels.
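
   A minimal sketch of that idea (all names here are illustrative; Pinot's actual ClusterChangeHandler interface and mailbox channel cache differ, and the real implementation would call `io.grpc.ManagedChannel#resetConnectBackoff()`):

   ```java
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Map;
   import java.util.Set;

   // Sketch of the proposed fix: on a live-instance cluster change, reset
   // the gRPC connection backoff for channels to servers that just came
   // (back) online, so peers reconnect immediately instead of waiting out
   // the remaining exponential backoff.
   public class ServerOnlineBackoffResetHandler {

       // Stand-in for io.grpc.ManagedChannel#resetConnectBackoff().
       interface BackoffResettableChannel {
           void resetConnectBackoff();
       }

       private final Map<String, BackoffResettableChannel> _channelsByServer;
       private Set<String> _previousLiveServers = new HashSet<>();

       ServerOnlineBackoffResetHandler(Map<String, BackoffResettableChannel> channelsByServer) {
           _channelsByServer = channelsByServer;
       }

       // Called with the current live servers on a LIVE_INSTANCE change.
       // Returns the servers whose channel backoff was reset.
       Set<String> onLiveInstanceChange(List<String> liveServers) {
           Set<String> current = new HashSet<>(liveServers);
           Set<String> cameOnline = new HashSet<>(current);
           cameOnline.removeAll(_previousLiveServers);
           for (String server : cameOnline) {
               BackoffResettableChannel channel = _channelsByServer.get(server);
               if (channel != null) {
                   // Skip the remaining backoff and reconnect now.
                   channel.resetConnectBackoff();
               }
           }
           _previousLiveServers = current;
           return cameOnline;
       }
   }
   ```

   Resetting only the channels of newly-online servers keeps the reset idempotent and avoids disturbing healthy connections on unrelated cluster changes.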
   
   cc @suvodeep-pyne @Jackie-Jiang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
