dang-stripe opened a new issue, #17465:
URL: https://github.com/apache/pinot/issues/17465

   Related to https://github.com/apache/pinot/issues/15997
   
   ## Problem statement
   
   We noticed query failures while testing the resilience of MSE queries by issuing `kill -9` on a server process. Once the server comes back up, queries fail for 2-3 minutes. We did not observe this behavior with a graceful termination.
   
   We do have the server failure detector enabled on our brokers with:
   
   ```
   pinot.broker.failure.detector.type: CONNECTION
   ```
   
   Timeline:
   
   ```
   2025-12-17 23:43:00 - `kill -9` command issued to a server
   2025-12-17 23:46:37 - server completes startup + query errors start
   2025-12-17 23:48:32 - query errors stop
   ```
   
   Server startup log:
   
   ```
   2025-12-17 23:46:37.367 INFO [ServerInstance] [Start a Pinot [SERVER]:156] 
Starting query server
   2025-12-17 23:46:37.367 INFO [ServerInstance] [Start a Pinot [SERVER]:156] 
Starting query executor
   2025-12-17 23:46:37.367 INFO [ServerQueryExecutorV1Impl] [Start a Pinot 
[SERVER]:156] Query executor started
   2025-12-17 23:46:37.367 INFO [ServerInstance] [Start a Pinot [SERVER]:156] 
Starting query scheduler
   2025-12-17 23:46:37.367 INFO [ServerInstance] [Start a Pinot [SERVER]:156] 
Starting Netty query server
   2025-12-17 23:46:37.408 INFO [ServerInstance] [Start a Pinot [SERVER]:156] 
Starting worker query server
   2025-12-17 23:46:37.408 INFO [QueryServer] [Start a Pinot [SERVER]:156] 
Starting QueryServer
   2025-12-17 23:46:37.425 INFO [QueryServer] [Start a Pinot [SERVER]:156] 
Initialized QueryServer on port: 8421
   2025-12-17 23:46:37.425 INFO [MailboxService] [Start a Pinot [SERVER]:156] 
Starting GrpcMailboxServer
   2025-12-17 23:46:37.443 INFO [ServerQueryExecutorV1Impl] [Start a Pinot 
[SERVER]:156] Query executor started
   2025-12-17 23:46:37.443 INFO [ServerInstance] [Start a Pinot [SERVER]:156] 
Finish starting query server
   2025-12-17 23:46:37.443 INFO [ZKHelixAdmin] [Start a Pinot [SERVER]:156] Set 
configs with keys
   2025-12-17 23:46:37.467 INFO [BaseServerStarter] [Start a Pinot 
[SERVER]:156] Pinot server ready
   2025-12-17 23:46:37.472 INFO [PinotServiceManager] [Start a Pinot 
[SERVER]:156] Pinot Server instance [Server_st6] is Started...
   2025-12-17 23:46:37.472 INFO [StartServiceManagerCommand] [Start a Pinot 
[SERVER]:156] Started Pinot [SERVER] instance [Server_st6] at 185.296s since 
launch
   ```
   
   Broker log - first query error:
   
   ```
   2025-12-17 23:46:37.840
   ERROR [PinotClientRequest] 
[jersey-server-managed-async-executor-107801:137832] Query processing 
exceptions: {200=QueryExecutionError: Error dispatching query: 
1972319055000858327 to server: pinotstreaming--0f98@{8421,8442}
   [2025-12-17 23:46:37.840520] 
org.apache.pinot.query.service.dispatch.QueryDispatcher.processResults(QueryDispatcher.java:379)
   [2025-12-17 23:46:37.840529] 
org.apache.pinot.query.service.dispatch.QueryDispatcher.execute(QueryDispatcher.java:334)
   [2025-12-17 23:46:37.840535] 
org.apache.pinot.query.service.dispatch.QueryDispatcher.submit(QueryDispatcher.java:268)
   [2025-12-17 23:46:37.840558] 
org.apache.pinot.query.service.dispatch.QueryDispatcher.submitAndReduce(QueryDispatcher.java:170)
   [2025-12-17 23:46:37.840567] UNAVAILABLE: io exception
   [2025-12-17 23:46:37.840572] 
io.grpc.Status.asRuntimeException(Status.java:532)
   [2025-12-17 23:46:37.840579] 
io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:564)
   [2025-12-17 23:46:37.840586] 
io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:565)
   [2025-12-17 23:46:37.840592] 
io.grpc.internal.ClientCallImpl.access$100(ClientCallImpl.java:72)
   [2025-12-17 23:46:37.840599] Connection refused: pinotstreaming--0f98@8421
   [2025-12-17 23:46:37.840604] Connection refused
   [2025-12-17 23:46:37.840610] java.base/sun.nio.ch.Net.pollConnect(Native 
Method)
   [2025-12-17 23:46:37.840615] 
java.base/sun.nio.ch.Net.pollConnectNow(Net.java:682)
   [2025-12-17 23:46:37.840633] 
java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:973)
   [2025-12-17 23:46:37.840645] 
io.grpc.netty.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:336)
   [2025-12-17 23:46:37.840649] Connection refused
   [2025-12-17 23:46:37.840655] java.base/sun.nio.ch.Net.pollConnect(Native 
Method)
   [2025-12-17 23:46:37.840661] 
java.base/sun.nio.ch.Net.pollConnectNow(Net.java:682)
   [2025-12-17 23:46:37.840667] 
java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:973)
   [2025-12-17 23:46:37.840676] 
io.grpc.netty.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:336)
   [2025-12-17 23:46:37.840679] }
   ```
   
   ## Root cause
   
   By default, gRPC retries connections for ManagedChannels using an exponential backoff policy capped at 2 minutes. In https://github.com/apache/pinot/blob/master/pinot-broker/src/main/java/org/apache/pinot/broker/routing/BrokerRoutingManager.java#L387-L390, we remove the server from the excluded set as soon as it comes up healthy, regardless of whether the underlying gRPC channel is actually in a READY state. If this happens before the channel has had a chance to re-establish the connection, queries dispatched to that server fail.
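
   As a sketch of a possible fix, the broker could gate re-inclusion on the channel leaving TRANSIENT_FAILURE. The helper below uses the standard `io.grpc.ManagedChannel` APIs (`getState(true)` and `notifyWhenStateChanged`); the class and method names are hypothetical, not existing Pinot code:

   ```java
   import java.util.concurrent.CompletableFuture;
   import java.util.concurrent.ExecutionException;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.TimeoutException;

   import io.grpc.ConnectivityState;
   import io.grpc.ManagedChannel;

   public final class ChannelReadiness {

     // Hypothetical helper: block (with a timeout) until the channel reports
     // READY. getState(true) also asks the channel to start connecting if idle.
     public static boolean waitForReady(ManagedChannel channel, long timeoutMs)
         throws InterruptedException {
       CompletableFuture<Void> ready = new CompletableFuture<>();
       watch(channel, channel.getState(true), ready);
       try {
         ready.get(timeoutMs, TimeUnit.MILLISECONDS);
         return true;
       } catch (TimeoutException | ExecutionException e) {
         return false;
       }
     }

     // notifyWhenStateChanged fires once when the state moves away from the
     // given source state, so we re-register until we observe READY.
     private static void watch(ManagedChannel channel, ConnectivityState state,
         CompletableFuture<Void> ready) {
       if (state == ConnectivityState.READY) {
         ready.complete(null);
         return;
       }
       channel.notifyWhenStateChanged(state,
           () -> watch(channel, channel.getState(true), ready));
     }
   }
   ```

   If a check like this ran before `BrokerRoutingManager` re-includes the server, routing inclusion would line up with the moment the dispatcher can actually reach the server, rather than with the Helix health flip.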
   
   Here are the failure detector logs to illustrate this:
   
   ```
   # server marked unhealthy after kill -9
   2025-12-17 23:43:02.619 WARN [BaseExponentialBackoffRetryFailureDetector] 
[jersey-server-managed-async-executor-309:981] Mark server: Server_st6 as 
unhealthy
   
   # first gRPC connection check (subsequent checks omitted for brevity)
   2025-12-17 23:43:07.710 INFO [BaseExponentialBackoffRetryFailureDetector] 
[failure-detector-retry:172] Retry unhealthy server: Server_st6
   2025-12-17 23:43:07.710 INFO [SingleConnectionBrokerRequestHandler] 
[failure-detector-retry:172] Retrying unhealthy server: Server_st6
   2025-12-17 23:43:07.710 INFO [MultiStageBrokerRequestHandler] 
[failure-detector-retry:172] Checking gRPC connection to unhealthy server: 
Server_st6
   2025-12-17 23:43:07.710 INFO [QueryDispatcher] [failure-detector-retry:172] 
Still can't connect to server: Server_st6, current state: TRANSIENT_FAILURE
   
   # server included in routing after it comes up healthy
   2025-12-17 23:46:37.483 INFO [BrokerRoutingManager] 
[ClusterChangeHandlingThread:225] Got excluded server: Server_st6 re-enabled, 
including it into the routing
   
   # failure detector confirms connection was successful minutes later
   2025-12-17 23:48:17.711 INFO [MultiStageBrokerRequestHandler] 
[failure-detector-retry:172] Checking gRPC connection to unhealthy server: 
Server_st6
   2025-12-17 23:48:17.711 INFO [QueryDispatcher] [failure-detector-retry:172] 
Successfully connected to server: Server_st6
   2025-12-17 23:48:17.712 INFO [BaseExponentialBackoffRetryFailureDetector] 
[failure-detector-retry:172] Mark server: Server_st6 as healthy
   2025-12-17 23:48:17.712 INFO [BrokerRoutingManager] 
[failure-detector-retry:172] Including server: Server_st6 to routing
   2025-12-17 23:48:17.712 INFO [BrokerRoutingManager] 
[failure-detector-retry:172] Server: Server_st6 is not previously excluded, 
skipping updating the routing
   ```
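
   One possible mitigation, given that the broker already learns from Helix that the server is back (the 23:46:37 re-enable event above), is to short-circuit gRPC's backoff at that point. grpc-java exposes `ManagedChannel#resetConnectBackoff()` (an experimental API) for this: it clears the backoff timer on subchannels in TRANSIENT_FAILURE so they reconnect immediately. A minimal sketch, where `onServerReEnabled` and the channel lookup are hypothetical stand-ins for the broker's actual wiring:

   ```java
   import io.grpc.ConnectivityState;
   import io.grpc.ManagedChannel;

   public final class BackoffReset {

     // Hypothetical hook, invoked when Helix reports the server healthy again
     // (the "re-enabled, including it into the routing" event in the log above).
     public static void onServerReEnabled(ManagedChannel channel) {
       if (channel.getState(false) == ConnectivityState.TRANSIENT_FAILURE) {
         // Skip the remaining (up to ~2-minute) exponential backoff delay and
         // make the subchannel attempt to reconnect immediately.
         channel.resetConnectBackoff();
       }
     }
   }
   ```

   This would not remove the need for a readiness check like the one sketched above, but it should shrink the failure window from minutes to roughly one connection attempt.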

