Let me check whether tuning the backoff helps, but the client never recovers on its own: we noticed that one of the servers did not recover for 36 hours, until we restarted the service.
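In the meantime, here is a rough sketch of what we are considering, combining options 2 and 4 from your list (ManagedChannel.resetConnectBackoff plus notifyWhenStateChanged). The target string, keepalive interval, and class name are placeholders for our setup, not anything prescribed by grpc-java:

```java
import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.TimeUnit;

public final class S2ChannelFactory {

  // Placeholder target; substitute the real headless service name and port.
  private static final String TARGET =
      "dns:///s2-headless.default.svc.cluster.local:50051";

  public static ManagedChannel create() {
    ManagedChannel channel = ManagedChannelBuilder.forTarget(TARGET)
        .usePlaintext()
        // Client-side keepalive, as in the original setup.
        .keepAliveTime(30, TimeUnit.SECONDS)
        .build();
    watchState(channel);
    return channel;
  }

  // Re-registers itself on every state change. Whenever the channel lands in
  // TRANSIENT_FAILURE, clear the connect-backoff timer so grpc-java
  // re-resolves DNS and retries the connection instead of waiting out the
  // (growing) backoff interval. A production version would likely rate-limit
  // the reset rather than firing on every transition.
  private static void watchState(ManagedChannel channel) {
    ConnectivityState state = channel.getState(/* requestConnection= */ true);
    if (state == ConnectivityState.TRANSIENT_FAILURE) {
      channel.resetConnectBackoff();
    }
    channel.notifyWhenStateChanged(state, () -> watchState(channel));
  }
}
```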
On Friday, August 22, 2025 at 1:42:33 AM UTC-7 Kannan Jayaprakasam wrote:
> The gRPC Java client would have tried refreshing the IP addresses, but what
> must have happened is a timing issue in the headless service's scale-up.
> When the old set of pods went down, they would have sent a GOAWAY on the
> connections the client had established (and since you use *keepalive* on
> the client, it is all the more likely the GOAWAY was not lost to the
> client). That would have immediately triggered a re-resolution, which could
> still have returned the old addresses because the new pods may not have
> come up yet, so the connections fail and the RPCs fail. After all the
> addresses fail to connect, name re-resolution is triggered again and a
> re-connection is scheduled after a backoff time dictated by the connection
> backoff policy
> <https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md> in
> your service config. You can try one of the following options.
> 1. Using the connection backoff policy to wait for a shorter time, or
> configuring retry policies so RPCs wait longer.
> 2. Forcing a channel reconnect with ManagedChannel.resetConnectBackoff to
> reset the backoff timer and cause a re-resolution and reconnect.
> 3. Using waitForReady in CallOptions so the RPC waits for the channel to
> become ready.
> 4. Actively polling the channel state with ManagedChannel.getState or
> ManagedChannel.notifyWhenStateChanged to know when the channel becomes
> READY.
>
> On Thursday, August 21, 2025 at 1:06:14 AM UTC+5:30 Maksim Likharev wrote:
>
>> I'm observing the following behavior: Service S1 (a Java microservice)
>> communicates with Service S2 (a Java microservice) using gRPC unary calls,
>> and both services run in k8s. The gRPC client in S1 uses keepalive and
>> resolves a headless Service (which returns multiple IP addresses). After
>> scaling S2 down and then back up, the gRPC client in S1 stops
>> communicating, failing with UNAVAILABLE errors, and the logs indicate it
>> keeps using stale IP addresses.
>> The problem does not resolve until S1 is restarted. The k8s headless
>> service has the correct IP addresses, and name resolution from the pod
>> (nslookup/dig) shows the correct IPs as well, so this is not an
>> infrastructure problem.
>>
>> What could be causing this, and how can I force the gRPC client to
>> refresh its DNS cache?
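For anyone hitting the same issue, a hedged sketch of options 1 and 3 from the reply quoted above: a per-method retry policy supplied via the default service config, plus waitForReady on the stub. The service name, backoff numbers, and generated stub class are made up for illustration:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.List;
import java.util.Map;

public final class S2ChannelWithRetries {

  public static ManagedChannel create() {
    // Retry UNAVAILABLE a few times with exponential backoff. Numbers in the
    // service config map are Doubles and durations are strings, mirroring the
    // JSON representation of the service config.
    Map<String, Object> retryPolicy = Map.of(
        "maxAttempts", 5.0,
        "initialBackoff", "0.5s",
        "maxBackoff", "10s",
        "backoffMultiplier", 2.0,
        "retryableStatusCodes", List.of("UNAVAILABLE"));

    // "my.package.S2Service" is a placeholder for the real proto service name.
    Map<String, Object> methodConfig = Map.of(
        "name", List.of(Map.of("service", "my.package.S2Service")),
        "retryPolicy", retryPolicy);

    ManagedChannel channel = ManagedChannelBuilder
        .forTarget("dns:///s2-headless.default.svc.cluster.local:50051")
        .usePlaintext()
        .defaultServiceConfig(Map.of("methodConfig", List.of(methodConfig)))
        .enableRetry()
        .build();

    // Option 3: with the generated stub (class name hypothetical), waitForReady
    // queues the RPC until the channel is READY instead of failing fast with
    // UNAVAILABLE while the resolver still holds stale addresses:
    //   S2ServiceGrpc.newBlockingStub(channel).withWaitForReady().someUnaryCall(req);

    return channel;
  }
}
```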