> Just curious, how has this been determined that the GOAWAY frame wasn't received? Also what are your values of MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE ?

MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE were infinite, but this week I changed MAX_CONNECTION_AGE to 5 minutes.

I followed this documentation to display the gRPC logs and see the GOAWAY signal:
https://github.com/grpc/grpc/blob/v1.25.x/TROUBLESHOOTING.md
https://github.com/grpc/grpc/blob/master/doc/environment_variables.md
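For reference, a minimal sketch of how those logs can be enabled, using the GRPC_VERBOSITY and GRPC_TRACE environment variables from that documentation (the tracer list here is just an example, adjust it for your gRPC version):

# Enable gRPC core debug logging before the library is loaded, so that
# HTTP/2 frames such as GOAWAY show up in the client logs.
ENV['GRPC_VERBOSITY'] = 'DEBUG'
ENV['GRPC_TRACE'] = 'http,connectivity_state'

require 'grpc'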
To reproduce the error, I set up a channel without round-robin load balancing (only one subchannel):

ExampleService::Stub.new("headless-test-grpc-master.test-grpc.svc.cluster.local:50051", :this_channel_is_insecure, timeout: 5)

Then I repeatedly kill the server pod my client is connected to. When I see in the logs that the GOAWAY signal is received, a reconnection occurs without any error in my requests. But when the reception of the GOAWAY signal is not logged, no reconnection occurs and I receive a bunch of DeadlineExceeded errors for several minutes. The errors still occur even if I create a new channel. However, if I recreate the channel adding "dns:" at the beginning of the host, it works:

ExampleService::Stub.new("dns:headless-test-grpc-master.test-grpc.svc.cluster.local:50051", :this_channel_is_insecure, timeout: 5)

The opposite is also true: if I create the channel with "dns:" at the beginning of the host, it can lead to the same failure, and I am then able to create a working channel by removing the "dns:" from the beginning of the host.

*Have you already heard of this kind of issue? Is there some cache in the DNS resolver?*

> A guess: one possible thing to look for is if IP packets to/from the pod's address stopped forwarding, rendering the TCP connection to it a "black hole". In that case, a grpc client will, by default, realize that a connection is bad only after the TCP connection times out (typically ~15 minutes). You may set keepalive parameters to notice the brokenness of such connections faster -- see references to keepalive in https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md for more details.

Yes, it is as if the requests go to a black hole. And as you said, it naturally fixes itself after around 15 minutes. I will add a client-side keepalive to make it shorter. But even with 1 minute instead of 15, I need to find another workaround in order to avoid a degraded service for my customers.
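For reference, a minimal sketch of the client-side keepalive I have in mind, using the standard gRPC core keepalive channel arguments on the same stub as above (the values are only illustrative, not recommendations):

require 'grpc'

# Same stub as above, with client-side keepalive so that a "black-holed"
# connection is noticed after about a minute instead of the ~15 minute
# TCP timeout.
stub = ExampleService::Stub.new(
  "dns:headless-test-grpc-master.test-grpc.svc.cluster.local:50051",
  :this_channel_is_insecure,
  timeout: 5,
  channel_args: {
    'grpc.keepalive_time_ms'              => 60_000, # send a keepalive ping every 60 s
    'grpc.keepalive_timeout_ms'           => 10_000, # consider the connection dead if the ping is not acked within 10 s
    'grpc.keepalive_permit_without_calls' => 1       # keep pinging even when no RPC is in flight
  }
)

If I understand the keepalive documentation correctly, pinging more often than the server permits can itself trigger a GOAWAY with too_many_pings, so the server-side enforcement settings may need to be relaxed accordingly.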
Thank you.

On Tuesday, December 22, 2020 at 21:34:32 UTC+1, [email protected] wrote:

> > It happens that sometimes, the GOAWAY signal isn't received by the client.
>
> Just curious, how has this been determined that the GOAWAY frame wasn't received? Also what are your values of MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE ?
>
> A guess: one possible thing to look for is if IP packets to/from the pod's address stopped forwarding, rendering the TCP connection to it a "black hole". In that case, a grpc client will, by default, realize that a connection is bad only after the TCP connection times out (typically ~15 minutes). You may set keepalive parameters to notice the brokenness of such connections faster -- see references to keepalive in https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md for more details.
>
> On Tuesday, December 22, 2020 at 11:30:44 AM UTC-8 Emmanuel Delmas wrote:
>
>> Thank you. I've set up MAX_CONNECTION_AGE and it seems to work well.
>>
>> I was looking for a way to refresh the name resolution because I'm facing another issue. It happens that sometimes, the GOAWAY signal isn't received by the client. In this case, I receive a bunch of DeadlineExceeded errors, the client still sending messages to a deleted Kubernetes pod. I wanted to trigger a refresh at this time but I understand it is not possible.
>>
>> Have you already encountered this kind of issue?
>> Do you have any advice for handling a GOAWAY signal that was not received?
>>
>> On Monday, December 21, 2020 at 19:42:17 UTC+1, [email protected] wrote:
>>
>>> > "But when I create new pods after the connection or a reconnection, calls are not load balanced on these new servers."
>>>
>>> Can you elaborate a bit on what exactly is done here and the expected behavior?
>>>
>>> In general, one thing to note about gRPC's client channel/stub is that a client will not refresh the name resolution process unless it encounters a problem with the current connection(s) that it has. So for example if the following events happen:
>>> 1) client stub resolves headless-test-grpc-master.test-grpc.svc.cluster.local in DNS, to addresses 1.1.1.1, 2.2.2.2, and 3.3.3.3
>>> 2) client stub establishes connections to 1.1.1.1, 2.2.2.2, and 3.3.3.3, and begins round robining RPCs across them
>>> 3) a new host, 4.4.4.4, starts up, and is added behind the headless-test-grpc-master.test-grpc.svc.cluster.local DNS name
>>>
>>> Then the client will continue to just round robin its RPCs across 1.1.1.1, 2.2.2.2, and 3.3.3.3 indefinitely -- so long as it doesn't encounter a problem with those connections. It will only re-query the DNS, and so learn about 4.4.4.4, if it encounters a problem.
>>>
>>> There's some possibly interesting discussion about this behavior in https://github.com/grpc/grpc/issues/12295 and in https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md.
>>>
>>> On Thursday, December 3, 2020 at 8:57:03 AM UTC-8 Emmanuel Delmas wrote:
>>>
>>>> Hi
>>>>
>>>> *Question*
>>>> I'm wondering how to refresh the IP list in order to update the subchannel list after creating a gRPC channel in Ruby using DNS resolution (which created several subchannels).
>>>>
>>>> *Context*
>>>> I set up gRPC communication between our services in a Kubernetes environment two years ago, but we are facing issues after pods restart.
>>>>
>>>> I set up a Kubernetes headless service (in order to get all pod IPs from the DNS).
>>>> I managed to use load balancing with the following piece of code:
>>>> stub = ExampleService::Stub.new("headless-test-grpc-master.test-grpc.svc.cluster.local:50051", :this_channel_is_insecure, timeout: 5, channel_args: {'grpc.lb_policy_name' => 'round_robin'})
>>>>
>>>> But when I create new pods after the connection or a reconnection, calls are not load balanced on these new servers.
>>>> That's why I'm wondering what I should do to make the gRPC resolver refresh the list of IPs and create the expected new subchannels.
>>>>
>>>> Is this achievable? Which configuration should I use?
>>>>
>>>> Thanks for your help
>>>>
>>>> *Emmanuel Delmas*
>>>> Backend Developer
>>>> CSE Member
>>>> https://github.com/papa-cool
