Hi Chen

> Given that the client is doing client-side LB with round_robin, is setting max_connection_age on the server-side the right way to solve this problem? Will clients be able to refresh and reconnect automatically, or do we need to recreate the client (the underlying channel) periodically?

I set max_connection_age on the server side and it works well. Nothing else needs to be done on the client side. When max_connection_age is reached, the server sends a GOAWAY signal to the client. Each time the client receives a GOAWAY, it automatically refreshes its DNS resolution and creates connections to the new servers as well as a replacement for the connection that was closed.
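
For reference, a minimal sketch of that server-side setting with the grpc Ruby gem could look like this (the values are only examples, and ExampleServiceImpl is a hypothetical implementation of the ExampleService used elsewhere in this thread):

  require 'grpc'

  # Minimal sketch: a server that closes client connections after ~5 minutes,
  # giving in-flight RPCs up to 30 seconds to finish before the connection is torn down.
  server = GRPC::RpcServer.new(
    server_args: {
      'grpc.max_connection_age_ms'       => 5 * 60 * 1000,
      'grpc.max_connection_age_grace_ms' => 30 * 1000
    }
  )
  server.add_http2_port('0.0.0.0:50051', :this_port_is_insecure)
  server.handle(ExampleServiceImpl.new)  # hypothetical handler for ExampleService::Service
  server.run_till_terminated
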
> Also, the GOAWAY signal is random. Do client implementations need to handle this in particular?

What do you mean exactly? I'm not sure I can answer that point.

Regards

*Emmanuel Delmas*
Backend Developer
CSE Member
LinkedIn <https://www.linkedin.com/in/emmanueldelmasisep/>
19 rue Blanche, 75009 Paris, France

On Wed, Sep 1, 2021 at 01:43, Chen Song <[email protected]> wrote:

> I want to follow up on this thread, as we have similar requirements (force clients to refresh server addresses from the DNS resolver as new pods are launched on K8s), but our client is in Python.
>
> Given that the client is doing client-side LB with round_robin, is setting max_connection_age on the server-side the right way to solve this problem? Will clients be able to refresh and reconnect automatically, or do we need to recreate the client (the underlying channel) periodically?
> Also, the GOAWAY signal is random. Do client implementations need to handle this in particular?
>
> Chen
>
> On Wednesday, December 23, 2020 at 4:50:31 AM UTC-5 Emmanuel Delmas wrote:
>
>> > Just curious, how was it determined that the GOAWAY frame wasn't received? Also, what are your values of MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE?
>>
>> MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE were infinite, but this week I changed MAX_CONNECTION_AGE to 5 minutes.
>>
>> I followed this documentation to display gRPC logs and see the GOAWAY signal:
>> https://github.com/grpc/grpc/blob/v1.25.x/TROUBLESHOOTING.md
>> https://github.com/grpc/grpc/blob/master/doc/environment_variables.md
>>
>> To reproduce the error, I set up a channel without round robin load balancing (only one subchannel):
>> ExampleService::Stub.new("headless-test-grpc-master.test-grpc.svc.cluster.local:50051", :this_channel_is_insecure, timeout: 5)
>> Then I repeatedly kill the server pod my client is connected to. When I see in the logs that the GOAWAY signal is received, a reconnection occurs without any error in my requests. But when the reception of the GOAWAY signal is not logged, no reconnection occurs and I receive a bunch of DeadlineExceeded errors for several minutes.
>> The error still occurs even if I create a new channel. However, if I recreate the channel adding "dns:" at the beginning of the host, it works:
>> ExampleService::Stub.new("dns:headless-test-grpc-master.test-grpc.svc.cluster.local:50051", :this_channel_is_insecure, timeout: 5)
>> The opposite is also true: if I create the channel with "dns:" at the beginning of the host, it can lead to the same failure, and I am then able to create a working channel by removing the "dns:" at the beginning of the host.
>>
>> *Have you already heard of this kind of issue? Is there some cache in the DNS resolver?*
>>
>> > A guess: one possible thing to look for is whether IP packets to/from the pod's address stopped being forwarded, rendering the TCP connection to it a "black hole". In that case, a gRPC client will, by default, realize that a connection is bad only after the TCP connection times out (typically ~15 minutes). You may set keepalive parameters to notice the brokenness of such connections faster -- see references to keepalive in https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md for more details.
>>
>> Yes, it is as if requests go into a black hole. And as you said, it naturally fixes itself after around 15 minutes. I will add a client-side keepalive to make it shorter.
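>>
>> A minimal sketch of such a client-side keepalive with the grpc Ruby gem could look like this (the values are only examples, not a recommendation):
>>
>>   stub = ExampleService::Stub.new(
>>     "headless-test-grpc-master.test-grpc.svc.cluster.local:50051",
>>     :this_channel_is_insecure,
>>     timeout: 5,
>>     channel_args: {
>>       'grpc.keepalive_time_ms'              => 60_000, # send an HTTP/2 keepalive ping every 60s
>>       'grpc.keepalive_timeout_ms'           => 10_000, # treat the connection as dead if no ack within 10s
>>       'grpc.keepalive_permit_without_calls' => 1       # keep pinging even when no RPC is in flight
>>     }
>>   )
>>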
>> But even with 1 minute instead of 15, I need to find another workaround in order to avoid degraded service for my customers.
>>
>> Thank you.
>>
>> On Tuesday, December 22, 2020 at 9:34:32 PM UTC+1, [email protected] wrote:
>>
>>> > It happens that sometimes, the GOAWAY signal isn't received by the client.
>>>
>>> Just curious, how was it determined that the GOAWAY frame wasn't received? Also, what are your values of MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE?
>>>
>>> A guess: one possible thing to look for is whether IP packets to/from the pod's address stopped being forwarded, rendering the TCP connection to it a "black hole". In that case, a gRPC client will, by default, realize that a connection is bad only after the TCP connection times out (typically ~15 minutes). You may set keepalive parameters to notice the brokenness of such connections faster -- see references to keepalive in https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md for more details.
>>>
>>> On Tuesday, December 22, 2020 at 11:30:44 AM UTC-8 Emmanuel Delmas wrote:
>>>
>>>> Thank you. I've set up MAX_CONNECTION_AGE and it seems to work well.
>>>>
>>>> I was looking for a way to refresh the name resolution because I'm facing another issue.
>>>> It happens that sometimes the GOAWAY signal isn't received by the client.
>>>> In this case, I receive a bunch of DeadlineExceeded errors, as the client keeps sending messages to a deleted Kubernetes pod.
>>>> I wanted to trigger a refresh at that point, but I understand it is not possible.
>>>>
>>>> Have you ever run into this kind of issue?
>>>> Do you have any advice for handling a GOAWAY signal that is never received?
>>>>
>>>> On Monday, December 21, 2020 at 7:42:17 PM UTC+1, [email protected] wrote:
>>>>
>>>>> > "But when I create new pods after the connection or a reconnection, calls are not load balanced on these new servers."
>>>>>
>>>>> Can you elaborate a bit on what exactly is done here and the expected behavior?
>>>>>
>>>>> One thing to note about gRPC's client channel/stub is that, in general, a client will not refresh name resolution unless it encounters a problem with the current connection(s) that it has. So, for example, if the following events happen:
>>>>> 1) the client stub resolves headless-test-grpc-master.test-grpc.svc.cluster.local in DNS, to addresses 1.1.1.1, 2.2.2.2, and 3.3.3.3
>>>>> 2) the client stub establishes connections to 1.1.1.1, 2.2.2.2, and 3.3.3.3, and begins round-robining RPCs across them
>>>>> 3) a new host, 4.4.4.4, starts up and is added behind the headless-test-grpc-master.test-grpc.svc.cluster.local DNS name
>>>>>
>>>>> Then the client will continue to just round-robin its RPCs across 1.1.1.1, 2.2.2.2, and 3.3.3.3 indefinitely -- as long as it doesn't encounter a problem with those connections. It will only re-query the DNS, and so learn about 4.4.4.4, if it encounters a problem.
>>>>>
>>>>> There's some possibly interesting discussion about this behavior in https://github.com/grpc/grpc/issues/12295 and in https://github.com/grpc/proposal/blob/master/A9-server-side-conn-mgt.md.
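>>>>>
>>>>> To make step 1 concrete: a headless Kubernetes service publishes one DNS A record per pod, and that is the address list the channel's resolver sees at connection time. A minimal Ruby sketch using only the standard library (the service name is the one from this thread; the printed addresses are made up):
>>>>>
>>>>>   require 'resolv'
>>>>>
>>>>>   # Each backing pod shows up as one A record behind the headless service name.
>>>>>   addresses = Resolv::DNS.open do |dns|
>>>>>     dns.getaddresses('headless-test-grpc-master.test-grpc.svc.cluster.local').map(&:to_s)
>>>>>   end
>>>>>   p addresses  # e.g. ["10.0.1.12", "10.0.2.7", "10.0.3.21"]
>>>>>
>>>>> A channel created against that name snapshots this list; a pod added afterwards is only picked up once the client re-resolves, e.g. when a connection breaks or a GOAWAY is received.
>>>>>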
>>>>> On Thursday, December 3, 2020 at 8:57:03 AM UTC-8 Emmanuel Delmas wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> *Question*
>>>>>> I'm wondering how to refresh the list of IPs, and so update the subchannel list, after creating a gRPC channel in Ruby using DNS resolution (which created several subchannels).
>>>>>>
>>>>>> *Context*
>>>>>> I set up gRPC communication between our services in a Kubernetes environment two years ago, but we are facing issues after pod restarts.
>>>>>>
>>>>>> I've set up a Kubernetes headless service (in order to get all pod IPs from the DNS).
>>>>>> I've managed to use load balancing with the following piece of code:
>>>>>> stub = ExampleService::Stub.new("headless-test-grpc-master.test-grpc.svc.cluster.local:50051", :this_channel_is_insecure, timeout: 5, channel_args: {'grpc.lb_policy_name' => 'round_robin'})
>>>>>>
>>>>>> But when I create new pods after the connection, or after a reconnection, calls are not load balanced onto these new servers.
>>>>>> That's why I'm wondering what I should do to make the gRPC resolver refresh the list of IPs and create the expected new subchannels.
>>>>>>
>>>>>> Is this achievable? Which configuration should I use?
>>>>>>
>>>>>> Thanks for your help
>>>>>>
>>>>>> *Emmanuel Delmas*
>>>>>> Backend Developer
>>>>>> CSE Member
>>>>>> https://github.com/papa-cool
