I’m glad you got it working. Something doesn’t seem right, though: going from 1 sec to 2 sec shouldn’t be the difference between failing and working. I would think that production anomalies could easily cause similar degradation, so this solution may be fragile. Still, I’m not sure I 100% get the problem you’re having, but if you understand exactly why it works, that’s good enough for me :)
> On Nov 21, 2018, at 11:09 AM, justin.cheuvr...@ansys.com wrote:
>
> I found a fix for my problem.
>
> I looked at the latest source and there was a test argument,
> "grpc.testing.fixed_reconnect_backoff_ms", that sets both the min and max
> backoff times to that value. I found the source for 1.3.9 on a machine and it
> was being used there as well. Setting that argument to 2000 ms does what I
> wanted. 1000 ms seemed to be too low a value; the RPC calls continued to
> fail. 2000 seems to reconnect just fine. Once we update to a newer version of
> gRPC I can change that to set the min and max backoff times directly.
>
>> On Wednesday, November 21, 2018 at 11:24:58 AM UTC-5, Robert Engels wrote:
>>
>> Sorry. Still, if you forcibly remove the cable or hard-shut the server, the
>> client can’t tell whether the server is down without some sort of ping-pong
>> protocol with a timeout. The TCP/IP timeout is on the order of hours, and
>> minutes if keep-alive is set.
>>
>>> On Nov 21, 2018, at 10:20 AM, justin.c...@ansys.com wrote:
>>>
>>> That must have been a different person :). I'm actually taking down the
>>> server and restarting it, not simulating it.
>>>
>>>> On Wednesday, November 21, 2018 at 11:17:24 AM UTC-5, Robert Engels wrote:
>>>>
>>>> I thought your original message said you were simulating the server going
>>>> down using iptables and causing packet loss?
>>>>
>>>>> On Nov 21, 2018, at 9:52 AM, justin.c...@ansys.com wrote:
>>>>>
>>>>> I'm not sure I follow you on that one. I am taking the server up and down
>>>>> myself. Everything works fine if I just make RPC calls on the client and
>>>>> check the error codes. The problem was the 20 seconds of blocking on
>>>>> secondary RPC calls during the reconnect, which seems to be due to the
>>>>> backoff algorithm. I was hoping to shrink that wait to something smaller.
>>>>> Setting GRPC_ARG_MAX_RECONNECT_BACKOFF_MS to 5000 still seemed to take
>>>>> the full 20 seconds when making an RPC call.
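The fixed-backoff fix above can be illustrated numerically. This is a minimal sketch, not gRPC code: the constants follow gRPC's published connection-backoff policy (initial backoff 1 s, multiplier 1.6, cap 120 s), and the policy's random jitter is left out for clarity. Pinning min == max, which is what "grpc.testing.fixed_reconnect_backoff_ms" does, collapses the whole schedule to one constant delay.

```python
def backoff_schedule(attempts, initial_ms=1000, multiplier=1.6, max_ms=120_000):
    """Return the delay (ms) before each of the next `attempts` reconnects,
    per gRPC's exponential connection-backoff policy (jitter omitted)."""
    delays, current = [], float(initial_ms)
    for _ in range(attempts):
        delays.append(round(min(current, max_ms)))
        current *= multiplier
    return delays

# Default policy: delays grow quickly, so a channel that has been failing
# for a while can wait many seconds before it even tries to reconnect.
print(backoff_schedule(6))   # [1000, 1600, 2560, 4096, 6554, 10486]

# Fixed 2000 ms backoff (min == max == 2000): every retry waits exactly 2 s.
print(backoff_schedule(6, initial_ms=2000, max_ms=2000))   # [2000] * 6
```

This also suggests why 1000 ms "seemed too low" in the thread: a shorter fixed backoff means the channel may still be mid-handshake when the next RPC is attempted, whereas 2000 ms leaves room for the reconnect to complete.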
>>>>>
>>>>> Using GetState on the channel looked like it would get rid of the
>>>>> blocking on a broken connection, but the state of the channel doesn't
>>>>> seem to change from TRANSIENT_FAILURE once the server comes back up. I
>>>>> tried KEEPALIVE_TIME, KEEPALIVE_TIMEOUT, and
>>>>> KEEPALIVE_PERMIT_WITHOUT_CALLS, but those didn't seem to trigger a
>>>>> state change on the channel either.
>>>>>
>>>>> It seems the only way to trigger a state change on the channel is to
>>>>> make an actual RPC call.
>>>>>
>>>>> I think the answer might just be to update to a newer version of gRPC,
>>>>> look at using the MIN_RECONNECT_BACKOFF channel arg, and probably
>>>>> download the source to see how those variables are used :).
>>>>>
>>>>>> On Wednesday, November 21, 2018 at 10:16:45 AM UTC-5, Robert Engels wrote:
>>>>>>
>>>>>> The other thing to keep in mind is that the way you are "forcing
>>>>>> failure" is error-prone: the connection is valid as long as packets
>>>>>> are making it through; it is just very slow due to extreme packet
>>>>>> loss. I am not sure this is considered a failure by gRPC. I think you
>>>>>> would need to detect slow network connections and abort that server
>>>>>> yourself.
>>>>>>
>>>>>>> On Nov 21, 2018, at 9:12 AM, justin.c...@ansys.com wrote:
>>>>>>>
>>>>>>> I do check the error code after each update and skip the rest of the
>>>>>>> current iteration's updates if a failure occurred.
>>>>>>>
>>>>>>> I could skip all updates for 20 seconds after a failure, but that
>>>>>>> seems less than ideal.
>>>>>>>
>>>>>>> By "server available" I meant using GetState on the channel. The
>>>>>>> problem I was running into was that if I only call GetState on the
>>>>>>> channel to see if the server is around, it "forever" stays in
>>>>>>> TRANSIENT_FAILURE (at least for 60 seconds). I was expecting to see a
>>>>>>> state change back to IDLE/READY after a bit.
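Since the channel state only appears to move when an RPC is actually attempted, one workaround is to keep the "server available" decision entirely client-side: after a failed call, skip updates until a cooldown has elapsed, then probe with a real RPC. This is a hypothetical helper, not a gRPC API; the thread uses the C++ API, but the idea is language-independent, sketched here in Python with an injectable clock so it can be exercised without sleeping:

```python
import time

class ServerGate:
    """Client-side 'server available' check: after an RPC error, treat the
    server as down for cooldown_s seconds rather than trusting channel
    state, which can sit in TRANSIENT_FAILURE until the next real call."""

    def __init__(self, cooldown_s=2.0, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._last_error = None  # timestamp of the most recent failure

    def record_error(self):
        self._last_error = self._clock()

    def record_success(self):
        self._last_error = None

    def should_try(self):
        """True if there is no recent failure, i.e. worth attempting an RPC."""
        return (self._last_error is None
                or self._clock() - self._last_error >= self.cooldown_s)

# Usage with a fake clock: a failure closes the gate for 2 s; the next
# RPC attempt after the cooldown acts as the probe that re-opens it.
now = [0.0]
gate = ServerGate(cooldown_s=2.0, clock=lambda: now[0])
gate.record_error()
print(gate.should_try())   # False: still inside the cooldown window
now[0] = 2.5
print(gate.should_try())   # True: time to attempt a probe RPC
```

This mirrors the "skip all updates for N ms after an error" idea from the thread, but with the window shrunk from 20 s to a configurable cooldown.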
>>>>>>>
>>>>>>>> On Tuesday, November 20, 2018 at 11:19:09 PM UTC-5, robert engels wrote:
>>>>>>>>
>>>>>>>> You should track the err after each update and, if non-nil, just
>>>>>>>> return; why keep trying the further updates in that loop?
>>>>>>>>
>>>>>>>> It is also trivial to not even attempt the next loop iteration if it
>>>>>>>> has been less than N ms since the last error.
>>>>>>>>
>>>>>>>> According to your pseudocode, you already have the "server
>>>>>>>> available" status.
>>>>>>>>
>>>>>>>>> On Nov 20, 2018, at 9:22 PM, justin.c...@ansys.com wrote:
>>>>>>>>>
>>>>>>>>> gRPC version: 1.3.9
>>>>>>>>> Platform: Windows
>>>>>>>>>
>>>>>>>>> I'm working on a prototype application that periodically calculates
>>>>>>>>> data and then, in a multi-step process, pushes the data to a
>>>>>>>>> server. The design is that the server doesn't need to be up, or can
>>>>>>>>> go down mid-process. The client should not block (or block as
>>>>>>>>> little as possible) between updates if there is a problem pushing
>>>>>>>>> data.
>>>>>>>>>
>>>>>>>>> A simple model for the client would be:
>>>>>>>>>
>>>>>>>>> Loop Until Done
>>>>>>>>> {
>>>>>>>>>     Calculate Data
>>>>>>>>>     If Server Available and No Error: Begin Update
>>>>>>>>>     If Server Available and No Error: UpdateX (Optional)
>>>>>>>>>     If Server Available and No Error: UpdateY (Optional)
>>>>>>>>>     If Server Available and No Error: UpdateZ (Optional)
>>>>>>>>>     If Server Available and No Error: End Update
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> The client doesn't care whether the server is available, but if it
>>>>>>>>> is, it should push data; on any error it should skip everything
>>>>>>>>> else until the next update.
>>>>>>>>>
>>>>>>>>> The problem is that if I make a call on the client while the server
>>>>>>>>> isn't available, the first call fails very quickly (~1 sec) but the
>>>>>>>>> rest take a "long" time, ~20 sec. It looks like this is due to the
>>>>>>>>> reconnect backoff time.
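The pseudocode above amounts to short-circuiting the per-iteration pipeline on the first error, which is also what the reply suggests. A minimal sketch of that control flow (the step names mirror the pseudocode; `call` is a stand-in for a real stub method, and a real client would inspect the gRPC status code rather than a boolean):

```python
def push_update(steps):
    """Run one iteration's update steps in order, stopping at the first
    failure so later steps never block on a reconnecting channel.
    `steps` is a list of (name, call) pairs; each call returns True on
    success."""
    completed = []
    for name, call in steps:
        if not call():
            return completed, name  # skip the rest until the next iteration
        completed.append(name)
    return completed, None

# Example: the server drops out between UpdateX and UpdateY, so UpdateZ
# and EndUpdate are never attempted this iteration.
steps = [
    ("BeginUpdate", lambda: True),
    ("UpdateX",     lambda: True),
    ("UpdateY",     lambda: False),  # simulated RPC failure
    ("UpdateZ",     lambda: True),
    ("EndUpdate",   lambda: True),
]
done, failed = push_update(steps)
print(done)    # ['BeginUpdate', 'UpdateX']
print(failed)  # UpdateY
```

With this shape, only the first call after an outage pays the reconnect wait; every subsequent step in the same iteration is skipped outright.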
>>>>>>>>> I tried setting GRPC_ARG_MAX_RECONNECT_BACKOFF_MS in the channel
>>>>>>>>> args to a lower value (2000), but that didn't have any positive
>>>>>>>>> effect.
>>>>>>>>>
>>>>>>>>> I tried using GetState(true) on the channel to determine whether we
>>>>>>>>> need to skip an update. This call returns very quickly, but the
>>>>>>>>> channel never seems to get out of the TRANSIENT_FAILURE state after
>>>>>>>>> the server is restarted (I waited over 60 seconds). From the
>>>>>>>>> documentation, it looked like the try_to_connect param of GetState
>>>>>>>>> only triggers a reconnect attempt when the channel is in the IDLE
>>>>>>>>> state.
>>>>>>>>>
>>>>>>>>> What is the best way to achieve the functionality we'd like?
>>>>>>>>>
>>>>>>>>> I noticed a new GRPC_ARG_MIN_RECONNECT_BACKOFF_MS option was added
>>>>>>>>> in a later version of gRPC; would that cause the gRPC call to "fail
>>>>>>>>> fast" if I upgraded and set it to a low value, ~1 sec?
>>>>>>>>>
>>>>>>>>> Is there a better way to handle this situation in general?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "grpc.io" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to grpc-io+u...@googlegroups.com.
>>>>>>>>> To post to this group, send email to grp...@googlegroups.com.
>>>>>>>>> Visit this group at https://groups.google.com/group/grpc-io.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/grpc-io/9fb7bf54-88fa-4781-8864-c9b2b06d5f0e%40googlegroups.com.
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.