Hello!

We've located where the client channel stops attempting to reconnect, but 
we haven't found how or why the c-ares resolver successfully passes an 
empty (0-address) list to the pick_first load balancer. What appears to 
happen is that pick_first hits this 0-address check 
<https://github.com/grpc/grpc/blob/v1.36.4/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc#L192> 
and reports TRANSIENT_FAILURE, after which the client channel never does 
anything further. We've seen the same freeze in v1.46.4 at the same 
subchannel list check 
<https://github.com/grpc/grpc/blob/v1.46.4/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc#L191>, 
but I don't know whether that if statement can actually be hit in 
practice, since my reproduction method is simply to pass an empty address 
list to that method. Getting c-ares itself to return an empty result and 
reproduce the behavior we're seeing has proven difficult, as most of the 
time the resolver behaves as you described.
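
To make the reproduction concrete, here is a minimal standalone sketch of 
the branch described above. It is a toy model, not gRPC code: 
ToyPickFirst, RecordingHelper, State, and LastStateAfterEmptyUpdate are 
hypothetical names invented for illustration, and the only behavior 
modeled is "empty address list in, TRANSIENT_FAILURE reported to the 
helper".

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT gRPC types.
enum class State { kIdle, kConnecting, kReady, kTransientFailure };

// Plays the role of the channel control helper: it only records the
// states it is told about, so a test can inspect them afterwards.
struct RecordingHelper {
  std::vector<State> updates;
  void UpdateState(State s) { updates.push_back(s); }
};

// Toy pick_first policy that models just the empty-address-list branch.
class ToyPickFirst {
 public:
  explicit ToyPickFirst(RecordingHelper* helper) : helper_(helper) {
    assert(helper_ != nullptr);
  }

  void Update(const std::vector<std::string>& addresses) {
    if (addresses.empty()) {
      // The branch under discussion: report failure and return.
      // Nothing in this model schedules another attempt.
      helper_->UpdateState(State::kTransientFailure);
      return;
    }
    // Non-empty list: the model just reports that it would start connecting.
    helper_->UpdateState(State::kConnecting);
  }

 private:
  RecordingHelper* helper_;
};

// Feeds one empty update and returns the last state the helper saw.
State LastStateAfterEmptyUpdate() {
  RecordingHelper helper;
  ToyPickFirst policy(&helper);
  policy.Update({});
  return helper.updates.back();
}
```

In this model an empty update produces a single TRANSIENT_FAILURE report 
and nothing else, so any recovery would have to be driven by whatever 
receives that report, which is what the question below is getting at.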

What normally listens for the UpdateState from the channel_control_helper? 
That might give us a good hint for why the client channel stops after that 
point.

Thanks!
Chi
Cisco Meraki
On Wednesday, August 17, 2022 at 11:35:47 AM UTC-6 Mark D. Roth wrote:

> Can you try running with the following environment variables set, and 
> share the log?  That might help us figure out what's going on here.
>
> GRPC_VERBOSITY=DEBUG
> GRPC_TRACE=client_channel_routing,pick_first,cares_resolver
>
> In general, the c-ares resolver should return an error when there's an 
> empty address list, so it should automatically retry the resolution 
> periodically until it succeeds.  The only exception I see in the code is if 
> there are balancer addresses successfully returned 
> <https://github.com/grpc/grpc/blob/9794038ae03842573517411df4ef6ac87a377be0/src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc#L325>,
> but that shouldn't be the case if you're using pick_first.  Unless maybe 
> you're using a service config in DNS, but the service config lookup is 
> failing also?
>
> Anyway, getting some additional logs will probably help us understand 
> what's going wrong here.
>
> On Wed, Aug 10, 2022 at 6:41 AM 'Peter Hurley' via grpc.io <
> grp...@googlegroups.com> wrote:
>
>> Thanks for the reply.
>>
>> > And would it be possible for you to upgrade your gRPC library and try 
>> to reproduce this? 
>> I didn't see any similar issue (marked fixed or not) in 
>> https://github.com/grpc/grpc/issues; we were hoping the community could 
>> confirm whether this has been observed and fixed already but went 
>> unreported in github.
>>
>> > v1.36.4 is over a year old, and a fair handful of bug fixes have gone 
>> in since then.
>> We're using the still-experimental TLSCredentials, so every version bump 
>> is non-trivial, and we've already found and fixed a number of core bugs 
>> ourselves, so it'll be a while before we upgrade again in production.
>>
>> > Regarding that, are you able to reproduce the conditions in which the 
>> failure occurs, or are they maybe not fully understood? e.g., run a local 
>> DNS server for testing, and modify its records.
>> Yeah, the exact conditions are not well understood, but the failure 
>> almost certainly happens during a restart of the local caching dnsmasq 
>> server due to intermittent connection loss.
>>
>>
>> On Fri, Aug 5, 2022 at 8:35 PM 'AJ Heller' via grpc.io <
>> grp...@googlegroups.com> wrote:
>>
>>> That's mysterious. Do you know what the state of the DNS records is 
>>> when this occurs? And would it be possible for you to upgrade your gRPC 
>>> library and try to reproduce this? v1.36.4 is over a year old, and a fair 
>>> handful of bug fixes have gone in since then.
>>>
>>>> We've been unable to reproduce this failure in testing, and would 
>>>> appreciate any pointers:
>>>>
>>>
>>> Regarding that, are you able to reproduce the conditions in which the 
>>> failure occurs, or are they maybe not fully understood? e.g., run a local 
>>> DNS server for testing, and modify its records.
>>>  
>>>
>>>>
>>>>    - what is supposed to re-kick a new DNS resolve if the server list 
>>>>    is empty?
>>>>    - where to check in the resolver code for an empty server list?
>>>>    - or any other ideas for how to track down the problem
>>>>
>>>>
>>>> We're using grpc v1.36.4 w/ libcares2 1.14
>>>>
>>>> Regards,
>>>> Peter Hurley
>>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "grpc.io" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to grpc-io+u...@googlegroups.com.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/grpc-io/306779dd-0a68-4b95-851e-0a5979a4e872n%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/grpc-io/306779dd-0a68-4b95-851e-0a5979a4e872n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>
>
> -- 
> Mark D. Roth <ro...@google.com>
> Software Engineer
> Google, Inc.
>

