Re: [grpc-io] Re: grpc stops forward progress if DNS resolve has 0 addresses

2022-08-31 Thread 'Mark D. Roth' via grpc.io
Looking at our code more closely, it looks like there is a bug here.  If
the resolver returns an error for the addresses on the very first
resolution attempt, we appear to get into a state where nothing will
trigger re-resolution.

This bug seems to have been here for a long time, so I'm surprised no
one has run into it until now.  It definitely needs to be fixed, but it'll
take a bit of work to make all the pieces work together the right way.  Can
you please file an issue and tag me on it?

Thanks very much for reporting this!

Re: [grpc-io] Re: grpc stops forward progress if DNS resolve has 0 addresses

2022-08-29 Thread 'Chi Jameson' via grpc.io
Hello!

We've been able to locate where the client channel stops attempting to
reconnect, but haven't found how or why the c-ares resolver successfully
passes a 0 address list to the pick_first load balancer. What appears to be
happening is that it hits this 0 addresses check and causes a
TRANSIENT_FAILURE, but then the client channel never responds beyond that.
We've seen this same freeze happen in v1.46.4 at the same subchannel list
check, but I don't know whether it's possible to hit that if statement in
practice, as the reproduction method I've used is to simply provide an
empty address list to that method. Trying to actually get an empty c-ares
result to reproduce the behavior we're seeing has proven to be difficult,
as most of the time the resolver behaves as you mentioned.
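
For readers following along, the retry behavior referred to here (an empty
address list is reported as an error, and resolution is retried periodically
until it succeeds) can be modeled with a small toy loop. This is only an
illustrative sketch in Python, not gRPC's actual resolver code; the names
`resolve_with_retry` and `EmptyAddressList`, the parameters, and the backoff
values are all hypothetical.

```python
import time
from typing import Callable, List


class EmptyAddressList(Exception):
    """Raised when a lookup succeeds but yields no addresses (hypothetical)."""


def resolve_with_retry(
    lookup: Callable[[], List[str]],
    max_attempts: int = 5,
    initial_backoff: float = 1.0,
    multiplier: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call `lookup` until it returns at least one address.

    An empty result is treated the same way as a failed lookup, so the
    caller never receives a zero-length address list.
    """
    backoff = initial_backoff
    last_error: Exception = EmptyAddressList("no attempts made")
    for attempt in range(1, max_attempts + 1):
        try:
            addresses = lookup()
            if addresses:
                return addresses
            last_error = EmptyAddressList(f"attempt {attempt}: 0 addresses")
        except OSError as exc:  # e.g. the DNS server is unreachable
            last_error = exc
        if attempt < max_attempts:
            sleep(backoff)  # wait before the next resolution attempt
            backoff *= multiplier
    raise last_error
```

In this model the caller only ever sees a non-empty list or an exception,
which is the property the freeze described above seems to violate.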

What normally listens for the UpdateState from the channel_control_helper? 
That might give us a good hint for why the client channel stops after that 
point.
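
One way to capture the trace output requested earlier in this thread is to
launch the client with GRPC_VERBOSITY and GRPC_TRACE set and save its stderr.
A minimal Python sketch follows. The helper names and the log path are
hypothetical, and it assumes the trace output is written to stderr.

```python
import os
import subprocess
from typing import Dict, List, Optional

# Tracer names as suggested in this thread.
TRACERS = ["client_channel_routing", "pick_first", "cares_resolver"]


def grpc_debug_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of `base` (default: os.environ) with gRPC tracing enabled."""
    env = dict(os.environ if base is None else base)
    env["GRPC_VERBOSITY"] = "DEBUG"
    env["GRPC_TRACE"] = ",".join(TRACERS)
    return env


def run_with_trace(command: List[str], log_path: str) -> int:
    """Run `command` with tracing enabled, writing its stderr to `log_path`."""
    with open(log_path, "wb") as log:
        completed = subprocess.run(command, env=grpc_debug_env(), stderr=log)
    return completed.returncode
```

For example, `run_with_trace(["./my_client"], "grpc_trace.log")` would run a
(hypothetical) client binary and leave the trace in `grpc_trace.log`.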

Thanks!
Chi
Cisco Meraki

On Wednesday, August 17, 2022 at 11:35:47 AM UTC-6 Mark D. Roth wrote:

> Can you try running with the following environment variables set, and 
> share the log?  That might help us figure out what's going on here.
>
> GRPC_VERBOSITY=DEBUG
> GRPC_TRACE=client_channel_routing,pick_first,cares_resolver
>
> In general, the c-ares resolver should return an error when there's an
> empty address list, so it should automatically retry the resolution
> periodically until it succeeds.  The only exception I see in the code is if
> there are balancer addresses successfully returned, but that shouldn't be
> the case if you're using pick_first.  Unless maybe you're using a service
> config in DNS, but the service config lookup is failing also?
>
> Anyway, getting some additional logs will probably help us understand 
> what's going wrong here.

Re: [grpc-io] Re: grpc stops forward progress if DNS resolve has 0 addresses

2022-08-10 Thread 'Peter Hurley' via grpc.io
Thanks for the reply.

> And would it be possible for you to upgrade your gRPC library and try to
> reproduce this?
I didn't see any similar issue (marked fixed or not) in
https://github.com/grpc/grpc/issues; we were hoping the community could
confirm whether this has been observed and fixed already but went
unreported on GitHub.

> v1.36.4 is over a year old, and a fair handful of bug fixes have gone in
> since then.
We're using the still-experimental TLSCredentials, so every version bump is
non-trivial, and we've already found and fixed a number of core bugs
ourselves, so it'll be a while before we're upgrading again in production.

> Regarding that, are you able to reproduce the conditions in which the
> failure occurs, or are they maybe not fully understood? e.g., run a local
> DNS server for testing, and modify its records.
Yeah, the exact conditions are not well understood, but it is almost
certainly happening during a restart of the local caching dnsmasq server
due to intermittent connection loss.
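
Since the suspected trigger is a dnsmasq restart, one way to narrow down the
conditions could be to poll DNS alongside the client and record when a lookup
fails or returns no addresses, then line those timestamps up with the gRPC
logs. A minimal sketch using only the Python standard library follows. Note
that it goes through the system resolver via `socket.getaddrinfo`, not c-ares,
so it only approximates what gRPC sees; `lookup`, `probe_once` and
`probe_loop` are hypothetical names.

```python
import socket
import time
from typing import Callable, List, Tuple


def lookup(host: str, port: int) -> List[str]:
    """Resolve `host` with the system resolver and return unique IP strings."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def probe_once(resolve: Callable[[], List[str]]) -> Tuple[str, str]:
    """Classify one lookup as 'ok', 'empty', or 'error'."""
    try:
        addresses = resolve()
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return "error", str(exc)
    if not addresses:
        return "empty", ""
    return "ok", ",".join(addresses)


def probe_loop(resolve, interval: float, count: int, sleep=time.sleep, log=print):
    """Run `count` probes `interval` seconds apart, logging every result."""
    results = []
    for i in range(count):
        status, detail = probe_once(resolve)
        log(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {status} {detail}")
        results.append(status)
        if i + 1 < count:
            sleep(interval)
    return results
```

For example, `probe_loop(lambda: lookup("my-service.example.com", 443), 1.0, 600)`
would probe a placeholder hostname once per second for ten minutes while
dnsmasq is restarted.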


On Fri, Aug 5, 2022 at 8:35 PM 'AJ Heller' via grpc.io <
grpc-io@googlegroups.com> wrote:

> That's mysterious. Do you know what the state of the DNS records is when
> this occurs? And would it be possible for you to upgrade your gRPC library
> and try to reproduce this? v1.36.4 is over a year old, and a fair handful
> of bug fixes have gone in since then.
>
>> We've been unable to reproduce this failure in testing, and would
>> appreciate any pointers:
>
> Regarding that, are you able to reproduce the conditions in which the
> failure occurs, or are they maybe not fully understood? e.g., run a local
> DNS server for testing, and modify its records.
>
>> - what is supposed to re-kick a new DNS resolve if the server list is
>>   empty?
>> - where to check in the resolver code for an empty server list?
>> - or any other ideas for how to track down the problem
>>
>> We're using grpc v1.36.4 w/ libcares2 1.14
>>
>> Regards,
>> Peter Hurley
>>
> --
> You received this message because you are subscribed to the Google Groups
> "grpc.io" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to grpc-io+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/grpc-io/306779dd-0a68-4b95-851e-0a5979a4e872n%40googlegroups.com.
>

-- 
You received this message because you are subscribed to the Google Groups 
"grpc.io" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to grpc-io+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/grpc-io/CAKzaEUf00rkYWHD6aq1nks8WhVo59wrTcaspkMk2EHUDc1b0JQ%40mail.gmail.com.