Re: ELB scaling => sudden backend tragedy

Lukas Tribus Thu, 24 Oct 2019 10:45:20 -0700

Hello,

On Thu, Oct 24, 2019 at 5:53 PM Jim Freeman <sovr...@gmail.com> wrote:
>
> Yesterday we had an ELB scale to 26 IP addresses, at which time ALL of the 
> servers in that backend were suddenly marked down, e.g. :
>
>    Server www26 is going DOWN for maintenance (unspecified DNS error)
>
> Ergo, ALL requests to that backend got 503s ==> complete outage
>
> Mayhap src/dns.c::dns_validate_dns_response() bravely running away when 
> DNS_RESP_TRUNCATED (skipping parsing of the partial list of servers, 
> abandoning TTL updates to perfectly good endpoints) is not the best course of 
> action ?
>
> Of course we'll hope (MTUs allowing) that we'll be able to paper this over 
> for awhile using an accepted_payload_size >default(512).


I agree this is basically a ticking time-bomb for everyone not
thinking about the DNS payload size every single day.

However, we also need to make sure people become aware of it when
they hit the truncation limit. That would have to be at least a
warning at the critical syslog level.
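
To make the idea concrete, here is a minimal sketch (Python, not
HAProxy's actual C code) of detecting truncation from the TC bit in the
DNS header and logging it loudly. The function names and the log
wording are hypothetical; only the header layout comes from RFC 1035.

```python
import logging
import struct

TC_FLAG = 0x0200  # TC (truncation) bit in the 16-bit DNS header flags field


def is_truncated(response: bytes) -> bool:
    """Return True if the DNS response header has the TC bit set."""
    if len(response) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    # Header starts with a 16-bit ID followed by the 16-bit flags field.
    _msg_id, flags = struct.unpack("!HH", response[:4])
    return bool(flags & TC_FLAG)


def warn_if_truncated(response: bytes, log: logging.Logger) -> bool:
    """Log at critical level when a response is truncated (hypothetical policy)."""
    if is_truncated(response):
        log.critical("DNS response truncated; consider a larger payload size")
        return True
    return False
```

Whether the partial answer should still be parsed after such a warning
is a separate question; the sketch only covers the detection part.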


Reliable DNS resolution for everyone, without surprises, will only
happen with TCP-based DNS:
https://github.com/haproxy/haproxy/issues/185
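
For context on why TCP sidesteps the problem: over TCP each DNS message
is prefixed with a two-byte length (RFC 1035, section 4.2.2), so a
response can be up to 65535 bytes instead of being cut at the UDP
payload size. A minimal sketch of that framing, with hypothetical
helper names and no relation to how HAProxy would implement it:

```python
import struct

MAX_TCP_DNS_MESSAGE = 0xFFFF  # largest length the 2-byte prefix can express


def frame_for_tcp(message: bytes) -> bytes:
    """Prefix a DNS message with its 16-bit big-endian length."""
    if len(message) > MAX_TCP_DNS_MESSAGE:
        raise ValueError("DNS message too large for TCP framing")
    return struct.pack("!H", len(message)) + message


def unframe_from_tcp(buffer: bytes):
    """Return (message, rest), or (None, buffer) if the frame is incomplete."""
    if len(buffer) < 2:
        return None, buffer
    (length,) = struct.unpack("!H", buffer[:2])
    if len(buffer) < 2 + length:
        return None, buffer
    return buffer[2:2 + length], buffer[2 + length:]
```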

For the issue in question, on the other hand: can you file a bug on GitHub?



Thanks,

Lukas
