Hi Lukas,

On Thu, Mar 13, 2025 at 02:38:31PM +0100, Lukas Tribus wrote:
> You are thinking of a case where resolv.conf points to some recursive
> nameserver, and the haproxy configuration resolver config points to
> different ones.
> 
> However DNS is complex and there are *a lot* of behavior differences
> one can shoot himself in the foot with, other than using different
> servers.
> 
> 
> - for libc we can use gethostbyname or getaddrinfo based on how
> haproxy was built, impacting address family results
> - resolv.conf does not force udp or tcp, libc decides, and in most but
> not all libc's a UDP query automatically falls back to TCP
> - haproxy resolves explicitly via either UDP or TCP, it is the user's
> responsibility to fix issues and configure fallbacks
> - haproxy only supports FQDN while libc may search domains (man 5
> resolv.conf has lots of options)
> - handling bigger responses is likely different
> - EDNS0 handling is likely different
> - handling of DNS flags is likely different

OK, I never thought about all of this!

> For example case 1:
> An admin configures private resolvers in TCP mode to avoid issues with
> bigger response sizes, however unbeknownst to him TCP mode is not
> available/reachable for unrelated issues. The same name servers are
> configured in /etc/resolv.conf, so libc is able to resolve the private
> server IPs without issues, because libc uses UDP before falling back
> to TCP.

I generally agree (though I don't know how frequent this is). This also
raises the question of how relevant parse-resolv-conf is, then: if
discrepancies are this common, should we also discourage people from
using this option?
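
For reference, this is the kind of setup I have in mind, as a minimal
sketch (the section name "sysdns", the backend and the host name are made
up for illustration). As far as I understand, parse-resolv-conf only
imports the nameserver entries from /etc/resolv.conf, so the "search" and
"options" lines are not applied by the internal resolver:

```
resolvers sysdns
    # import the "nameserver" lines from /etc/resolv.conf
    parse-resolv-conf

backend app
    # startup resolution still goes through libc (default init-addr),
    # runtime resolution goes through the "sysdns" section above
    server s1 app.example.internal:8080 resolvers sysdns check
```

So even with the same name servers on both sides, the two code paths
remain distinct.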

> How much time and back and forth does this need in a support call, to
> find out that the haproxy internal resolver never run-time *updated*
> the server IPs because it never worked in the first place, hidden by
> the libc resolver which makes everything apparently work, when it
> would have been immediately obvious if libc resolution was disabled?

Oh yes. You know I'm for failing early!
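
To illustrate, a minimal sketch of a setup where libc is taken out of the
picture (names and addresses are made up, and it assumes a version that
supports TCP nameservers via the "tcp@" address prefix). With
"init-addr none" the server starts without an address and stays down
until the internal resolver gets a response, so a broken resolver shows
up right away instead of being masked:

```
resolvers privdns
    # stream (TCP) nameserver, hypothetical address
    nameserver ns1 tcp@192.0.2.53:53

backend app
    # no libc lookup at startup: the address comes only from "privdns"
    server s1 app.example.internal:8080 resolvers privdns init-addr none check
```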

> For example case 2:
> Lack of FQDN: same as case 1, libc searches a hostname in the local
> domain, haproxy does not. Again the internal resolver will fail to
> update server IPs and libc will hide this problem for some time.

Hmmm interesting as well, and trivial to trigger. Even just having an
FQDN whose entry only exists in /etc/hosts would do it.

> Even Luke's problem in this case was not really related to the
> differing results from nameservers, but the distinction between where
> libc resolving stops and haproxy internal resolving starts.
> 
> Every subtle difference in behavior can make the difference between a
> simple and a complex diagnosis, when two different implementations are
> involved, whether the root cause is an external factor, a local
> misconfiguration or a bug.

Based on your explanation, I agree. I do think, however, that this is
far from obvious and needs to be mentioned somewhere in the doc. I
always consider that a user asking for help or reporting an issue is
a failure of the doc.

> > > -carry on doing the first resolution when parsing the configuration.
> > > +keep trying to resolve names at startup during configuration parsing via libc
> > > +for backwards compatibility.
> >
> > "keep trying" makes me think it insists, which is not true because at
> > the first error it fails to start. However, the libc resolvers are
> > generally blocking, and can be slow since they are serialized. All of
> > these concepts should probably be covered to clarify the picture.
> 
> Yeah, I didn't like "carry on", but it also works without it:
> 
> > Whether run time server name resolution has been enabled or not, HAProxy will
> > do the first resolution at startup during configuration parsing via libc
> > for backwards compatibility.

OK.

> > Something along these lines maybe ?
> >
> >   Unless explicitly disabled via the server "init-addr" keyword, HAProxy
> >   will resolve server addresses on startup using the standard method
> >   provided by the operating system's C library ("libc"). It is important
> >   to understand that while this resolution generally relies on DNS, it
> >   can also involve other mechanisms that are specific to the deployment.
> >   If an address cannot be resolved, the process will stop with an error.
> >   In addition, resolutions are serialized, so that resolving addresses
> >   for 1000 servers will result in 1000 request-response cycles, which
> >   can take quite some time. Also, when DNS servers are unreachable or
> >   unresponsive, the libc can take a very long time before timing out for
> >   each and every server, rendering a startup impractical. Finally, if the
> >   servers are configured to rely on a "resolvers" section that references
> >   different DNS servers, the response from the libc might cause startup
> >   errors, or worse, long delays. For this reason it is important not to
> >   mix libc with other resolvers, and adjust the "init-addr" server setting
> >   according to the desired behavior.
> >
> > I'm fine with any other proposal, I just want to be sure that these points
> > are clarified, because clearly the DNS part in the doc suffers quite a bit
> > and would deserve a refresh!
> 
> I really want to drive home the point that this is not *only* about
> different DNS servers, but also about different resolution behavior
> when using the same DNS servers, because our code and libc code is not
> the same.

Totally got it now, thank you.

I'm fine then with generally discouraging people from using the two at
once.

Maybe we could think about changing the default init-addr over the
long term when resolvers are used (not sure this is easy to do, think
about defaults). Or maybe we could add another variant of libc, such as
"opt-libc" which would be libc only if resolvers is not used, and switch
to that by default. What do you think?

The only problem is that I'd like to be able to emit a warning before
changing that, and we probably don't want to force everyone to specify
init-addr everywhere if we'd want to change a default later. Or maybe
the resolution error should detect that resolvers are there and report
that the setting changed. It would not be super cool but it could help
users avoid serious traps.
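
Until then, the choice can already be made explicit with what exists
today. A rough sketch (section and host names are made up, and "opt-libc"
is only an idea at this point, it does not exist):

```
defaults
    # never use libc for servers inheriting these defaults: reuse the
    # address from the state file if there is one, otherwise start
    # without an address and let the resolvers fill it in
    default-server init-addr last,none

backend app
    # assumes a "resolvers privdns" section is defined elsewhere
    server s1 app.example.internal:8080 resolvers privdns check
```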

And the other thing is to improve the doc. I'm pretty sure I'm far from
being the only one around not thinking about all these "details" between
libc and resolvers.

I may try to write some text about that but I must confess I'm not
very comfortable with it, so I'll need some review.

Thanks,
Willy

