Hi Lukas,

On Thu, Nov 05, 2020 at 06:40:50PM +0100, Lukas Tribus wrote:
> What I don't like are code/subsystems that are not sufficiently
> covered maintenance- and maintainer-wise (whatever the reason may be).
> 
> In my opinion, the resolver code is like that today:
> 
> - issues (including bugs) are open for years
> - it's riddled with traps for the users that will suddenly blow up in
> their faces (lack of TCP support, IPv4 vs IPv6)
> - important discussions have come to a halt

Yes, I agree with you regarding this unpleasant situation, and actually
having someone else work on it is also one way to add resources to this
part.

The main issue that has plagued DNS is that the use cases have fundamentally
changed over time and it has constantly been abused to go a bit further.
Initially it was only designed to support an automatic IP address change
of your AWS machine that just rebooted. Then it had to evolve to support
late resolution. Then deduplication to support setting up farms. Then
SRV records, etc. And looking back, I can also say "what a monster we've
created". The problem here is not directly related to the DNS implementation
per se, but its integration within server farms and its actions there,
because, by design, this is supposed to do counter-intuitive things
consisting in changing some settings from those that a user has carefully
placed in a configuration, which themselves possibly contradict others
from a state file. Adding to this that partial responses must not cause
the immediate removal of absent entries, and that some people expect their
multiple LBs to be consistent where the protocol does not offer this
consistency, and all this while still trying not to break the initial use
case, we can easily see the total mess this has engendered. And I guess
that such irreconcilable use cases have not really helped propose durable
solutions to a number of issues.

My view on this is that we should not have given in to the abuses of the
initial DNS features, nor to the demands to abuse them further, but instead
we should have created a completely independent discovery mechanism. It's
obviously easy to say in hindsight, with the architectural foundations
available in 2020 that were not there in 2015. But for me, discovery is not
DNS, discovery
may use DNS or other services, but it's not the same thing as just
resolving a server name that may change at run time, and that's what
needs to be worked on separately.

What I'd like to see is our DNS protocol code being updated to support TCP,
the DNS stack being possibly made even more modular (possibly separating
the message processing from the resolving), and known issues addressed,
even if this requires the addition of a few options or keywords to choose
between one behavior or another, instead of just crossing fingers.

> I cannot help here (other than explaining why some current behaviours
> are bad and triaging the bugs on GH, which is also lacking: most dns
> issues do not even have the dns subsystem label). All this blunt
> critique without providing suggestions to improve the situation is
> rude, but since we are discussing DNS load-balancing (which sounds
> like adding new fuel to the fire to me), apparently with the same
> amount of resources and enthusiasm, I am concerned that we will end up
> in the same or worse situation, which is why I have to share my
> (negative) opinion about the current situation.

I totally understand. I don't see your point as a blunt critique nor any
form of negative feedback, quite the contrary. You're the one who deals
with the most user reports and knows best what works, what doesn't, and
what traps users fall into. I really do value your feedback on this.
Rest assured that for me it is also a concern, as I don't like to know
that some areas are a bit unstable nor to think "wow, should this work at
all?" when seeing a config. And we know there is another area suffering
from similar traps (though much less), which is the server-state-file,
just because, similarly, it deals with conflicts between a supposed
state and a configured state.

I, too, would like to see these points addressed, if possible for 2.4,
so that we don't have to wonder anymore if a config will work. This will
require breaking changes, but likely for the better given that users
regularly fall into traps. For the DNS resolvers, in my opinion, fixing the
technical issues like protocol limitations should be within reach. For the
state file, by
separating the administrative and operational states, and using a new
format, we should also address the concerns and at the same time make
them work better in relation with other dynamic changes (DNS included).

> > hate the noise that some people regularly make about "UDP support"
> 
> I am *way* more concerned about what to tell people when they report
> redundant production systems meltdowns because of the traps that we
> knew about for a long time and never improved. Like when the DNS
> response size surpasses accepted_payload_size and we don't have a TCP
> fallback,

This one should be addressed once TCP support is implemented. But here
again, I'm not interested in implementing fallbacks. We're not supposed
to be dealing with unknown public servers when it comes to discovery
(which is the case where large responses are expected), so I'd rather
have resolvers configured for a simple resolving use case (e.g. server
address change, UDP) or discovery (TCP only).
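
To make the "no fallback" idea concrete, here is a minimal sketch in
Python (it is not HAProxy code). The use-case names "resolve" and
"discovery", the mapping table and all function names are hypothetical.
The only protocol fact it relies on is the TC (truncation) bit, 0x0200 in
the flags word of the 12-byte DNS header (RFC 1035). A truncated UDP
response is reported as an error rather than silently retried over TCP.

```python
import struct

TC_FLAG = 0x0200  # TC (truncation) bit in the DNS header flags word (RFC 1035)

# Hypothetical mapping: exactly one transport per use case, set by configuration.
TRANSPORT_BY_USE_CASE = {"resolve": "udp", "discovery": "tcp"}


def pick_transport(use_case: str) -> str:
    """Return the single transport configured for a use case (no fallback)."""
    try:
        return TRANSPORT_BY_USE_CASE[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None


def is_truncated(response: bytes) -> bool:
    """Tell whether a raw DNS response has the TC bit set in its header."""
    if len(response) < 12:
        raise ValueError("shorter than the 12-byte DNS header")
    _txid, flags = struct.unpack("!HH", response[:4])
    return bool(flags & TC_FLAG)


def check_udp_response(response: bytes) -> bytes:
    """Accept a UDP response as-is, or fail loudly if it was truncated."""
    if is_truncated(response):
        raise RuntimeError("truncated response: configure this resolver for TCP")
    return response
```

With such a split, a truncated answer shows up as a visible configuration
problem instead of a partial server list.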

> or we don't force the users to specify the address-family
> for resolution, which is of course very wrong on a load-balancer.

Are you suggesting that we should use IPv4 only unless IPv6 is set, and
that under no circumstances should we switch from one to the other? I remember
that this was a difficult choice initially because we were trying to deal
with servers migrating to another region and being available using a
different protocol (I'm not personally a fan of mixing address families
for the same server in a load balancer, but I'm pretty certain we clearly
identified this need). But again while it's understandable for certain
(older?) cases, it's very likely that it makes absolutely no sense for
discovery.

> Of course I understand the DNS resolver code has nothing to do with
> future DNS load-balancing code.

Not completely. In order to reduce bugs we should also continue to work
with reusable protocol layers. For example 10 years ago the stats page
was just a hack pushing bytes on the wire when a certain URI was detected.
If a connection error had to be diagnosed, it could have come from anywhere.
Nowadays it's an HTTP service running in an applet, which can be compressed
or have headers adjusted, which can be sent over H1/H2 +/- SSL etc. It's
much more segmented and bugs are way easier to spot.

For the DNS I'd like to be sure that once the lower level speaks DNS, it
can work with UDP/TCP/whatever and expose its services to the resolvers
or to the load balancing. Maybe the matrix will not be full between them,
that's not a big deal. But at least DNS-based issues will be addressed in
the DNS layer and resolver/discovery/balancing issues at another layer
(and maybe even by different people over time).
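
As an illustration of that layering, here is a minimal Python sketch (not
HAProxy code, all function names are hypothetical). The message layer
builds a query without knowing how it will be sent, and the transport
layer only adds framing: nothing for UDP, a two-byte big-endian length
prefix for TCP, as specified in RFC 1035. It assumes a single question in
class IN and does not handle the root name.

```python
import struct


def encode_name(name: str) -> bytes:
    """Message layer: encode a host name as length-prefixed DNS labels."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError(f"invalid label: {label!r}")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(txid: int, name: str, qtype: int = 1) -> bytes:
    """Message layer: 12-byte header (RD set, one question) plus the question."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    return header + question


def frame_udp(message: bytes) -> bytes:
    """Transport layer: a UDP datagram carries the message unchanged."""
    return message


def frame_tcp(message: bytes) -> bytes:
    """Transport layer: TCP prefixes the message with its 16-bit length."""
    return struct.pack("!H", len(message)) + message


def unframe_tcp(data: bytes) -> bytes:
    """Transport layer: strip the length prefix from one complete TCP frame."""
    (length,) = struct.unpack("!H", data[:2])
    if len(data) - 2 < length:
        raise ValueError("incomplete TCP frame")
    return data[2:2 + length]
```

The same `build_query()` output can then be handed to either framing
function, so a bug in name encoding and a bug in TCP framing end up in
clearly separate places.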

> But the fact of the matter is that a
> new subsystems/featureset require sustained effort, time and frankly
> also interest.

... and skills. This is also how better layering helps. Finding
someone who's at ease both with the on-wire protocol format and the latest
end-user discovery stuff of the day is extremely hard. Finding people
at ease with and interested in their respective layers is much easier.

What is interesting with the thinking around the LB mechanism is that
it constantly forces one to think in terms of proxying, which requires
adapting to both sides. "How do I pass a large TCP response into small
UDP packets" for example will help figure out the best place to set some
protocol-level knobs, the message-level adaptations, the cache etc. I
definitely want this to be taken into account during the relayering of
the DNS stack. And all of this is also what motivates Emeric to know
how far it is reasonable to go on the server-side support, with what
impacts on existing features, if any.
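
One possible answer to the "large TCP response into small UDP packets"
question, sketched in Python under stated assumptions (this is not
HAProxy code and `fit_for_udp` is a hypothetical name): keep as many
whole answer records as fit within the client's payload limit, and set
the TC bit when some had to be dropped. The sketch assumes one question,
already-encoded records, and no authority or additional sections.

```python
import struct

TC_FLAG = 0x0200  # TC (truncation) bit in the DNS header flags word (RFC 1035)


def fit_for_udp(txid: int, flags: int, question: bytes,
                answers: list, max_payload: int = 512) -> bytes:
    """Rebuild a response that fits in max_payload bytes.

    `question` is the encoded question section and `answers` a list of
    encoded answer records, e.g. taken from a larger TCP response.
    """
    kept = []
    size = 12 + len(question)  # header plus question are always sent
    for record in answers:
        if size + len(record) > max_payload:
            break  # never emit a partial record
        kept.append(record)
        size += len(record)
    if len(kept) < len(answers):
        flags |= TC_FLAG  # tell the client the answer is incomplete
    header = struct.pack("!HHHHHH", txid, flags, 1, len(kept), 0, 0)
    return header + question + b"".join(kept)
```

Whether the limit, the truncation policy and the cache belong to the
protocol layer or to the message layer is exactly the kind of decision
this exercise helps to place.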

Last point, based on your observations, do you think that some features
are useless, confusing or misused and should be either removed or have
their defaults changed? I don't have anything in mind, but if Emeric
needs to fight with some of them when trying to implement the TCP part,
just to discover at the end that the thing that was the hardest to deal with
is useless or even wrong, that's a waste of energy and it adds to the
technical debt, so we'd rather know early.

Thanks!
Willy
