Hi Lukas,

On Thu, Nov 05, 2020 at 06:40:50PM +0100, Lukas Tribus wrote:
> What I don't like are code/subsystems that are not sufficiently
> covered maintenance- and maintainer-wise (whatever the reason may be).
>
> In my opinion, the resolver code is like that today:
>
> - issues (including bugs) are open for years
> - it's riddled with traps for the users that will suddenly blow up in
>   their faces (lack of TCP support, IPv4 vs IPv6)
> - important discussions have come to a halt

Yes, I agree with you regarding this unpleasant situation, and actually having someone else work on it is also one way to add resources to this part. The main issue that plagued DNS is that the use cases have fundamentally changed over time and it has constantly been abused to go a bit further. Initially it was only designed to support an automatic IP address change of your AWS machine that just rebooted. Then it had to evolve to support late resolution. Then deduplication to support setting up farms. Then SRV records, etc. And looking back, I can also say "what a monster we've created".

The problem here is not directly related to the DNS implementation per se, but to its integration within server farms and its actions there, because, by design, this is supposed to do counter-intuitive things: changing some settings from those that a user has carefully placed in a configuration, which themselves possibly contradict others from a state file. Add to this that partial responses must not cause the immediate removal of absent entries, and that some people expect their multiple LBs to be consistent where the protocol does not offer this consistency, all this while still trying not to break the initial use case, and we can easily see the total mess this has engendered. And I guess that such irreconcilable use cases have not really helped propose durable solutions to a number of issues.

My vision on this is that we should not have ceded to the abuses nor to the demands for abuse of the initial DNS features, but instead we should have created a completely independent discovery mechanism. It's obviously easy to say in hindsight, with the architectural foundations available in 2020 that were not there in 2015. But for me, discovery is not DNS: discovery may use DNS or other services, but it's not the same thing as just resolving a server name that may change at run time, and that's what needs to be worked on separately.

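As a minimal sketch of the partial-response constraint mentioned above (an entry that is absent from one answer must not be removed immediately), one could track when each address was last seen and only drop it after a grace period. Everything here is an assumption made for illustration: the `Farm` class, the `hold_obsolete` delay and the `apply_answer` method are hypothetical names and do not describe the existing resolver code.

```python
from dataclasses import dataclass, field


@dataclass
class Farm:
    """Hypothetical server farm fed by DNS answers (illustration only)."""
    hold_obsolete: float = 30.0  # assumed grace period, in seconds
    last_seen: dict = field(default_factory=dict)  # address -> time last seen

    def apply_answer(self, addresses, now):
        """Merge one (possibly partial) answer, return the active addresses."""
        # Refresh every address present in this answer.
        for addr in addresses:
            self.last_seen[addr] = now
        # Drop an address only once it has been absent for longer than
        # the grace period, never on the first answer that omits it.
        expired = [a for a, seen in self.last_seen.items()
                   if now - seen > self.hold_obsolete]
        for addr in expired:
            del self.last_seen[addr]
        return sorted(self.last_seen)


farm = Farm(hold_obsolete=30.0)
farm.apply_answer(["10.0.0.1", "10.0.0.2", "10.0.0.3"], now=0.0)
# A partial answer omits 10.0.0.3, which is still kept for now.
partial = farm.apply_answer(["10.0.0.1", "10.0.0.2"], now=10.0)
# Still absent once the grace period has elapsed: it is removed.
later = farm.apply_answer(["10.0.0.1", "10.0.0.2"], now=45.0)
```

The sketch deliberately ignores the other constraints listed above (configured versus state-file settings, consistency between several LBs), which is precisely where the use cases start to conflict.
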
What I'd like to see is the DNS protocol being updated to support TCP, the DNS stack being possibly made even more modular (possibly separating the message processing from the resolving), and known issues addressed, even if this requires the addition of a few options or keywords to choose between one behavior or another, instead of just crossing fingers.

> I cannot help here (other than explaining why some current behaviours
> are bad and triaging the bugs on GH, which is also lacking: most dns
> issues do not even have the dns subsystem label). All this blunt
> critique without providing suggestions to improve the situation is
> rude, but since we are discussing DNS load-balancing (which sounds
> like adding new fuel to the fire to me), apparently with the same
> amount of resources and enthusiasm, I am concerned that we will end up
> in the same or worse situation, which is why I have to share my
> (negative) opinion about the current situation.

I totally understand. I don't see your point as a blunt critique nor any form of negative feedback, quite the contrary. You're the one who deals with the most user reports and knows best what works, what doesn't, and what traps users fall into. I really do value your feedback on this.

Rest assured that for me it is also a concern, as I don't like to know that some areas are a bit unstable nor to think "wow, should this work at all?" when seeing a config. And we know there is another area suffering from similar traps (though much less), which is the server-state-file, just because, similarly, it deals with conflicts between a supposed state and a configured state.

I, too, would like to see these points addressed, if possible for 2.4, so that we don't have to wonder anymore if a config will work. This will require breaking changes, but likely for good given that users regularly fall into traps. For the DNS resolvers, in my opinion, the technical issues like protocol limitations should be within reach.
For the state file, by separating the administrative and operational states, and using a new format, we should also address the concerns and at the same time make them work better in relation with other dynamic changes (DNS included).

> > hate the noise that some people regularly make about "UDP support"
>
> I am *way* more concerned about what to tell people when they report
> redundant production systems meltdowns because of the traps that we
> knew about for a long time and never improved. Like when the DNS
> response size surpasses accepted_payload_size and we don't have a TCP
> fallback,

This one should be addressed once TCP support is implemented. But here again, I'm not interested in implementing fallbacks. We're not supposed to be dealing with unknown public servers when it comes to discovery (which is the case where large responses are expected), so I'd rather have resolvers configured either for a simple resolving use case (e.g. server address change, UDP) or for discovery (TCP only).

> or we don't force the users to specify the address-family
> for resolution, which is of course very wrong on a load-balancer.

Are you suggesting that we should use IPv4 only unless IPv6 is set, and that under no circumstance we switch from one to the other? I remember that this was a difficult choice initially because we were trying to deal with servers migrating to another region and being available using a different protocol (I'm not personally a fan of mixing address families for the same server in a load balancer, but I'm pretty certain we clearly identified this need). But again, while it's understandable for certain (older?) cases, it's very likely that it makes absolutely no sense for discovery.

> Of course I understand the DNS resolver code has nothing to do with
> future DNS load-balancing code.

Not completely. In order to reduce bugs we should also continue to work with reusable protocol layers.
For example 10 years ago the stats page was just a hack pushing bytes on the wire when a certain URI was detected. If a connection error had to be diagnosed, it could have come from anywhere. Nowadays it's an HTTP service running in an applet, which can be compressed or have headers adjusted, which can be sent over H1/H2 +/- SSL etc. It's much more segmented and bugs are way easier to spot.

For the DNS I'd like to be sure that once the lower level speaks DNS, it can work with UDP/TCP/whatever and expose its services to the resolvers or to the load balancing. Maybe the matrix will not be full between them, that's not a big deal. But at least DNS-based issues will be addressed in the DNS layer and resolver/discovery/balancing issues at another layer (and maybe even by different people over time).

> But the fact of the matter is that a
> new subsystems/featureset require sustained effort, time and frankly
> also interest.

... and skills. This is also how a better layering does help. Finding someone who's at ease both with the on-wire protocol format and the latest end-user discovery stuff of the day is extremely hard. Finding people at ease with and interested in their respective layers is much easier.

What is interesting with the thinking around the LB mechanism is that it constantly forces one to think in terms of proxying, which requires adapting to both sides. "How do I pass a large TCP response into small UDP packets", for example, will help figure out the best place to set some protocol-level knobs, the message-level adaptations, the cache etc. I definitely want this to be taken into account during the relayering of the DNS stack. And all of this is also what motivates Emeric to know how far it is reasonable to go on the server-side support, with what impacts on existing features, if any.

Last point, based on your observations, do you think that some features are useless, confusing or misused and should be either removed or have their defaults changed?
I don't have anything in mind, but if Emeric needs to fight with some of them when trying to implement the TCP part, just to discover at the end that the thing that was the hardest to deal with is useless or even wrong, that's a waste of energy and it adds to the technical debt, so we'd rather know early.

Thanks!
Willy