On Wed, 10 May 2017 22:12:37 +0200, Kai Krakow <hurikha...@gmail.com> wrote:
> On Tue, 9 May 2017 20:37:16 +0200, Lennart Poettering
> <lenn...@poettering.net> wrote:
>
> > On Tue, 09.05.17 00:42, Kai Krakow (hurikha...@gmail.com) wrote:
> >
> > > On Sat, 6 May 2017 14:22:21 +0200, Kai Krakow
> > > <hurikha...@gmail.com> wrote:
> > > > [...]
> > [...]
> > [...]
> > [...]
> [...]
>
> > > Fixed by restarting the router. The cable modem seems to be buggy
> > > with UDP packets after a lot of uptime: it simply drops UDP
> > > packets silently at regular intervals, and the WebUI was also
> > > very slow, probably a CPU issue.
> > >
> > > I'll follow up on this with the cable provider.
> > >
> > > When the problem starts to show up, systemd-resolved is affected
> > > more by this than direct resolving. I don't know if there's
> > > something that could be optimized in systemd-resolved to handle
> > > such issues better, but I don't consider it a bug in
> > > systemd-resolved; it was a local problem.
> >
> > Normally configured DNS servers should be equivalent, and hence
> > switching them for each retry should not come at any cost. So,
> > besides the extra log output, do you experience any real issues?
>
> Since I restarted the router, there are no longer any such logs,
> except maybe a few per day (fewer than 4).
>
> But when those logs were being spammed to the journal, the real
> problem was the DNS resolver taking 10 seconds about once per minute
> to resolve a website address - which really was a pain.
>
> Then again, what could systemd-resolved have done about it when the
> real problem was some network equipment?
>
> I just wonder why it was less visible when using those DNS servers
> directly. Since DNS must have been designed with occasional packet
> loss in mind (because it uses UDP), there must be a way to handle
> this better. So I read a bit in https://www.ietf.org/rfc/rfc1035.txt.
>
> RFC 1035 section 4.2.1 suggests that the retransmission interval for
> queries should be 2-5 seconds, depending on statistics of previous
> queries.
> To me, "retransmission" means the primary DNS server should not be
> switched for each query timeout (while still allowing the same
> request to be handed to the next available server).
>
> RFC 1035 section 7 discusses the suggested implementation of the
> resolver and covers retransmission and server selection algorithms:
>
> It suggests recording the average response time for each server
> queried, so that the servers which respond faster are tried first.
> Without query history, the selection algorithm should pretend a
> response time of 5-10 seconds.
>
> It also suggests switching the primary server only after some
> "bizarre" error or a server error reply. However, I don't think a
> failing server should actually be removed from the list as suggested
> there, since we are a client resolver, not a server resolver which
> can update its peer lists from neighbor servers. Instead, we could
> reset its query time statistics to move it to the end of the list,
> and/or blacklist it for a while.
>
> Somewhere else in that document I've read that it is well permitted
> to interleave multiple parallel requests to multiple DNS servers in
> the list. So I guess it would be nice, and allowed, if
> systemd-resolved used more than one DNS server at the same time by
> alternating between them for each request - maybe taking the two
> best according to the query time statistics.
>
> I also think it should perhaps use shorter timeouts for queries,
> since there can be more than one DNS server: the initial query time
> statistic should pretend 5-10 seconds, while the retransmission
> interval suggests 2-5 seconds.
>
> I think it would work to use "10 seconds divided by the server
> count", or 2 seconds, whichever is bigger, as the timeout for query
> rotation. But a late reply should still be accepted, as pointed out
> in section 7.3, even when the query was already rotated to the next
> DNS server.
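To make that rotation timeout rule concrete, here is a minimal sketch of the "10 seconds divided by server count, but at least 2 seconds" formula. The helper name and its unit are my own invention for illustration, nothing from the resolved code base:

```c
/* Sketch of the proposed rotation timeout: 10 seconds divided by the
 * number of configured DNS servers, clamped to the 2-second lower
 * bound from RFC 1035 section 4.2.1. With a single server there is
 * no rotation, so the full 10-second budget applies. */
static unsigned rotation_timeout_sec(unsigned n_servers) {
        unsigned t;

        if (n_servers <= 1)
                return 10;

        t = 10 / n_servers;
        return t < 2 ? 2 : t;
}
```

With 4 servers on a link, as in my case, this yields a 2-second rotation timeout per server, so a single dropped UDP packet costs roughly 2 seconds instead of 5 or 10.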
> Using only a single DNS server can skip all this logic, as there is
> no rotation, and would work with a timeout of 10 seconds.
>
> That way, systemd-resolved would "learn" to use only the fastest DNS
> server, and when it becomes too slow (which is 5-10 seconds based on
> the RFC), it would switch to the next server. If parallel requests
> come in, it would use more DNS servers from the list in parallel,
> auto-sorted by query reply time. The startup order is the one given
> by the administrator (or whatever provides the DNS server list).
>
> Applied to my UDP packet loss (which seems to be single packet
> losses, as an immediate follow-up request would have got a reply),
> this would mean the systemd resolver gives me an address after 2-3
> seconds instead of 5 or 10, because I had 4 DNS servers on that
> link. This is more or less what I saw previously when I switched
> back to direct resolving instead of going through systemd-resolved.
>
> What do you think? I could see this becoming an implementation
> improvement project in the GitHub bug tracker. I would be willing to
> work on such an improvement, provided the existing code is not too
> difficult to understand, since I'm not a C pro (my last bigger
> project was about 10 years ago). Obviously, such an improvement
> could only be a spare-time project for me, taking some time.
>
> Of course, all this only works when all DNS servers on the same link
> can resolve the same zones. Otherwise you will get a lot of timeouts
> and switching. I know that some people prefer to give DNS server IPs
> that resolve different zones (or can only resolve some). I never
> understood why one would deploy such a misconfiguration, and it is
> mostly seen in the Windows world, but well: it exists. A warning in
> the log when detecting such a situation could address this. In the
> end, systemd-resolved is able to correctly handle per-link DNS
> servers.
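To sketch the selection side of RFC 1035 section 7 (track a smoothed response time per server, and pretend 5-10 seconds where there is no history yet), something along these lines could work. The struct, the names, and the 1/4 smoothing factor are purely illustrative, not what systemd-resolved does today:

```c
/* Pretend response time for servers without history, in ms; RFC 1035
 * section 7.2 suggests assuming 5-10 seconds. */
#define NO_HISTORY_MSEC 7500U

struct server_stats {
        unsigned samples;   /* number of replies measured so far */
        unsigned avg_msec;  /* smoothed average response time */
};

/* Record one measured round-trip time, folding it into a simple
 * exponential average: 3/4 old value, 1/4 new sample. */
static void stats_update(struct server_stats *s, unsigned rtt_msec) {
        if (s->samples == 0)
                s->avg_msec = rtt_msec;
        else
                s->avg_msec = (3 * s->avg_msec + rtt_msec) / 4;
        s->samples++;
}

/* Sort key for picking servers: measured servers rank by their
 * average; unmeasured ones sort behind anything that answered fast. */
static unsigned stats_rank(const struct server_stats *s) {
        return s->samples == 0 ? NO_HISTORY_MSEC : s->avg_msec;
}
```

Sorting the per-link server list by `stats_rank()` ascending gives the "fastest first" order; a server that starts timing out accumulates a large average and naturally drifts to the end of the list.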
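The other half of the behavior discussed here is the retry timeout itself: it should back off while packets are being lost, but relax again once a server starts answering, instead of staying pinned at its maximum. A rough sketch, with purely illustrative constants and function names:

```c
#define TIMEOUT_MIN_MSEC  500U  /* illustrative floor, not resolved's value */
#define TIMEOUT_MAX_MSEC 5000U  /* the 5-second cap discussed in this thread */

/* After a lost packet: double the retry timeout, capped at the max. */
static unsigned timeout_on_failure(unsigned cur_msec) {
        unsigned next = cur_msec * 2;
        return next > TIMEOUT_MAX_MSEC ? TIMEOUT_MAX_MSEC : next;
}

/* After a reply: halve the timeout again, clamped to the floor, so a
 * recovered server is probed quickly instead of waiting 5 s forever. */
static unsigned timeout_on_success(unsigned cur_msec) {
        unsigned next = cur_msec / 2;
        return next < TIMEOUT_MIN_MSEC ? TIMEOUT_MIN_MSEC : next;
}
```

Starting from the floor, four consecutive losses ramp the timeout 500 → 1000 → 2000 → 4000 → 5000 ms; two replies later it is back down to 2500 and then 1250 ms.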
I've prepared a patch and added a pull request:
https://github.com/systemd/systemd/pull/5953

I saw that the basic infrastructure as laid out in RFC 1035 is there,
but the current implementation can only increase timeouts and never
lowers them, so timeouts end up hard-limited to 5 seconds all the
time. That way, occasional UDP packet drops always result in 5-second
stalls. When there are bursts of packet drops, you easily get DNS
timeouts in clients (though this seems to have been fixed already in
systemd-resolved, as I no longer saw that effect), or at least stalls
of 5, 10, or 15 seconds, which is very annoying.

With this patch, after a DNS server recovers, the timeouts are
lowered again, so we fail over faster to the next server in case a
packet is lost again. It works pretty well for web browsing here.

I didn't test the effects on LLMNR and mDNS, as I cannot really
exercise them here in a proper test scenario.

-- 
Regards,
Kai

Replies to list-only preferred.

_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel