Hi knot-dns-users, I have knotd as a slave for a dnsmasq and I'm seeing fallback from IXFR to AXFR not working properly. TLDR: I have patches and while I'm waiting for my gitlab access to be allowed to submit MRs I wrote this email :)
What happens is that the first AXFR goes through and everything looks happy dandy, but if I restart dnsmasq to force a SOA update knot tries to do an IXFR which fails and doesn't fall back to AXFR. First the AXFR works: knotd: info: [myzone.example.net.] AXFR, incoming, remote 2001:db8::2@53, started knotd: [myzone.example.net.] AXFR, incoming, remote 2001:db8::2@53, finished, 0.00 seconds, 1 messages, 870 bytes knotd: info: [myzone.example.net.] refresh, remote 2001:db8::2@53, zone updated, 0.00 seconds, serial none -> 1632501860 when I restart dnsmasq to force a SOA update, the following IXFR fails: knotd: info: [myzone.example.net.] refresh, remote 2001:db8::2@53, remote serial 1632503553, zone is outdated knotd: warning: [myzone.example.net.] IXFR, incoming, remote 2001:db8::2@53, malformed response SOA knotd: warning: [myzone.example.net.] refresh, remote myremote-dnsmasq not usable knotd: error: [myzone.example.net.] refresh, failed (no usable master) I've been sitting on this bug for a while now since I couldn't make heads or tails of what the code is doing but my third try at debugging this finally worked out. If we look at the fallback logic in src/knot/events/handlers/refresh.c, in try_refresh(), it revolves around this kind of gnarly while statement: // while loop runs 0x or 1x; IXFR to AXFR failover while (ret = knot_requestor_exec(&requestor, req, timeout), ret = (data.ret == KNOT_EOK ? ret : data.ret), ixfr_error_failover(ret) && data.xfr_type == XFR_TYPE_IXFR && data.state != STATE_SOA_QUERY) { If the while loop is taken fallback to AXFR happens otherwise not. Attaching gdb to my running knot I can see that in my failure case all conditionals except `data.xfr_type == _IXFR` are true but xfr_type is in fact not _IXFR but _ERROR. As far as I can tell the only reason for this check is to terminate the loop once the fallback to AXFR has happened as `_AXFR != _IXFR`. I shall elaborate on this below. So where does xfr_type become _ERROR? In my case determine_xfr_type() returns _ERROR due to the first check, `answer->count < 1` and ixfr_consume assigns this to xfr_type which in turn happens as part of the knot_requestor_exec() call in the while. So how does the fallback usually happen then? Well ixfr_consume checks the RCODE and fails immediately, without modifying xfr_type that is, if it's no good. That way the while statement will run the fallback code. // Check RCODE if (knot_pkt_ext_rcode(pkt) != KNOT_RCODE_NOERROR) { [...] return KNOT_STATE_FAIL; } I'm not really familliar with the DNS RFCs so I'm not sure if dnsmasq is supposed to repond with an error to unsupported query types but from some quick experiments against prominent authorative DNS servers around the internet it does seem like it's behaviour, reponding with NOERROR, an empty answer section and a SOA in authority is what everyone else does too when they get an rrtype they don't care for. Why now would the fallback logic not want to fallback to AXFR in this case? I can't think of a reason it would really want to do that. Like I said above it seems to me the only reason for the _IXFR check in the while is to make sure the loop terminates. Though IMO this is a very roundabout way of doing that and just using an if would be cleaner to boot. Anyway, let's do some git archeology on that point. The first related commit I can find is commit 39e447e4590c7d5fc12a5c05ed11a45fd6561dda Author: Libor Peltan <libch...@gmail.com> Date: Wed Oct 17 10:08:53 2018 +0200 ixfr/axfr: failover even when the ixfr reply wasn't valid before: ixfr falls back to axfr only if valid, but negative reply, e.g. NOTAUTH, journal inconsistency.. after: ixfr falls back to axfr in any case ixfr fails, e.g. malformed reply packet This is the commit that introduced the while statement for the fallback in try_refresh. It looked like so back then: // while loop runs 0x or 1x; IXFR to AXFR failover while ((ret = knot_requestor_exec(&requestor, req, timeout)) != KNOT_EOK && data.xfr_type == XFR_TYPE_IXFR) { Much simpler, if knot_requestor_exec returns a failure and xfr_type is _IXFR the fallback to AXFR happens. Now what I don't understand is that from the commit message this fix sounds just like my issue. Dnsmasq sends back what knot considers an invalid reply so why didn't this fix my problem too? Weird. Looking further I find commit 5c1a5e28797f9c2eab1300247c052451ccd3be3e Author: Libor Peltan <libor.pel...@nic.cz> Date: Tue Jan 31 18:07:28 2017 +0100 XFR: proper handling of single-SOA first response message which introduced determine_xfr_type and the XFR_TYPE_ERROR distinction. If I build the commit just before that one a8237b1f8 xfr: separated functions consuming one rrset of [ai]xfr response and setup a config to reproduce my setup, then XFR against dnsmasq actually starts to work. Back at 5c1a5e287 things are broken with the "malformed SOA" message again. So we likely found the culprit. Ok so back at the bad commit 5c1a5e287 // IXFR to AXFR failover - if (data->is_ixfr && next == KNOT_STATE_FAIL) { + if (data->xfr_type == XFR_TYPE_IXFR && next == KNOT_STATE_FAIL) { This is what introduced the `xfr_type == _IXFR` check which later got moved from this if statement in transfer_consume() to the while loop in try_refresh() by 39e447e45. The check used to be just for is_ixfr which got initialized in refresh_begin to start with (in the good commit 5c1a5e287~1): if (data->soa) { data->state = STATE_SOA_QUERY; data->is_ixfr = true; } else { in bad commit 5c1a5e287: if (data->soa) { data->state = STATE_SOA_QUERY; data->xfr_type = XFR_TYPE_IXFR; data->initial_soa_copy = NULL; } else { and then got reset to false by the fallback code. So I'm pretty convinced at this point that 5c1a5e287 just messed up that logic and the specificity of the XFR_TYPE_IXFR check really is unintended. To fix this we simply replace that check with a counter that ensures the loop will run at most twice like so: --- a/src/knot/events/handlers/refresh.c +++ b/src/knot/events/handlers/refresh.c @@ -1188,12 +1188,12 @@ static int try_refresh(conf_t *conf, zone_t *zone, const conf_remote_t *master, int timeout = conf->cache.srv_tcp_remote_io_timeout; - int ret; + int ret, i; // while loop runs 0x or 1x; IXFR to AXFR failover while (ret = knot_requestor_exec(&requestor, req, timeout), ret = (data.ret == KNOT_EOK ? ret : data.ret), - ixfr_error_failover(ret) && data.xfr_type == XFR_TYPE_IXFR && + ixfr_error_failover(ret) && i++ <= 1 && data.state != STATE_SOA_QUERY) { REFRESH_LOG(LOG_WARNING, data.zone->name, data.remote, "fallback to AXFR (%s)", knot_strerror(ret)); --- Thoughts? Am I missing something this check is doing? --Daniel -- https://lists.nic.cz/mailman/listinfo/knot-dns-users