[knot-dns-users] Fixing broken AXFR zone transfer with knot slave and dnsmasq as master

Daniel Gröber Sat, 25 Sep 2021 22:43:47 -0700

Hi knot-dns-users,

I have knotd as a slave for a dnsmasq and I'm seeing fallback from IXFR to
AXFR not working properly. TLDR: I have patches and while I'm waiting for
my gitlab access to be allowed to submit MRs I wrote this email :)


What happens is that the first AXFR goes through and everything looks happy
dandy, but if I restart dnsmasq to force a SOA update knot tries to do an
IXFR which fails and doesn't fall back to AXFR.

First the AXFR works:

    knotd: info: [myzone.example.net.] AXFR, incoming, remote 2001:db8::2@53, 
started
    knotd: [myzone.example.net.] AXFR, incoming, remote 2001:db8::2@53, 
finished, 0.00 seconds, 1 messages, 870 bytes
    knotd: info: [myzone.example.net.] refresh, remote 2001:db8::2@53, zone 
updated, 0.00 seconds, serial none -> 1632501860

when I restart dnsmasq to force a SOA update, the following IXFR fails:

    knotd: info: [myzone.example.net.] refresh, remote 2001:db8::2@53, remote 
serial 1632503553, zone is outdated
    knotd: warning: [myzone.example.net.] IXFR, incoming, remote 
2001:db8::2@53, malformed response SOA
    knotd: warning: [myzone.example.net.] refresh, remote myremote-dnsmasq not 
usable
    knotd: error: [myzone.example.net.] refresh, failed (no usable master)

I've been sitting on this bug for a while now since I couldn't make heads
or tails of what the code is doing but my third try at debugging this
finally worked out.

If we look at the fallback logic in src/knot/events/handlers/refresh.c, in
try_refresh(), it revolves around this kind of gnarly while statement:

        // while loop runs 0x or 1x; IXFR to AXFR failover
        while (ret = knot_requestor_exec(&requestor, req, timeout),
               ret = (data.ret == KNOT_EOK ? ret : data.ret),
               ixfr_error_failover(ret) && data.xfr_type == XFR_TYPE_IXFR &&
               data.state != STATE_SOA_QUERY) {

If the while loop is taken fallback to AXFR happens otherwise not.

Attaching gdb to my running knot I can see that in my failure case all
conditionals except `data.xfr_type == _IXFR` are true but xfr_type is in
fact not _IXFR but _ERROR. As far as I can tell the only reason for this
check is to terminate the loop once the fallback to AXFR has happened as
`_AXFR != _IXFR`. I shall elaborate on this below.

So where does xfr_type become _ERROR? In my case determine_xfr_type()
returns _ERROR due to the first check, `answer->count < 1` and ixfr_consume
assigns this to xfr_type which in turn happens as part of the
knot_requestor_exec() call in the while.

So how does the fallback usually happen then? Well ixfr_consume checks the
RCODE and fails immediately, without modifying xfr_type that is, if it's no
good. That way the while statement will run the fallback code.

    // Check RCODE
    if (knot_pkt_ext_rcode(pkt) != KNOT_RCODE_NOERROR) {
        [...]
        return KNOT_STATE_FAIL;
    }

I'm not really familliar with the DNS RFCs so I'm not sure if dnsmasq is
supposed to repond with an error to unsupported query types but from some
quick experiments against prominent authorative DNS servers around the
internet it does seem like it's behaviour, reponding with NOERROR, an empty
answer section and a SOA in authority is what everyone else does too when
they get an rrtype they don't care for.

Why now would the fallback logic not want to fallback to AXFR in this case?
I can't think of a reason it would really want to do that.  Like I said
above it seems to me the only reason for the _IXFR check in the while is to
make sure the loop terminates. Though IMO this is a very roundabout way of
doing that and just using an if would be cleaner to boot.

Anyway, let's do some git archeology on that point. The first related
commit I can find is

    commit 39e447e4590c7d5fc12a5c05ed11a45fd6561dda
    Author: Libor Peltan <libch...@gmail.com>
    Date:   Wed Oct 17 10:08:53 2018 +0200
    
        ixfr/axfr: failover even when the ixfr reply wasn't valid
        
        before: ixfr falls back to axfr only if valid, but negative reply, e.g. 
NOTAUTH, journal inconsistency..
        after: ixfr falls back to axfr in any case ixfr fails, e.g. malformed 
reply packet

This is the commit that introduced the while statement for the fallback in
try_refresh. It looked like so back then:

    // while loop runs 0x or 1x; IXFR to AXFR failover
    while ((ret = knot_requestor_exec(&requestor, req, timeout)) != KNOT_EOK &&
           data.xfr_type == XFR_TYPE_IXFR) {

Much simpler, if knot_requestor_exec returns a failure and xfr_type is
_IXFR the fallback to AXFR happens. Now what I don't understand is that
from the commit message this fix sounds just like my issue. Dnsmasq sends
back what knot considers an invalid reply so why didn't this fix my problem
too? Weird.

Looking further I find

    commit 5c1a5e28797f9c2eab1300247c052451ccd3be3e
    Author: Libor Peltan <libor.pel...@nic.cz>
    Date:   Tue Jan 31 18:07:28 2017 +0100
    
        XFR: proper handling of single-SOA first response message

which introduced determine_xfr_type and the XFR_TYPE_ERROR distinction. If
I build the commit just before that one

    a8237b1f8 xfr: separated functions consuming one rrset of [ai]xfr response

and setup a config to reproduce my setup, then XFR against dnsmasq actually
starts to work. Back at 5c1a5e287 things are broken with the "malformed
SOA" message again. So we likely found the culprit.

Ok so back at the bad commit 5c1a5e287

      // IXFR to AXFR failover
    - if (data->is_ixfr && next == KNOT_STATE_FAIL) {
    + if (data->xfr_type == XFR_TYPE_IXFR && next == KNOT_STATE_FAIL) {

This is what introduced the `xfr_type == _IXFR` check which later got moved
from this if statement in transfer_consume() to the while loop in
try_refresh() by 39e447e45. The check used to be just for is_ixfr which got
initialized in refresh_begin to start with (in the good commit 5c1a5e287~1):

    if (data->soa) {
        data->state = STATE_SOA_QUERY;
        data->is_ixfr = true;
    } else {

in bad commit 5c1a5e287:
    if (data->soa) {
        data->state = STATE_SOA_QUERY;
        data->xfr_type = XFR_TYPE_IXFR;
        data->initial_soa_copy = NULL;
    } else {

and then got reset to false by the fallback code.

So I'm pretty convinced at this point that 5c1a5e287 just messed up that
logic and the specificity of the XFR_TYPE_IXFR check really is unintended.

To fix this we simply replace that check with a counter that ensures the
loop will run at most twice like so:

--- a/src/knot/events/handlers/refresh.c
+++ b/src/knot/events/handlers/refresh.c
@@ -1188,12 +1188,12 @@ static int try_refresh(conf_t *conf, zone_t *zone, 
const conf_remote_t *master,
 
        int timeout = conf->cache.srv_tcp_remote_io_timeout;
 
-       int ret;
+       int ret, i;
 
        // while loop runs 0x or 1x; IXFR to AXFR failover
        while (ret = knot_requestor_exec(&requestor, req, timeout),
               ret = (data.ret == KNOT_EOK ? ret : data.ret),
-              ixfr_error_failover(ret) && data.xfr_type == XFR_TYPE_IXFR &&
+              ixfr_error_failover(ret) && i++ <= 1 &&
               data.state != STATE_SOA_QUERY) {
                REFRESH_LOG(LOG_WARNING, data.zone->name, data.remote,
                            "fallback to AXFR (%s)", knot_strerror(ret));
---

Thoughts? Am I missing something this check is doing?

--Daniel
-- 
https://lists.nic.cz/mailman/listinfo/knot-dns-users

[knot-dns-users] Fixing broken AXFR zone transfer with knot slave and dnsmasq as master

Reply via email to