We had some problems this morning when one of our zone masters was offline for a planned outage. Since we have redundant masters and this was not even the primary master, I was not concerned and decided to just let the DNS take care of itself. The whole point of multiple masters is to not rely on having all of them available for things to hum along.
But things didn't quite hum along. I noticed that dynamic updates to one zone were not finding their way to the slave servers. The updates got passed to the primary master, which updated its copy of the zone, incremented the SOA serial, and sent NOTIFY messages that the slave acknowledged. But the slave would not actually transfer the zone. Running "rndc status" showed two to three hundred transfers listed as "xfers deferred." Sniffing the network, I could see the slave trying to reach the unreachable master while communicating easily with the available one.

It appears I had built up such a backlog of transfers that the new ones triggered by the updates and NOTIFYs were being placed at the end of the queue. Even though one master was still available, every zone check tried both masters and had to wait through a UDP and a TCP timeout on the down master before giving up. This was taking f-o-r-e-v-e-r. We have quite a few zones, 301, but not "a lot" by many standards.

Is that how things are supposed to work? That doesn't seem like a very robust scheme for handling the possibility of a down master. Or have we misconfigured something? BTW, while the other server was still offline I was playing with temporarily adding "transfers-in" and "transfers-per-ns" statements to speed things up. Once it came back online, things cleared up very quickly.
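For anyone curious, the knobs I was experimenting with go in named.conf on the slave. This is only an illustrative sketch, not our real config; the zone name, addresses, and values are made up, and the defaults noted in the comments are BIND 9's:

```
// Hypothetical named.conf fragment on a BIND 9 slave.
// The defaults are transfers-in 10 and transfers-per-ns 2,
// which is where a large pending-transfer queue backs up.
options {
    transfers-in 40;       // more concurrent inbound zone transfers overall
    transfers-per-ns 10;   // more concurrent transfers from any one master
};

// Per zone, BIND tries the masters in the order listed, so putting
// the reachable master first can also help during an outage:
zone "example.com" {
    type slave;
    file "slaves/example.com";
    masters { 192.0.2.1; 192.0.2.2; };  // available master listed first
};
```

Raising these limits just lets more transfers run in parallel, so fewer queued zones sit behind ones stuck waiting on timeouts against the down master.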
