[Pdns-users] Possible AXFR Race Condition

p8x Tue, 22 Feb 2011 01:23:20 -0800

Hi all,

I am seeing a strange issue with a large amount of zone transfers that
are happening in a short amount of time.


I am running 2 pdns servers, one as a super master and one as a super
slave. Both servers are only configured to use the Bind backend. When I
have added a large amount of zones to the master and the notify messages
get sent out to the slave (which is expected) to notify the slave of the
new zones. When there are a large amount of notify messages the slave
server seems to get the same zone inserted multiple times to the slave
configuration due to what appears to be a race as the master resends the
notify if it doesn't get an acknowledgement back.

On the master, for the notify for a domain the following messages are
logged:

Feb 22 16:08:16 dns1 pdns[17432]: Notification request to host
192.168.1.1 for domain 'test.com' received
Feb 22 16:08:25 dns1 pdns[17432]: Removed from notification list:
'test.com' to 192.168.1.1 (was acknowledged)
Feb 22 16:08:27 dns1 pdns[17432]: Approved AXFR of 'test.com' from
recently notified slave 192.168.1.1
Feb 22 16:08:27 dns1 pdns[17432]: AXFR of domain 'test.com' initiated by
192.168.1.1
Feb 22 16:08:27 dns1 pdns[17432]: AXFR of domain 'test.com' to
192.168.1.1 finished
Feb 22 16:08:35 dns1 pdns[17432]: Approved AXFR of 'test.com' from
recently notified slave 192.168.1.1
Feb 22 16:08:35 dns1 pdns[17432]: AXFR of domain 'test.com' initiated by
192.168.1.1
Feb 22 16:08:35 dns1 pdns[17432]: AXFR of domain 'test.com' to
192.168.1.1 finished
Feb 22 16:08:43 dns1 pdns[17432]: Approved AXFR of 'test.com' from
recently notified slave 192.168.1.1
Feb 22 16:08:43 dns1 pdns[17432]: AXFR of domain 'test.com' initiated by
192.168.1.1
Feb 22 16:08:43 dns1 pdns[17432]: AXFR of domain 'test.com' to
192.168.1.1 finished

On the slave server, the following messages are logged:

Feb 22 16:08:16 dns2 pdns[24071]: Received NOTIFY for test.com from
192.168.1.10 for which we are not authoritative
Feb 22 16:08:16 dns2 pdns[24071]: [bindbackend] Writing bind config zone
statement for superslave zone 'test.com' from supermaster 192.168.1.10
Feb 22 16:08:16 dns2 pdns[24071]: Created new slave zone 'test.com' from
supermaster 192.168.1.10, queued axfr
Feb 22 16:08:20 dns2 pdns[24071]: Received NOTIFY for test.com from
192.168.1.10 for which we are not authoritative
Feb 22 16:08:20 dns2 pdns[24071]: [bindbackend] Writing bind config zone
statement for superslave zone 'test.com' from supermaster 192.168.1.10
Feb 22 16:08:20 dns2 pdns[24071]: Created new slave zone 'test.com' from
supermaster 192.168.1.10, queued axfr
Feb 22 16:08:25 dns2 pdns[24071]: Received NOTIFY for test.com from
192.168.1.10 for which we are not authoritative
Feb 22 16:08:25 dns2 pdns[24071]: [bindbackend] Writing bind config zone
statement for superslave zone 'test.com' from supermaster 192.168.1.10
Feb 22 16:08:25 dns2 pdns[24071]: Created new slave zone 'test.com' from
supermaster 192.168.1.10, queued axfr
Feb 22 16:08:27 dns2 pdns[24071]: Initiating transfer of 'test.com' from
remote '192.168.1.10'
Feb 22 16:08:27 dns2 pdns[24071]: Can't determine backend for domain
'test.com'
Feb 22 16:08:35 dns2 pdns[24071]: Initiating transfer of 'test.com' from
remote '192.168.1.10'
Feb 22 16:08:35 dns2 pdns[24071]: Can't determine backend for domain
'test.com'
Feb 22 16:08:43 dns2 pdns[24071]: Initiating transfer of 'test.com' from
remote '192.168.1.10'
Feb 22 16:08:43 dns2 pdns[24071]: Can't determine backend for domain
'test.com'

The slave configuration file ends up with these comments for the zone
that is inserted multiple times, which seems to match up with when the
notify messages were received:

# Superslave zone test.com (added: Tue Feb 22 16:08:16 2011) (account: )
# Superslave zone test.com (added: Tue Feb 22 16:08:20 2011) (account: )
# Superslave zone test.com (added: Tue Feb 22 16:08:25 2011) (account: )

I am not 100% sure about this, but the way I read the logs is that the
notification request is sent from the master at 16:08:16 and the slaves
gets it straight away and starts writing out the config. As the
acknowledgement doesn't arrive until a few seconds later (16:08:25), in
that 9 second gap there is a further two notifications sent as there was
no reply in some sort of time period.

I also noticed that the zone file doesn't actually get written out on
the slave server either. Running a 'pdns_control rediscover' or
'pdns_control bind-reload-now test.com' doesn't seem to get it to write
it out either (the zone is in the slave configuration file). The
bind-reload-now command gives an error saying that the zone file doesn't
exist, so it also shows up in the reject domains. A restart of the pdns
server process seems to correct the issue - the zone transfer seems to
actually write out the zone file and DNS queries for the zone then seem
to work on the secondary. From the secondary, after a server restart I
can see the following relevant message in the log:

Feb 22 16:54:19 dns2 pdns[30258]: Received valid NOTIFY for test.com
(id=1369) from master 192.168.1.10: 2011020102 > 0

After the above message, the zone file is written out.

I also see this message printed every minute or so (the SOA records give
a slave refresh time of 24H):

Feb 22 17:00:52 dns2 pdns[30540]: Domain test.com is stale, master
serial 2011020102, our serial 0

After the zone file is written out, DNS queries for that zone on the
secondary also seem to fail (with dig shows recursion requested but
denied) but a restart of the process again after the zone file is
written out seems to fix this too.

These issues seem to happen when I have a large number of zones (each
zone has about 10-15 records all up and there are approx 77,000 zones)
and at the time there are a lot of notifications flying around the
place. I can reproduce the issue by clearing the slave zone file and
zones (so that the secondary has no DNS zones configured) and then
sending more than 1 notify request to the slave in quick succession
using 'pdns_control notify-host' (although above a single notification
is being sent but it is happening at the same time as many other
notifications).

I have tried a few versions of pdns (2.9.22-3 (Ubuntu), 2.9.22-8
(Debian) as well as the latest snapshot for version 3 (compiled)).

The configuration for pdns is as follows for the slave:

allow-recursion=127.0.0.1
bind-check-interval=60
bind-config=/etc/powerdns/slave.conf
bind-supermaster-config=/etc/powerdns/slave.conf
bind-supermaster-destdir=/etc/powerdns/zones
bind-supermasters=/etc/powerdns/masters.conf
cache-ttl=20
config-dir=/etc/powerdns
daemon=yes
disable-tcp=no
guardian=yes
launch=bind
lazy-recursion=yes
local-address=192.168.1.1
local-port=53
master=no
max-tcp-connections=30
module-dir=/usr/lib/powerdns
query-local-address=192.168.1.1
setgid=pdns
setuid=pdns
slave=yes
socket-dir=/var/run

I am happy to test anything if anyone has any suggestions to what could
be causing the zones to be doubled up.

Thanks


_______________________________________________
Pdns-users mailing list
[email protected]
http://mailman.powerdns.com/mailman/listinfo/pdns-users

[Pdns-users] Possible AXFR Race Condition

Reply via email to