Hi all, I am seeing a strange issue with a large amount of zone transfers that are happening in a short amount of time.
I am running 2 pdns servers, one as a super master and one as a super slave. Both servers are only configured to use the Bind backend. When I have added a large amount of zones to the master and the notify messages get sent out to the slave (which is expected) to notify the slave of the new zones. When there are a large amount of notify messages the slave server seems to get the same zone inserted multiple times to the slave configuration due to what appears to be a race as the master resends the notify if it doesn't get an acknowledgement back. On the master, for the notify for a domain the following messages are logged: Feb 22 16:08:16 dns1 pdns[17432]: Notification request to host 192.168.1.1 for domain 'test.com' received Feb 22 16:08:25 dns1 pdns[17432]: Removed from notification list: 'test.com' to 192.168.1.1 (was acknowledged) Feb 22 16:08:27 dns1 pdns[17432]: Approved AXFR of 'test.com' from recently notified slave 192.168.1.1 Feb 22 16:08:27 dns1 pdns[17432]: AXFR of domain 'test.com' initiated by 192.168.1.1 Feb 22 16:08:27 dns1 pdns[17432]: AXFR of domain 'test.com' to 192.168.1.1 finished Feb 22 16:08:35 dns1 pdns[17432]: Approved AXFR of 'test.com' from recently notified slave 192.168.1.1 Feb 22 16:08:35 dns1 pdns[17432]: AXFR of domain 'test.com' initiated by 192.168.1.1 Feb 22 16:08:35 dns1 pdns[17432]: AXFR of domain 'test.com' to 192.168.1.1 finished Feb 22 16:08:43 dns1 pdns[17432]: Approved AXFR of 'test.com' from recently notified slave 192.168.1.1 Feb 22 16:08:43 dns1 pdns[17432]: AXFR of domain 'test.com' initiated by 192.168.1.1 Feb 22 16:08:43 dns1 pdns[17432]: AXFR of domain 'test.com' to 192.168.1.1 finished On the slave server, the following messages are logged: Feb 22 16:08:16 dns2 pdns[24071]: Received NOTIFY for test.com from 192.168.1.10 for which we are not authoritative Feb 22 16:08:16 dns2 pdns[24071]: [bindbackend] Writing bind config zone statement for superslave zone 'test.com' from supermaster 192.168.1.10 Feb 22 16:08:16 dns2 pdns[24071]: Created new slave zone 'test.com' from supermaster 192.168.1.10, queued axfr Feb 22 16:08:20 dns2 pdns[24071]: Received NOTIFY for test.com from 192.168.1.10 for which we are not authoritative Feb 22 16:08:20 dns2 pdns[24071]: [bindbackend] Writing bind config zone statement for superslave zone 'test.com' from supermaster 192.168.1.10 Feb 22 16:08:20 dns2 pdns[24071]: Created new slave zone 'test.com' from supermaster 192.168.1.10, queued axfr Feb 22 16:08:25 dns2 pdns[24071]: Received NOTIFY for test.com from 192.168.1.10 for which we are not authoritative Feb 22 16:08:25 dns2 pdns[24071]: [bindbackend] Writing bind config zone statement for superslave zone 'test.com' from supermaster 192.168.1.10 Feb 22 16:08:25 dns2 pdns[24071]: Created new slave zone 'test.com' from supermaster 192.168.1.10, queued axfr Feb 22 16:08:27 dns2 pdns[24071]: Initiating transfer of 'test.com' from remote '192.168.1.10' Feb 22 16:08:27 dns2 pdns[24071]: Can't determine backend for domain 'test.com' Feb 22 16:08:35 dns2 pdns[24071]: Initiating transfer of 'test.com' from remote '192.168.1.10' Feb 22 16:08:35 dns2 pdns[24071]: Can't determine backend for domain 'test.com' Feb 22 16:08:43 dns2 pdns[24071]: Initiating transfer of 'test.com' from remote '192.168.1.10' Feb 22 16:08:43 dns2 pdns[24071]: Can't determine backend for domain 'test.com' The slave configuration file ends up with these comments for the zone that is inserted multiple times, which seems to match up with when the notify messages were received: # Superslave zone test.com (added: Tue Feb 22 16:08:16 2011) (account: ) # Superslave zone test.com (added: Tue Feb 22 16:08:20 2011) (account: ) # Superslave zone test.com (added: Tue Feb 22 16:08:25 2011) (account: ) I am not 100% sure about this, but the way I read the logs is that the notification request is sent from the master at 16:08:16 and the slaves gets it straight away and starts writing out the config. As the acknowledgement doesn't arrive until a few seconds later (16:08:25), in that 9 second gap there is a further two notifications sent as there was no reply in some sort of time period. I also noticed that the zone file doesn't actually get written out on the slave server either. Running a 'pdns_control rediscover' or 'pdns_control bind-reload-now test.com' doesn't seem to get it to write it out either (the zone is in the slave configuration file). The bind-reload-now command gives an error saying that the zone file doesn't exist, so it also shows up in the reject domains. A restart of the pdns server process seems to correct the issue - the zone transfer seems to actually write out the zone file and DNS queries for the zone then seem to work on the secondary. From the secondary, after a server restart I can see the following relevant message in the log: Feb 22 16:54:19 dns2 pdns[30258]: Received valid NOTIFY for test.com (id=1369) from master 192.168.1.10: 2011020102 > 0 After the above message, the zone file is written out. I also see this message printed every minute or so (the SOA records give a slave refresh time of 24H): Feb 22 17:00:52 dns2 pdns[30540]: Domain test.com is stale, master serial 2011020102, our serial 0 After the zone file is written out, DNS queries for that zone on the secondary also seem to fail (with dig shows recursion requested but denied) but a restart of the process again after the zone file is written out seems to fix this too. These issues seem to happen when I have a large number of zones (each zone has about 10-15 records all up and there are approx 77,000 zones) and at the time there are a lot of notifications flying around the place. I can reproduce the issue by clearing the slave zone file and zones (so that the secondary has no DNS zones configured) and then sending more than 1 notify request to the slave in quick succession using 'pdns_control notify-host' (although above a single notification is being sent but it is happening at the same time as many other notifications). I have tried a few versions of pdns (2.9.22-3 (Ubuntu), 2.9.22-8 (Debian) as well as the latest snapshot for version 3 (compiled)). The configuration for pdns is as follows for the slave: allow-recursion=127.0.0.1 bind-check-interval=60 bind-config=/etc/powerdns/slave.conf bind-supermaster-config=/etc/powerdns/slave.conf bind-supermaster-destdir=/etc/powerdns/zones bind-supermasters=/etc/powerdns/masters.conf cache-ttl=20 config-dir=/etc/powerdns daemon=yes disable-tcp=no guardian=yes launch=bind lazy-recursion=yes local-address=192.168.1.1 local-port=53 master=no max-tcp-connections=30 module-dir=/usr/lib/powerdns query-local-address=192.168.1.1 setgid=pdns setuid=pdns slave=yes socket-dir=/var/run I am happy to test anything if anyone has any suggestions to what could be causing the zones to be doubled up. Thanks _______________________________________________ Pdns-users mailing list [email protected] http://mailman.powerdns.com/mailman/listinfo/pdns-users
