<sigh> It turns out I made a couple of stupid mistakes that worked together to make me look like a fool. :-)

When first testing this problem, I was using dig @[my-server-ip], but after updating my recursor to 3.1.2, I switched to dig @localhost. However, I only tested the problem domains after the 3.1.2 update, and didn't test any "known good" domains (like google.com). If I had tested the "known good" domains @localhost, I would have discovered (earlier than this afternoon, and probably before sending out my first description of the "problem") that I had neglected to include 127.0.0.1 in the "allow-recursion" setting of pdns.conf, and that ALL recursive queries sent to @localhost would fail. Once I fixed this oversight, things began to work as expected.
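For anyone who trips over the same thing, here is a rough Python sketch of the kind of check involved. It is not pdns code, just an illustration that assumes the setting is a comma-separated list of addresses and netmasks; the function name and the example values are made up and are not from my real pdns.conf.

```python
import ipaddress


def recursion_allowed(client_ip, allow_recursion):
    """Return True if client_ip falls inside any entry of a
    comma-separated list of addresses/netmasks (hypothetical helper)."""
    client = ipaddress.ip_address(client_ip)
    for entry in allow_recursion.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # A bare address such as 127.0.0.1 is treated as a single-host network.
        network = ipaddress.ip_network(entry, strict=False)
        if client.version == network.version and client in network:
            return True
    return False


# Made-up example values: the list before and after adding localhost.
before = "192.0.2.0/24"
after = "192.0.2.0/24,127.0.0.1"

print(recursion_allowed("127.0.0.1", before))  # False
print(recursion_allowed("127.0.0.1", after))   # True
```

With the "before" list, a query arriving from 127.0.0.1 is outside every allowed netmask, which matches what I saw: dig @localhost failed for every recursive lookup, not just the two problem domains.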
My apologies for wasting everyone's time here, and thanks to Bert and Darren for their help. My only defense is that I am recently back from a vacation, and can only assume that my brain was left behind somewhere. :-)

Kirk

-----Original Message-----
From: Darren Gamble [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 23, 2006 9:55 AM
To: Kirk Friggstad; [email protected]
Subject: RE: [Pdns-users] RE: Recursion failing on certain records?

Hi Kirk,

> I think I maybe went a little too complicated on my explanation here, and
> missed a couple semi-crucial details. We are *not* the authoritative
> servers for the two domains that are giving problems (acegroup.cc and
> hivelocity.net).

No, I understood this. I had simply pointed out that when you said you queried the "authoritative server", you were actually querying your cache. Note the "@localhost" in your query. You should have queried for the NS records for the domain, and directed your queries there.

> Darren - you said that there was something in the configuration of these
> two domains that would fail in pre-3.1.2 versions of the recursor. Can you
> give a bit more detail on that?

The whole explanation is a bit complicated, but basically, because the SLD servers have different NS records with different TTLs than the authoritative servers, older pdns servers will mash the two record sets together into a single record set under one name, with differing TTLs. This is verboten by RFC, and causes a problem: the record set changes as the names with lower TTLs expire. If this leaves you with only NS record(s) that don't respond, which is the issue here, then the recursor is not able to look up names on that domain anymore. This is also why the problem is intermittent: when all of the records expire, the cache goes back up to the SLD servers and the process starts over again.

The zone is definitely not configured as it should be, but it should still work. They shouldn't have registered servers that aren't responding.
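To make the TTL problem concrete, here is a rough Python sketch. It is not the actual pdns code, only a toy model of the two cache behaviours; the server names and TTLs are invented, and ns-dead stands in for a registered server that does not answer queries.

```python
def merge_cache(parent_ns, child_ns):
    """Toy model of the older behaviour: both NS sets are combined
    under one name, and each record keeps its own TTL."""
    merged = dict(parent_ns)
    merged.update(child_ns)
    return merged


def replace_cache(parent_ns, child_ns):
    """Toy model of the alternative: the authoritative server's NS set
    replaces the one learned from the parent (SLD) servers."""
    return dict(child_ns)


def live_at(cache, t):
    """Names whose TTL has not yet run out t seconds after caching."""
    return sorted(name for name, ttl in cache.items() if ttl > t)


# Invented data: the parent lists a dead server with a long TTL,
# while the zone itself lists two working servers with a short TTL.
parent = {"ns1.example.net.": 172800, "ns-dead.example.net.": 172800}
child = {"ns1.example.net.": 3600, "ns2.example.net.": 3600}

merged = merge_cache(parent, child)
print(live_at(merged, 0))     # all three names are cached
print(live_at(merged, 4000))  # ['ns-dead.example.net.']
print(live_at(replace_cache(parent, child), 0))
```

In the merged case, once the short TTLs run out the only record left points at the server that never answers, so lookups fail until that record expires too and the cycle restarts. In the replaced case the set expires as a whole, so the cache never ends up holding only the dead server.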
In 3.1.2+, pdns will just replace the record set with the authoritative server's. This fixes the problem, and is consistent with other caching software.

Again, if you were to post the NS records for the domain on each server when you have this problem, that would tell you for certain whether this is the case. This should be Step 1 of troubleshooting any DNS issue like this.

> My question is - is there something in the code for pdns_server (2.9.20
> and 2.9.21 snapshot at least),

This zone, and many others like it, should resolve properly if you upgrade all of your caches to at least 3.1.2. That is what you need to do. 3.1.3 should be released soon as well, and contains some important crash fixes.

============================
Darren Gamble
Planner, Regional Services
Shaw Cablesystems GP
630 - 3rd Avenue SW
Calgary, Alberta, Canada
T2P 4L4
(403) 781-4948

> -----Original Message-----
> From: Kirk Friggstad [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 22, 2006 11:52 AM
> To: '[email protected]'
> Subject: Recursion failing on certain records?
>
> Greetings all:
>
> I've been puzzling through some strangeness in our PowerDNS installations
> here. Recursive queries for certain records/domains have been failing
> consistently for a number of weeks - two examples are:
> mail.acegroup.cc
> mail.hivelocity.net
>
> If I query the authoritative server, I get a SERVFAIL:
> $ dig @localhost mail.hivelocity.net
> ; <<>> DiG 9.2.4 <<>> @localhost mail.hivelocity.net
> ; (1 server found)
> ;; global options: printcmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49833
> ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
>
> ;; QUESTION SECTION:
> ;mail.hivelocity.net. IN A
>
> ;; Query time: 1 msec
> ;; SERVER: 127.0.0.1#53(127.0.0.1)
> ;; WHEN: Tue Aug 22 11:10:21 2006
> ;; MSG SIZE rcvd: 37
>
> but if I query the 3.1.2 recursor directly, I get the correct answer:
> $ dig @localhost -p 4754 mail.hivelocity.net
> ; <<>> DiG 9.2.4 <<>> @localhost -p 4754 mail.hivelocity.net
> ; (1 server found)
> ;; global options: printcmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1932
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
>
> ;; QUESTION SECTION:
> ;mail.hivelocity.net. IN A
>
> ;; ANSWER SECTION:
> mail.hivelocity.net. 300 IN A 66.96.80.16
>
> ;; Query time: 206 msec
> ;; SERVER: 127.0.0.1#4754(127.0.0.1)
> ;; WHEN: Tue Aug 22 11:10:04 2006
> ;; MSG SIZE rcvd: 53
>
> Querying a 2.9.20 recursor directly returns a SERVFAIL.
>
> Recursive queries for most other domains return correct answers - these
> two domains (acegroup.cc and hivelocity.net) are the only ones that I've
> come across that exhibit this behavior. Lookups for those two domains from
> http://dnsstuff.com/ appear normal as well.
>
> I can reproduce this on the following systems:
> System 1 - RHEL 3, pdns_server 2.9.20 (static RPM from powerdns.com)
> recursing to pdns_recursor 3.1.2 (generic RPM from powerdns.com)
> System 2 - RHEL 3, pdns_server 2.9.20 (static RPM from powerdns.com)
> recursing to pdns_recursor 2.9.20 (build from source, gcc 4.0.2)
>
> Both systems have identical configuration files (except for IP address
> binding), using the bind backend, and do not appear to exhibit any
> problems with authoritative queries, only recursive.
>
> Anyone have any suggestions as to what is happening here? Could there be a
> bug somewhere in the recursion routines of pdns_server? Am I making some
> completely stupid mistake somewhere? I'm out of answers - any help would
> be greatly appreciated.
>
> Thanks
>
> Kirk
>
> _______________________________________________
> Pdns-users mailing list
> [email protected]
> http://mailman.powerdns.com/mailman/listinfo/pdns-users
