Hello all,

We're having some odd intermittent problems with our recursors, and I'm not sure whether I should be concerned about them. It seems that intermittently, when we query our recursors for a name that is a CNAME record, we're not getting a proper response. I am going to be detailed about the problem, so this will be a long message, and I apologize in advance for that. However, I've about reached my wits' end trying to diagnose this issue.

The problem began when we started getting reports from our clients that their CSS files were intermittently not loading. CSS files are stored with static images on the Edgecast and Level 3 CDN systems, and troubleshooting the chain led us to run a bunch of DNS tests, which is where things started getting suspicious.

We're running 6 recursors, all behind a Foundry load balancer, with virtual IPs funneling traffic from on-site machines to the recursors. All recursors are running the x86_64 RPM of pdns-recursor 3.3 downloaded directly from the web site, and the OS is CentOS 5.x. Until now, we haven't seen any issues with this setup, and it's been in production for over 3 years.

Edgecast/Level3 have us set up the CDN by creating a CNAME record which points at their systems, e.g.:

 cdn.domain.com 43200 IN CNAME wpc.1737.edgecastcdn.net.

As part of our troubleshooting, we set up a number of checks in our Nagios monitoring software to monitor the resolution of these entries. Using the Nagios "check_dig" plugin, we run resolution checks against all 6 of our DNS servers once per minute. Essentially, we have the plugin running these commands every minute:

 dig @{nameserver-ip} any cdn.domain.com
 dig @{nameserver-ip} a cdn.domain.com
 dig @{nameserver-ip} cname cdn.domain.com

With these tests in place and firing off every minute, we see intermittent failures (No ANSWER SECTION found) when querying our recursors for A or ANY, never for CNAME. When a check fails, the next check one minute later passes. We also have a couple of machines that run their own BIND caching nameserver, and the same tests against them show no issues. As a further test, we set up a dummy record with a CNAME to a host on a totally separate, lightly used authoritative server, and those tests have never shown failures either.
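
To tell a SERVFAIL apart from a NOERROR reply that simply has an empty answer section, the same three queries could be run outside Nagios with the response status logged. What follows is only a rough sketch, not what check_dig does internally: the function names classify_response and check_once are made up for illustration, the 192.0.2.x addresses are placeholders for the real recursor IPs, and it assumes the usual dig header layout (a "->>HEADER<<-" line carrying "status:" and a "flags:" line carrying "ANSWER:").

```shell
#!/bin/sh
# Sketch only: helper names and addresses are illustrative assumptions.

# classify_response: read full dig output on stdin and print
# "<status> <answer_count>", e.g. "SERVFAIL 0" or "NOERROR 2".
# Prints "NORESPONSE 0" if no header was seen (e.g. a timeout).
classify_response() {
    awk '
        index($0, "->>HEADER<<-") {
            for (i = 1; i < NF; i++)
                if ($i == "status:") { status = $(i + 1); sub(/,$/, "", status) }
        }
        /^;; flags:/ {
            for (i = 1; i < NF; i++)
                if ($i == "ANSWER:") { answers = $(i + 1); sub(/,$/, "", answers) }
        }
        END {
            if (status == "") status = "NORESPONSE"
            if (answers == "") answers = 0
            print status, answers
        }
    '
}

# check_once SERVER TYPE NAME: run one query and log a timestamped line
# only when the reply is not NOERROR or carries no answer records.
check_once() {
    result=$(dig +tries=1 +time=5 "@$1" "$2" "$3" 2>/dev/null | classify_response)
    status=${result% *}
    answers=${result#* }
    if [ "$status" != "NOERROR" ] || [ "$answers" -eq 0 ]; then
        echo "$(date '+%Y-%m-%dT%H:%M:%S') server=$1 type=$2 name=$3 status=$status answers=$answers"
    fi
}

# Example use (placeholder addresses, run once per minute from cron or a loop):
#   for ns in 192.0.2.1 192.0.2.2; do
#       for t in any a cname; do
#           check_once "$ns" "$t" cdn.domain.com
#       done
#   done
```

A logged status of SERVFAIL would fit the theory of an upstream timeout, while NOERROR with answers=0 would point at something else, such as a cached empty response.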

The failures appear to be totally random - you might see 2 or 3 failures within 15 minutes, and then you might not see another failure for over an hour.

The syslogs for the recursors also show nothing out of the ordinary.

Right now, I am working under the theory that occasionally the recursor does not get a timely response from the Edgecast/Level3 authoritative servers, and therefore fails. However, it does seem odd that I wouldn't see the problem with our standalone BIND servers. One other thing I have done for testing is to disable load-balanced traffic to one of our 6 nameservers and turn on the recursor's trace mode on that nameserver. However, even with only a few checks every minute addressed to it, piecing together the trace logs is still not easy.

Does anyone else have any thoughts on this?

Thanks for any assistance you can give me!

Jeremy Utley

_______________________________________________
Pdns-users mailing list
[email protected]
http://mailman.powerdns.com/mailman/listinfo/pdns-users