Re: Clients get DNS timeouts because ipv6 means more queries for each lookup
No. The fix is to correct the nameservers. They are not correctly following the DNS protocol, and everything else is fallout from that.

> Well, all the prodding from people here prompted me to investigate further exactly what's going on. The problem isn't what I thought it was. It appears to be a bug in glibc, and I've filed a bug report and found a workaround.

There is no bug in glibc.

> In a nutshell, the getaddrinfo function in glibc sends both A and AAAA queries to the DNS server at the same time and then deals with the responses as they come in. Unfortunately, if the responses to the two queries come back in reverse order, /and/ the first one to come back is a server failure -- both of which are the case when you try to resolve en.wikipedia.org immediately after restarting your DNS server, so nothing is cached -- the glibc code screws up and decides it didn't get back a successful response even though it did.

There is *nothing* wrong with sending both queries at once.

> If you do the same lookup again, it works, because the CNAME that was sent in response to the A query is cached, so both the A and AAAA queries get back valid responses from the DNS server. And even if that weren't the case, since the CNAME is cached it gets returned first, since the server doesn't need to do a query to get it, whereas it does need to do another query to get the AAAA record (which, recall, isn't being cached because of the previously discussed FORMERR problem). It'll keep working until the cached records time out, at which point it'll happen again, and then be OK again until the records time out, etc.
>
> The workaround is to put "options single-request" in /etc/resolv.conf to prevent the glibc innards from sending out both the A and AAAA queries at the same time. FYI, here's the glibc bug I filed about this: http://sourceware.org/bugzilla/show_bug.cgi?id=12994
>
> Thank you for telling me I was full of it and making me dig deeper into this until I located the actual cause of the issue. :-) jik

Note your fix won't help clients that only ask for AAAA records, because it is the authoritative servers that are broken, not the resolver library or the recursive server.

Mark
--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742  INTERNET: ma...@isc.org
_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
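For reference, the client-side workaround quoted above is a one-line resolver option, supported by glibc 2.10 and later; the nameserver address shown is a placeholder, not from the thread:

```
# /etc/resolv.conf
# "options single-request" makes glibc send the A and AAAA queries
# sequentially instead of in parallel, avoiding the response-ordering
# problem described above (glibc 2.10+).
options single-request
nameserver 192.0.2.53
```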
Re: Clients get DNS timeouts because ipv6 means more queries for each lookup
On 07/13/2011 02:13 AM, Mark Andrews wrote:
> No. The fix is to correct the nameservers. They are not correctly following the DNS protocol and everything else is fallout from that.

You're right that everything else is fallout from that. But that doesn't do me much good, does it? It's my system that keeps getting bogus name resolution errors. It's my RSS feed reader that keeps failing on an hourly basis when the cached records for en.wikipedia.org expire. It's all very well and good to say that the Wikipedia folks and other people with this problem should fix their nameservers -- I totally agree with that -- but it doesn't help me solve my problem /now/. I'm a real user in the real world with a real problem. Yelling at Wikipedia to fix their DNS servers may feel good, but it doesn't make my DNS work. As far as I and all the other users who are being impacted /now/ by this problem are concerned, it's just pissing into the wind.

>> Well, all the prodding from people here prompted me to investigate further exactly what's going on. The problem isn't what I thought it was. It appears to be a bug in glibc, and I've filed a bug report and found a workaround.
> There is no bug in glibc.

To be blunt, that's bullshit. If glibc makes an A query and an AAAA query, and it gets back a valid response to the A query and an invalid response to the AAAA query, then it should ignore the invalid response to the AAAA query and return the valid A response to the user as the IP address for the host.

Please note, furthermore, that as I explained in detail in my bug report and in my last message, glibc behaves differently based on the /order/ in which the two responses are returned by the DNS server. Since there's nothing that says a DNS server has to respond to two queries in the order in which they were received -- and that would be an impossible requirement to impose in any case, since the queries and responses are sent via UDP, which doesn't guarantee order -- it's perfectly clear that glibc needs to be prepared to function the same regardless of the order in which it receives the responses. What's more, there's plenty of code in the glibc files I spent hours poring over which is clearly an attempt to do exactly that. The people who wrote the code just got it wrong. Which isn't surprising, given how god-awful the code is.

This is not an either/or situation. The broken nameservers should be fixed, /and/ glibc should be fixed to properly handle the case where it sends two queries and gets back one valid response and one server error in reverse order.

>> In a nutshell, the getaddrinfo function in glibc sends both A and AAAA queries to the DNS server at the same time and then deals with the responses as they come in. [...]
> There is *nothing* wrong with sending both queries at once.

I didn't say there was. You really don't seem to be paying very good attention. Do you understand what the word /workaround/ means?

> Note your fix won't help clients that only ask for AAAA records because it is the authoritative servers that are broken, not the resolver library or the recursive server.

I am aware of that. It is irrelevant, because it is not the problem I am trying to solve. I, and 99.99% of the users in the world, are /not/ "only ask[ing] for AAAA records". Nobody actually trying to use the internet for day-to-day work is doing that right now, because to say that IPv6 support is not yet ubiquitous would be a laughably momentous understatement.

You seem to have a really big chip on your shoulder about people who run broken DNS servers. I don't like them any more than you do. But I learned "be generous in what you accept and conservative in what you generate" way back when I started playing with the Internet well over two decades ago. It holds up now as well as it did back then, and there's no good reason why it shouldn't apply in this case.

It's clear that this is a religious issue for you. I'm not here to debate religion; I'm here to get help making my DNS work, and to help other people, to whatever extent I can, make /their/ DNS work. If you continue to send religious screeds on this topic while making no effort to actually read and understand what I write, please do not expect me to respond further.

Jonathan Kamens
monitoring BIND
We have some nameservers :-) that are used by quite a few thousands of people. Every now and then someone comes to us and complains that the DNS is responding slowly. Sometimes they are right, and we find the problem and fix it. But most of the time everything runs fine, and the DNS is not, in fact, responding slowly when that someone comes to complain. It turns out to be their PC, or a local network issue, or whatever.

So we have a homegrown system in place that watches the traffic to and from the nameservers, matches queries to answers, ignores everything else, and notes how long it was between the question going past and the answer going past in the opposite direction. It writes summarised information second by second into a database, so we can see exactly when problems with response times happen, how long they happen for, and how bad they are when they happen.

Our system has two faults (well, two that we are actually concerned about): it only watches UDP, and it can't deal with fragmented packets. So I was wondering if there is a better solution out there?

Regards, K.

--
~~~
Karl Auer (ka...@biplane.com.au) +61-2-64957160 (h)
http://www.biplane.com.au/kauer/ +61-428-957160 (mob)

GPG fingerprint: DA41 51B1 1481 16E1 F7E2 B2E9 3007 14ED 5736 F687
Old fingerprint: B386 7819 B227 2961 8301 C5A9 2EBC 754B CD97 0156
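The matching-and-summarising step such a watcher needs can be sketched in a few lines of Python. This is a toy illustration, not Karl's actual system: the packet tuples are invented, and a real version would feed them from a packet capture (e.g. libpcap) rather than a hard-coded list.

```python
# Match DNS queries to responses by (client address, query ID) and
# summarise the latencies per second of wall-clock time.
from collections import defaultdict

def summarise(packets):
    """packets: iterable of (timestamp, direction, client, query_id).
    direction is 'Q' for a query to the server, 'R' for a response."""
    pending = {}                    # (client, query_id) -> query timestamp
    per_second = defaultdict(list)  # whole second -> list of latencies
    for ts, direction, client, qid in packets:
        key = (client, qid)
        if direction == 'Q':
            pending[key] = ts
        elif key in pending:        # ignore responses with no matching query
            per_second[int(ts)].append(ts - pending.pop(key))
    # (min, max, mean) response time for each second that saw answers
    return {sec: (min(lat), max(lat), sum(lat) / len(lat))
            for sec, lat in per_second.items()}

packets = [
    (10.000, 'Q', '192.0.2.1', 1234),
    (10.002, 'R', '192.0.2.1', 1234),   # answered in ~2 ms
    (10.100, 'Q', '192.0.2.2', 99),
    (10.150, 'R', '192.0.2.2', 99),     # answered in ~50 ms
]
stats = summarise(packets)
print(stats)
```

A real deployment would also need to handle retransmitted queries, TCP, and fragmentation, which is exactly where the homegrown system described above falls short.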
Re: monitoring BIND
Nagios is a very good tool for synthetic transaction monitoring. You put in whatever hosts and host names to resolve, and it does it.

-Ben Croswell

On Jul 13, 2011 11:01 AM, Karl Auer ka...@biplane.com.au wrote:
> [...]
> So I was wondering if there is a better solution out there?
big improvement in BIND9 auth-server startup time
People who operate big authoritative name servers (particularly with large numbers of small zones, e.g., for domain hosting and parking), and have had trouble with slow startup, may find this information useful:

http://www.isc.org/community/blog/201107/major-improvement-bind-9-startup-performance

--
Evan Hunt -- e...@isc.org
Internet Systems Consortium, Inc.
Re: monitoring BIND
Hi Karl,

Have you considered using dig?

-Romskie

On Wed, Jul 13, 2011 at 10:43 PM, Karl Auer ka...@biplane.com.au wrote:
> [...]
> So I was wondering if there is a better solution out there?
Re: monitoring BIND
More info to my question: dig and Nagios have been suggested as possible solutions. dig (and I suspect Nagios, which someone else mentioned) can only test resolution times from one point in the network, or maybe several, and using a very small number of tests.

Our current system watches ALL queries and responses to and from the nameservers and summarises ALL the response times, regardless of where the queries came from. For every second of the day we can say what the average, minimum, maximum, etc. response times were.

We're looking for something that can do that, or something similar...

Regards, K.

--
~~~
Karl Auer (ka...@biplane.com.au) +61-2-64957160 (h)
http://www.biplane.com.au/kauer/ +61-428-957160 (mob)

GPG fingerprint: DA41 51B1 1481 16E1 F7E2 B2E9 3007 14ED 5736 F687
Old fingerprint: B386 7819 B227 2961 8301 C5A9 2EBC 754B CD97 0156
Re: monitoring BIND
You can use dig to get a sample of the response time, and "rndc stats" to get query and nameserver statistics.

On Wed, Jul 13, 2011 at 11:15 PM, Romskie L rslara...@gmail.com wrote:
> Hi Karl, Have you considered using dig?
>
> On Wed, Jul 13, 2011 at 10:43 PM, Karl Auer ka...@biplane.com.au wrote:
>> [...]
>> So I was wondering if there is a better solution out there?
Re: monitoring BIND
On 07/13/2011 03:43 PM, Karl Auer wrote:
> So I was wondering if there is a better solution out there?

People I know speak highly of DSC: http://dns.measurement-factory.com/tools/dsc/index.html
Re: monitoring BIND
Sorry for contributing another non-answer; I just wanted to comment that I have done something very similar once upon a time...

The case was a DNS authority service anycast node with: 2 Internet-facing routers -- 2 load-balancing switches -- big stack of servers.

We had seen degraded performance reported by RIPE NCC's DNSMON, but weren't sure if the problem was Internet routing or inside our nodes -- and if inside our nodes, was it the server, or the load balancer, etc.

We set up traffic capture with tcpdump at strategic points within the node, i.e. between the router and load balancer, between the load balancer and the servers, and on each server. With a good sample of the traffic, say an hour or so, we could then pull the DNSMON raw data for that same time period and match the queries it sent to us (the DNSMON raw data contains the query ID) against what we saw inside our node, and verify that we saw it, answered it, and that the answer made it back out into the Internet. We could also see what path the query and answer took through the node, and where any delays might be.

This very quickly led us to the load balancers as the cause of the delays, and we were able to fix them. We never felt the need to run this on an ongoing basis; once our servers looked green in DNSMON again, we were happy that all was well in our world. We used it for diagnosis, rather than detection as it sounds like you want to do.

dave

On 2011-07-13, at 11:27 AM, Karl Auer wrote:
> [...]
> We're looking for something that can do that, or something similar...
Re: Clients get DNS timeouts because ipv6 means more queries for each lookup
On 7/13/2011 2:35 AM, Jonathan Kamens wrote:
> Since there's nothing that says a DNS server has to respond to two queries in the order in which they were received, and that would be an impossible requirement to impose in any case, since the queries and responses are sent via UDP, which doesn't guarantee order, it's perfectly clear that glibc needs to be prepared to function the same regardless of the order in which it receives the responses.

I agree that the order of the A/AAAA responses shouldn't matter to the result. The whole getaddrinfo() call should fail regardless of whether the failure is seen first or the valid response is seen first. Why? Because getaddrinfo() should, if it isn't already, be using the RFC 3484 algorithm (and/or whatever the successor to RFC 3484 ends up being) to sort the addresses, and for that algorithm to work, one needs *both* the IPv4 address(es) *and* the IPv6 address(es) available, in order to compare their scopes, prefixes, etc.

If one of the lookups fails, and this failure is presented to the RFC 3484 algorithm as NODATA for a particular address family, then the algorithm could make a bad selection of the destination address, and this can lead to other sorts of breakage, e.g. trying to use a tunneled connection where no tunnel exists.

The *safe* thing for glibc to do is to promote the failure of either the A lookup or the AAAA lookup to a general lookup failure, which prompts the user/administrator to find the source of the problem and fix it. It's rarely a good idea to mask undeniable errors as if there were no error at all. It leads to unpredictable behavior and really tough troubleshooting challenges. I think glibc is erring on the side of openness and transparency here, rather than trying to cover up the fact that something is horribly wrong.

> I am aware of that. It is irrelevant, because it is not the problem I am trying to solve. I, and 99.99% of the users in the world, are /not/ "only ask[ing] for AAAA records". Nobody actually trying to use the internet for day-to-day work is doing that right now, because to say that IPv6 support is not yet ubiquitous would be a laughably momentous understatement.

What about clients in a NAT64/DNS64 environment? They could be configured as IPv6-only but normally able to access the IPv4 Internet just fine. Even with your glibc fix in place, though, they'll presumably break if the authoritative nameservers are giving garbage responses to AAAA queries (could someone with practical experience in DNS64 please confirm this?).

Another possibility you're not considering is that the invoking application itself may make independent IPv4-specific and IPv6-specific getaddrinfo() lookups. Why would it do this? Why not? Maybe IPv6 capability is something the user has to buy a separate license for, so the IPv6 part is a slightly separate codepath, added in a later version than the base product, which is IPv4-only. When one of the getaddrinfo() calls returns address records and the other returns garbage, your fix doesn't prevent such an application from doing something unpredictable, possibly catastrophic. So it's really not a general solution to the problem.

- Kevin
Re: Clients get DNS timeouts because ipv6 means more queries for each lookup
On 7/13/2011 1:06 PM, Kevin Darcy wrote:
> [...]
> When one of the getaddrinfo() calls returns address records and the other returns garbage, your fix doesn't prevent such an application from doing something unpredictable, possibly catastrophic. So it's really not a general solution to the problem.

Oh, I should also point out that this brokenness by the wikipedia/wikimedia nameservers *isn't* just specific to AAAA queries, and therefore *isn't* fixable with getaddrinfo() alone. Try doing an MX query of en.wikipedia.org. Or a PTR query. Or any of the other old (yet non-deprecated) query types (e.g. NS, TXT, HINFO). The only QTYPEs that are answered correctly are A, CNAME and (oddly enough) SOA. So they don't even have the excuse of "well, AAAA queries are kinda new, we haven't got around to handling them properly yet". This behavior has failed to conform to the standard for as long as the standard has existed; it's not recent, IPv6-specific breakage.
Re: monitoring BIND
Hello!

You should try collectd (http://collectd.org/) and its BIND plugin (http://collectd.org/wiki/index.php/Plugin:BIND). You can write the collected data to CSV or RRD on the local server, or send it over the network. With RRDtool you can make fancy graphs, and with this CGI (http://haroon.sis.utoronto.ca/rrd/scripts/) you could easily visualize the data.

Regards, János

On 2011-07-13 16:43, Karl Auer wrote:
> [...]
> So I was wondering if there is a better solution out there?
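A rough sketch of the wiring this suggestion implies: the collectd bind plugin polls BIND's HTTP statistics channel, so named.conf needs one enabled. Option names should be checked against the plugin documentation; the address and port are placeholders.

```
# named.conf -- enable the statistics channel (BIND 9.5+)
statistics-channels {
    inet 127.0.0.1 port 8053 allow { 127.0.0.1; };
};

# collectd.conf -- poll that channel with the bind plugin
LoadPlugin bind
<Plugin "bind">
    URL "http://127.0.0.1:8053/"
    QTypes true
    ServerStats true
</Plugin>
```

Note that this yields query and error *counters*, not the per-query response latencies Karl's homegrown system measures, so it complements rather than replaces passive capture.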
RE: Clients get DNS timeouts because ipv6 means more queries for each lookup
I agree that the order of the A/ responses shouldn't matter to the result. The whole getaddrinfo() call should fail regardless of whether the failure is seen first or the valid response is seen first. Why? Because getaddrinfo() should, if it isn't already, be using the RFC 3484 algorithm (and/or whatever the successor to RFC 3484 ends up being) to sort the addresses, and for that algorithm to work, one needs *both* the IPv4 address(es) *and* the IPv6 address(es) available, in order to compare their scopes, prefixes, etc.. RFC 3484 tells you how to sort addresses you've got. If you've only got one address, then bang! It's already sorted for you. You don't need RFC 3484 to tell you how to sort it. I have to say that some of the people on this list seem completely detached from what real users in the real world want their computers to do. If I am trying to connect to a site on the internet, then I want my computer to do its best to try to connect to the site. I don't want it to throw up its hands and say, Oh, I'm sorry, one of my address lookups failed, so I'm not going to let you use the other address lookup, the one that succeeded, because some RFC somewhere could be interpreted as implying that's a bad idea, if I wanted to do so. Please, that's ridiculous. If one of the lookups fails, and this failure is presented to the RFC 3484 algorithm as NODATA for a particular address family, then the algorithm could make a bad selection of the destination address, and this can lead to other sorts of breakage, e.g. trying to use a tunneled connection where no tunnel exists. If the address the client gets doesn't work, then the address doesn't work. How is being unable to connect because the address turned out to not be routable different from being unable to connect because the computer refused to even try? Another possibility you're not considering is that the invoking application itself may make independent IPv4-specific and IPv6-specific getaddrinfo() lookups. 
Why would it do this?

Why not? Maybe IPv6 capability is something the user has to buy a separate license for, so the IPv6 part is a slightly separate code path, added in a later version than the base product, which is IPv4-only. When one of the getaddrinfo() calls returns address records and the other returns garbage, your fix doesn't prevent such an application from doing something unpredictable, possibly catastrophic. So it's really not a general solution to the problem.

I have no idea what you're talking about. If the application makes independent IPv4 and IPv6 getaddrinfo() lookups, then the change I'm proposing to glibc is completely irrelevant and does not impact the existing functionality in any way. The IPv4 lookup will succeed, the IPv6 lookup will fail, and the application is then free to decide what to do.

In summary, getaddrinfo() with AF_UNSPEC has a very clear meaning: "Give me whatever addresses you can." The man page says, and I am quoting: "The value AF_UNSPEC indicates that getaddrinfo() should return socket addresses for any address family (either IPv4 or IPv6, for example) that can be used with node and service." I don't see how the language could be any more clear. To suggest that it's reasonable and correct for it to refuse to return a successfully fetched address is simply ludicrous.

I hope and pray that the people who maintain the glibc code have more common sense about what users want and expect from their software. In the meantime, it's clear that I don't belong on this mailing list, so I'm out of here.

Jonathan Kamens

_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
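For reference, the AF_UNSPEC semantics quoted from the man page above can be sketched in a minimal C example (it looks up "localhost" so it runs without external DNS; a real client would pass the remote hostname):

```c
/* Minimal sketch of getaddrinfo() with AF_UNSPEC: ask for socket
 * addresses from any address family and walk the result list. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int main(void)
{
    struct addrinfo hints, *res, *rp;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      /* "any address family" */
    hints.ai_socktype = SOCK_STREAM;

    int err = getaddrinfo("localhost", "80", &hints, &res);
    if (err != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err));
        return 1;
    }
    /* One list, both families mixed together: the caller never sees
     * which underlying A or AAAA query produced which entry. */
    for (rp = res; rp != NULL; rp = rp->ai_next)
        printf("family: %s\n",
               rp->ai_family == AF_INET  ? "AF_INET"  :
               rp->ai_family == AF_INET6 ? "AF_INET6" : "other");
    freeaddrinfo(res);
    return 0;
}
```

Note that the caller gets a single merged list, which is why the internal handling of one family's failure matters so much here.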
Re: monitoring BIND
On Thu, 14 Jul 2011 01:27:48 +1000, Karl Auer ka...@biplane.com.au wrote:

More info to my question: dig and Nagios have been suggested as possible solutions. dig (and I suspect Nagios, which someone else mentioned) can only test resolution times from one point in the network, or maybe several, and using a very small number of tests. Our current system watches ALL queries and responses to and from the nameservers and summarises ALL the response times, regardless of where the queries came from. For every second of the day we can say what the average, minimum, maximum, etc. response times were. We're looking for something that can do that, or something similar...

Regards, K.

PasTmon can do that from the server side. It listens for network traffic like tcpdump and shovels all of the packet timings into a Postgres database with a nice front end for graphs and analysis. I can't remember if the DNS plugin has filtering for different query types (e.g. A, PTR, etc.), but it can probably be written without too much pain. See http://pastmon.sourceforge.net/

I've used it to solve web app performance problems; it should have no trouble dealing with DNS.

-- Kerry
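For the single-point timing that dig gives you (the "Query time" line), a client-side probe can be sketched in C. This is a hypothetical example, not any of the tools discussed above: it times one getaddrinfo() call, resolving "localhost" so it runs without external DNS; a real probe would target names served by the nameserver under test and loop over many queries.

```c
/* One-point resolution-timing probe: time a single lookup with a
 * monotonic clock, similar in spirit to dig's "Query time" output. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int main(void)
{
    struct addrinfo hints, *res;
    struct timespec t0, t1;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    int err = getaddrinfo("localhost", NULL, &hints, &res);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    if (err == 0) {
        printf("lookup ok, %.3f ms\n", ms);
        freeaddrinfo(res);
    } else {
        printf("lookup failed: %s\n", gai_strerror(err));
    }
    return 0;
}
```

As Karl notes, this only ever measures from the probe's own vantage point; server-side capture (as PasTmon does) is needed to see response times for all clients.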
Re: Allowing resolution of off-server CNAMEs
On Fri, Jul 08, 2011 at 10:26:16AM -0700, Chris Buxton wrote:

On Jul 8, 2011, at 9:11 AM, Joseph S D Yao wrote:

I'd rather that recursion controls only control recursion. And not forwarding - have separate forwarding controls, says I.

Forwarding is a response to a recursive query. For an iterative query, even if you have recursion enabled, the server won't forward the query. Therefore, it is logical that it be controlled with the same settings as recursion. What problem are you trying to solve? A dangling CNAME such as you describe is a normal behavior that caching resolvers are easily able to follow.

Thanks to those who responded. The real problem is not with sub.tld.example, but with otherzone.faraway.example, which works most of the time in most of the world. When it fails, people do an MSW 'nslookup' targeted at my system, and see nothing until I have described to them several times how to get a CNAME record with MSW 'nslookup' and what it means. Yes, not as secure. But less time explaining why.

And I realize I have gotten sloppy about the difference between recursive and iterative - bad me!

--
/*\
** Joe Yao j...@tux.org - Joseph S. D. Yao
\*/
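For reference, the recursion controls under discussion look roughly like this in named.conf. This is a hypothetical sketch; the ACL name is made up, and the addresses are RFC 5737/3330 documentation addresses:

```
// Hypothetical named.conf fragment: recursion restricted to local
// clients.  Forwarding only applies to queries the server is willing
// to recurse on; iterative queries are never forwarded, which is
// Chris's point about controlling both with the same settings.
acl "local-clients" { 127.0.0.1; 192.0.2.0/24; };

options {
    recursion yes;
    allow-recursion { "local-clients"; };
    // forwarders { 198.51.100.53; };
};
```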
Re: Clients get DNS timeouts because ipv6 means more queries for each lookup
On 7/13/2011 2:39 PM, Jonathan Kamens wrote:

I agree that the order of the A/AAAA responses shouldn't matter to the result.

The whole getaddrinfo() call should fail regardless of whether the failure is seen first or the valid response is seen first. Why? Because getaddrinfo() should, if it isn't already, be using the RFC 3484 algorithm (and/or whatever the successor to RFC 3484 ends up being) to sort the addresses, and for that algorithm to work, one needs *both* the IPv4 address(es) *and* the IPv6 address(es) available, in order to compare their scopes, prefixes, etc.

RFC 3484 tells you how to sort the addresses you've got. If you've only got one address, then bang! It's already sorted for you. You don't need RFC 3484 to tell you how to sort it.

No, you've got one address, and one unspecified nameserver failure. Garbage in, garbage out. To say that a nameserver failure is equivalent to NODATA is not only technically incorrect, it leads to all sorts of operational problems in the real world.

I have to say that some of the people on this list seem completely detached from what real users in the real world want their computers to do.

Really? Do you think I'm an academic? Do you think I sit and write Internet Drafts and RFCs all day? No, I'm an implementor. I deal with DNS operational problems and issues all day, every workday. And I can tell you that I don't appreciate library routines making wild-ass assumptions that, in the face of some questionable behavior by a nameserver, maybe, possibly some quantity of addresses that I've acquired from that dodgy nameserver are good enough for my clients to try to connect to. No thanks. If there's a real problem, I want to know about it as clearly and unambiguously as possible. I can't deal effectively with a problem if it's being masked by some library routine doing something weird behind my back.

If I am trying to connect to a site on the internet, then I want my computer to do its best to try to connect to the site.
I don't want it to throw up its hands and say, "Oh, I'm sorry, one of my address lookups failed, so I'm not going to let you use the /other/ address lookup, the one that succeeded, because some RFC somewhere could be interpreted as implying that's a bad idea." Please, that's ridiculous.

No, what's more ridiculous is if users can't get to a site SOME OF THE TIME because someone's DNS is broken, a moronic library routine then routes the traffic some unexpected way, and a whole raft of other variables enter the picture, without anyone realizing or paying attention to the dependencies and interconnectivity that are required to keep the client working. There is a certain threshold of brokenness where the infrastructure has to "throw up its hands", as you put it, and say "nuh uh, not gonna happen", because by trying to work around the problem with insufficient information about the topology, the environment, the dependencies, etc., you're likely to cause more harm than good by making the failure modes way more complex than necessary.

If one of the lookups fails, and this failure is presented to the RFC 3484 algorithm as NODATA for a particular address family, then the algorithm could make a bad selection of the destination address, and this can lead to other sorts of breakage, e.g. trying to use a tunneled connection where no tunnel exists.

If the address the client gets doesn't work, then the address doesn't work. How is being unable to connect because the address turned out not to be routable different from being unable to connect because the computer refused to even try?

Because the failure modes are substantially different, and it could take significant man-hours to determine that the root cause of the problem is actually DNS brokenness rather than something else in the network infrastructure (routers, switches, VPN concentrators, firewalls, IPSes, load balancers, etc.) or in the client or server (OS, application, middleware, etc.).
Have you ever actually troubleshot a difficult connectivity problem in a complex networking environment? Trust me, you want clear symptoms, clear failure modes. Not a bunch of components making dumb assumptions and/or trying to be helpful outside of their defined scope of functionality. That kind of help is like offering a glass of water to a drowning man.

Another possibility you're not considering is that the invoking application itself may make independent IPv4-specific and IPv6-specific getaddrinfo() lookups.

Why would it do this?

Why not? Maybe IPv6 capability is something the user has to buy a separate license for, so the IPv6 part is a slightly separate codepath, added in a later version than the base product, which is IPv4-only. When one of the getaddrinfo() calls returns address records and the other returns garbage, your fix doesn't prevent such an application from doing something unpredictable, possibly catastrophic.
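The two-independent-lookups scenario described above can be sketched in C. This is a hypothetical example (the helper name and the use of "localhost" are made up for illustration): each address family is queried separately, so one family's failure is visible to the application on its own, independent of how an AF_UNSPEC lookup would have merged the results.

```c
/* Sketch of an application issuing two family-specific getaddrinfo()
 * calls instead of one AF_UNSPEC call.  Each lookup succeeds or fails
 * on its own, and the application decides what to do when one fails. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

static int lookup_family(const char *host, int family, const char *label)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = family;         /* AF_INET or AF_INET6 only */
    int err = getaddrinfo(host, NULL, &hints, &res);
    if (err == 0) {
        printf("%s lookup: success\n", label);
        freeaddrinfo(res);
        return 0;
    }
    printf("%s lookup: failed (%s)\n", label, gai_strerror(err));
    return -1;
}

int main(void)
{
    /* "localhost" keeps the sketch self-contained; a real client
     * would use the remote hostname here. */
    lookup_family("localhost", AF_INET, "IPv4");
    lookup_family("localhost", AF_INET6, "IPv6");
    return 0;
}
```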