Re: Clients get DNS timeouts because ipv6 means more queries for each lookup

2011-07-13 Thread Mark Andrews

No.  The fix is to correct the nameservers.  They are not correctly
following the DNS protocol, and everything else is fallout from
that.

 Well, all the prodding from people here prompted me to investigate 
 further exactly what's going on. The problem isn't what I thought it 
 was. It appears to be a bug in glibc, and I've filed a bug report and 
 found a workaround.

There is no bug in glibc.

 In a nutshell, the getaddrinfo function in glibc sends both A and AAAA
 queries to the DNS server at the same time and then deals with the 
 responses as they come in. Unfortunately, if the responses to the two 
 queries come back in reverse order, /and/ the first one to come back is 
 a server failure, both of which are the case when you try to resolve 
 en.wikipedia.org immediately after restarting your DNS server so nothing 
 is cached, the glibc code screws up and decides it didn't get back a 
 successful response even though it did.

There is *nothing* wrong with sending both queries at once.

 If you do the same lookup again, it works, because the CNAME that was 
 sent in response to the A query is cached, so both the A and AAAA
 queries get back valid responses from the DNS server. And even if that 
 weren't the case, since the CNAME is cached it gets returned first, 
 since the server doesn't need to do a query to get it, whereas it does 
 need to do another query to get the AAAA record (which recall isn't
 being cached because of the previously discussed FORMERR problem). It'll 
 keep working until the cached records time out, at which point it'll 
 happen again, and then be OK again until the records time out, etc.
 
 The workaround is to put "options single-request" in /etc/resolv.conf to
 prevent the glibc innards from sending out both the A and AAAA queries
 at the same time.
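 
 For example (the nameserver address below is illustrative, not a
 recommendation), the resulting /etc/resolv.conf would look like:
 
     nameserver 192.0.2.53
     options single-request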
 
 FYI, here's the glibc bug I filed about this:
 
 http://sourceware.org/bugzilla/show_bug.cgi?id=12994
 
 Thank you for telling me I was full of it and making me dig deeper into 
 this until I located the actual cause of the issue. :-)
 
jik

Note your fix won't help clients that only ask for AAAA records
because it is the authoritative servers that are broken, not the
resolver library or the recursive server.

Mark
-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: ma...@isc.org


Re: Clients get DNS timeouts because ipv6 means more queries for each lookup

2011-07-13 Thread Jonathan Kamens

On 07/13/2011 02:13 AM, Mark Andrews wrote:

No.  The fix is to correct the nameservers.  They are not correctly
following the DNS protocol, and everything else is fallout from
that.

You're right that everything else is fallout from that.

But that doesn't do me much good, does it? It's my system that keeps 
getting bogus name resolution errors. It's my RSS feed reader that keeps 
failing on an hourly basis when the cached records for en.wikipedia.org 
expire. It's all very well and good to say that the Wikipedia folks and 
other people with this problem should fix their nameservers -- I totally 
agree with that -- but it doesn't help me solve my problem /now/.


I'm a real user in the real world with a real problem. Yelling at 
Wikipedia to fix their DNS servers may feel good, but it doesn't make my 
DNS work. As far as I and all the other users who are being impacted 
/now/ by this problem are concerned, it's just pissing into the wind.

Well, all the prodding from people here prompted me to investigate
further exactly what's going on. The problem isn't what I thought it
was. It appears to be a bug in glibc, and I've filed a bug report and
found a workaround.

There is no bug in glibc.

To be blunt, that's bullshit.

If glibc makes an A query and an AAAA query, and it gets back a valid 
response to the A query and an invalid response to the AAAA query, then 
it should ignore the invalid response to the AAAA query and return the 
valid A response to the user as the IP address for the host.


Please note, furthermore, that as I explained in detail in my bug report 
and in my last message, glibc behaves differently based on the /order/ 
in which the two responses are returned by the DNS server. Since there's 
nothing that says a DNS server has to respond to two queries in the 
order in which they were received, and that would be an impossible 
requirement to impose in any case, since the queries and responses are 
sent via UDP, which doesn't guarantee ordering, it's perfectly clear that 
glibc needs to be prepared to function the same regardless of the order 
in which it receives the responses.
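
The situation is easy to reproduce at the wire level. Here is a minimal
sketch (Python, chosen purely for illustration; the resolver address is
made up) that sends the two queries back to back and accepts the answers
in whatever order they arrive:

    import socket, struct

    def dns_query(qname, qtype, qid):
        # Standard recursive query: 12-byte header, one question section.
        header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
        name = b"".join(bytes([len(p)]) + p.encode()
                        for p in qname.split(".")) + b"\x00"
        return header + name + struct.pack(">HH", qtype, 1)  # qclass IN

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server = ("192.0.2.53", 53)                             # illustrative
    s.sendto(dns_query("en.wikipedia.org", 1, 1), server)   # qtype 1 = A
    s.sendto(dns_query("en.wikipedia.org", 28, 2), server)  # qtype 28 = AAAA
    for _ in range(2):
        data, _ = s.recvfrom(4096)
        qid, flags = struct.unpack(">HH", data[:4])
        # Match answers by query ID, not arrival order; a SERVFAIL
        # (rcode 2) for one qtype must not discard a NOERROR (rcode 0)
        # answer for the other.
        print("qid", qid, "rcode", flags & 0x000F)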


What's more, there's plenty of code in the glibc files I spent hours 
poring over which is clearly an attempt to do exactly that. The people 
who wrote the code just got it wrong. Which isn't surprising, given how 
god-awful the code is.


This is not an either/or situation. The broken nameservers should be 
fixed, /and/ glibc should be fixed to properly handle the case of when 
it sends two queries and gets back one valid response and one server 
error in reverse order.

In a nutshell, the getaddrinfo function in glibc sends both A and AAAA
queries to the DNS server at the same time and then deals with the
responses as they come in. Unfortunately, if the responses to the two
queries come back in reverse order, /and/ the first one to come back is
a server failure, both of which are the case when you try to resolve
en.wikipedia.org immediately after restarting your DNS server so nothing
is cached, the glibc code screws up and decides it didn't get back a
successful response even though it did.

There is *nothing* wrong with sending both queries at once.

I didn't say there was. You really don't seem to be paying very good 
attention.


Do you understand what the word /workaround/ means?

Note your fix won't help clients that only ask for AAAA records
because it is the authoritative servers that are broken, not the
resolver library or the recursive server.

I am aware of that. It is irrelevant, because it is not the problem I am 
trying to solve. I, and 99.99% of the users in the world, are /not/ 
"only ask[ing] for AAAA records." Nobody actually trying to use the 
internet for day-to-day work is doing that right now, because to say 
that IPv6 support is not yet ubiquitous would be a laughably momentous 
understatement.


You seem to have a really big chip on your shoulder about people who run 
broken DNS servers. I don't like them any more than you do. But I 
learned "Be generous in what you accept and conservative in what you 
generate" way back when I started playing with the Internet well over 
two decades ago. It holds up now as well as it did back then, and 
there's no good reason why it shouldn't apply in this case.


It's clear that this is a religious issue for you. I'm not here to 
debate religion, I'm here to get help making my DNS work, and to help 
other people, to whatever extent I can, make /their/ DNS work. If you 
continue to send religious screeds on this topic while making no effort 
to actually read and understand what I write, please do not expect me to 
respond further.


  Jonathan Kamens




monitoring BIND

2011-07-13 Thread Karl Auer
We have some nameservers :-) that are used by quite a few thousand
people. Every now and then someone comes to us and complains that the
DNS is responding slowly. Sometimes they are right, and we find the
problem and fix it. But most of the time everything runs fine, and the
DNS is not, in fact, responding slowly when that someone comes to
complain. It turns out to be their PC, or a local network issue, or
whatever.

So we have a homegrown system in place that watches the traffic to and
from the nameservers, matches queries to answers, ignores everything
else, and notes how long it was between the question going past and the
answer going past in the opposite direction. It writes summarised
information second by second into a database so we can see exactly when
problems with response times happen, how long they happen for, and how
bad they are when they happen.
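
The core of the matching logic is small; here is a rough sketch of the
idea (Python with scapy, which is an assumption for illustration, not
the homegrown system itself):

    import time
    from collections import defaultdict
    from scapy.all import sniff, DNS, IP, UDP

    pending = {}                    # (client, port, query id) -> time seen
    per_second = defaultdict(list)  # unix second -> response times

    def handle(pkt):
        if not (pkt.haslayer(IP) and pkt.haslayer(UDP) and pkt.haslayer(DNS)):
            return
        now = time.time()
        dns = pkt[DNS]
        if dns.qr == 0:     # question going past
            pending[(pkt[IP].src, pkt[UDP].sport, dns.id)] = now
        else:               # answer going past in the opposite direction
            started = pending.pop((pkt[IP].dst, pkt[UDP].dport, dns.id), None)
            if started is not None:
                per_second[int(now)].append(now - started)

    # UDP only, like the system described above; sniffing needs root.
    sniff(filter="udp port 53", prn=handle, store=False)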

Our system has two faults (well, two that we are actually concerned
about): It only watches UDP, and it can't deal with fragmented packets.

So I was wondering if there is a better solution out there?

Regards, K.

-- 
~~~
Karl Auer (ka...@biplane.com.au)   +61-2-64957160 (h)
http://www.biplane.com.au/kauer/   +61-428-957160 (mob)

GPG fingerprint: DA41 51B1 1481 16E1 F7E2 B2E9 3007 14ED 5736 F687
Old fingerprint: B386 7819 B227 2961 8301 C5A9 2EBC 754B CD97 0156



Re: monitoring BIND

2011-07-13 Thread Ben Croswell
Nagios is a very good tool for synthetic transaction monitoring. You put in
whatever hosts and host names to resolve and it does it.

-Ben Croswell
On Jul 13, 2011 11:01 AM, Karl Auer ka...@biplane.com.au wrote:
 We have some nameservers :-) that are used by quite a few thousand
 people. Every now and then someone comes to us and complains that the
 DNS is responding slowly. Sometimes they are right, and we find the
 problem and fix it. But most of the time everything runs fine, and the
 DNS is not, in fact, responding slowly when that someone comes to
 complain. It turns out to be their PC, or a local network issue, or
 whatever.

 So we have a homegrown system in place that watches the traffic to and
 from the nameservers, matches queries to answers, ignores everything
 else, and notes how long it was between the question going past and the
 answer going past in the opposite direction. It writes summarised
 information second by second into a database so we can see exactly when
 problems with response times happen, how long they happen for, and how
 bad they are when they happen.

 Our system has two faults (well, two that we are actually concerned
 about): It only watches UDP, and it can't deal with fragmented packets.

 So I was wondering if there is a better solution out there?

 Regards, K.

 --
 ~~~
 Karl Auer (ka...@biplane.com.au) +61-2-64957160 (h)
 http://www.biplane.com.au/kauer/ +61-428-957160 (mob)

 GPG fingerprint: DA41 51B1 1481 16E1 F7E2 B2E9 3007 14ED 5736 F687
 Old fingerprint: B386 7819 B227 2961 8301 C5A9 2EBC 754B CD97 0156

big improvement in BIND9 auth-server startup time

2011-07-13 Thread Evan Hunt

People who operate big authoritative name servers (particularly with
large numbers of small zones, e.g., for domain hosting and parking),
and have had trouble with slow startup, may find this information
useful:

http://www.isc.org/community/blog/201107/major-improvement-bind-9-startup-performance

-- 
Evan Hunt -- e...@isc.org
Internet Systems Consortium, Inc.


Re: monitoring BIND

2011-07-13 Thread Romskie L
Hi Karl,

Have you considered using dig?

-Romskie

On Wed, Jul 13, 2011 at 10:43 PM, Karl Auer ka...@biplane.com.au wrote:
 We have some nameservers :-) that are used by quite a few thousand
 people. Every now and then someone comes to us and complains that the
 DNS is responding slowly. Sometimes they are right, and we find the
 problem and fix it. But most of the time everything runs fine, and the
 DNS is not, in fact, responding slowly when that someone comes to
 complain. It turns out to be their PC, or a local network issue, or
 whatever.

 So we have a homegrown system in place that watches the traffic to and
 from the nameservers, matches queries to answers, ignores everything
 else, and notes how long it was between the question going past and the
 answer going past in the opposite direction. It writes summarised
 information second by second into a database so we can see exactly when
 problems with response times happen, how long they happen for, and how
 bad they are when they happen.

 Our system has two faults (well, two that we are actually concerned
 about): It only watches UDP, and it can't deal with fragmented packets.

 So I was wondering if there is a better solution out there?

 Regards, K.

 --
 ~~~
 Karl Auer (ka...@biplane.com.au)                   +61-2-64957160 (h)
 http://www.biplane.com.au/kauer/                   +61-428-957160 (mob)

 GPG fingerprint: DA41 51B1 1481 16E1 F7E2 B2E9 3007 14ED 5736 F687
 Old fingerprint: B386 7819 B227 2961 8301 C5A9 2EBC 754B CD97 0156



Re: monitoring BIND

2011-07-13 Thread Karl Auer
More info to my question:

dig and Nagios have been suggested as possible solutions.

dig (and I suspect Nagios, which someone else mentioned) can only test
resolution times from one point in the network, or maybe several, and
using a very small number of tests.

Our current system watches ALL queries and responses to and from the
nameservers and summarises ALL the response times, regardless of where
the queries came from. For every second of the day we can say what the
average, minimum, maximum, etc response times were.

We're looking for something that can do that, or something similar...

Regards, K.

-- 
~~~
Karl Auer (ka...@biplane.com.au)   +61-2-64957160 (h)
http://www.biplane.com.au/kauer/   +61-428-957160 (mob)

GPG fingerprint: DA41 51B1 1481 16E1 F7E2 B2E9 3007 14ED 5736 F687
Old fingerprint: B386 7819 B227 2961 8301 C5A9 2EBC 754B CD97 0156



Re: monitoring BIND

2011-07-13 Thread Romskie L
You can use dig to get a sample of the response time and rndc stats to
get query and nameserver statistics.
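
For example, a quick way to sample response times is to scrape the timing
line dig already prints (a sketch; the server address is illustrative):

    import re, subprocess

    # dig prints a ";; Query time: N msec" line with each answer.
    out = subprocess.run(
        ["dig", "@192.0.2.53", "en.wikipedia.org", "A"],
        capture_output=True, text=True, timeout=10,
    ).stdout
    m = re.search(r"Query time: (\d+) msec", out)
    print(m.group(1) + " msec" if m else "no timing line (query failed?)")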



On Wed, Jul 13, 2011 at 11:15 PM, Romskie L rslara...@gmail.com wrote:
 Hi Karl,

 Have you considered using dig?

 -Romskie

 On Wed, Jul 13, 2011 at 10:43 PM, Karl Auer ka...@biplane.com.au wrote:
 We have some nameservers :-) that are used by quite a few thousand
 people. Every now and then someone comes to us and complains that the
 DNS is responding slowly. Sometimes they are right, and we find the
 problem and fix it. But most of the time everything runs fine, and the
 DNS is not, in fact, responding slowly when that someone comes to
 complain. It turns out to be their PC, or a local network issue, or
 whatever.

 So we have a homegrown system in place that watches the traffic to and
 from the nameservers, matches queries to answers, ignores everything
 else, and notes how long it was between the question going past and the
 answer going past in the opposite direction. It writes summarised
 information second by second into a database so we can see exactly when
 problems with response times happen, how long they happen for, and how
 bad they are when they happen.

 Our system has two faults (well, two that we are actually concerned
 about): It only watches UDP, and it can't deal with fragmented packets.

 So I was wondering if there is a better solution out there?

 Regards, K.

 --
 ~~~
 Karl Auer (ka...@biplane.com.au)                   +61-2-64957160 (h)
 http://www.biplane.com.au/kauer/                   +61-428-957160 (mob)

 GPG fingerprint: DA41 51B1 1481 16E1 F7E2 B2E9 3007 14ED 5736 F687
 Old fingerprint: B386 7819 B227 2961 8301 C5A9 2EBC 754B CD97 0156



Re: monitoring BIND

2011-07-13 Thread Phil Mayers

On 07/13/2011 03:43 PM, Karl Auer wrote:


So I was wondering if there is a better solution out there?


People I know speak highly of DSC:

http://dns.measurement-factory.com/tools/dsc/index.html


Re: monitoring BIND

2011-07-13 Thread Dave Knight

Sorry for contributing another non-answer, just wanted to comment that I have 
done something very similar once upon a time...

The case was a DNS authority service anycast node with:

2 Internet Facing Routers -- 2 Load Balancing Switches -- Big Stack of Servers

We had seen degraded performance reported by RIPE NCC's DNSMON but weren't sure 
if the problem was Internet routing, or inside our nodes, and if inside our 
nodes was it the server, or the load balancer, etc. 

We set up traffic capture with tcpdump at strategic points within the node, i.e.: 
between the router and load balancer, between the load balancer and the 
servers, on each server. With a good sample of the traffic, say an hour or so, 
we could then pull the DNSMON raw data for that same time period, and match the 
queries it sent to us (the DNSMON raw data contains the query id) against what 
we saw inside our node and verify that we saw it, answered it, and that the 
answer made it back out into the Internet. We could also see what path the 
query and answer took through the node and where any delays might be.

This very quickly led us to the load balancers as the cause of the delays and 
we were able to fix them.

We never felt the need to run this on an ongoing basis, once our servers looked 
green in DNSMON again we were happy that all was well in our world. We used it 
for diagnosis, rather than detection as it sounds like you want to do.

dave


On 2011-07-13, at 11:27 AM, Karl Auer wrote:

 More info to my question:
 
 dig and Nagios have been suggested as possible solutions.
 
 dig (and I suspect Nagios, which someone else mentioned) can only test
 resolution times from one point in the network, or maybe several, and
 using a very small number of tests.
 
 Our current system watches ALL queries and responses to and from the
 nameservers and summarises ALL the response times, regardless of where
 the queries came from. For every second of the day we can say what the
 average, minimum, maximum, etc response times were.
 
 We're looking for something that can do that, or something similar...
 
 Regards, K.
 
 -- 
 ~~~
 Karl Auer (ka...@biplane.com.au)   +61-2-64957160 (h)
 http://www.biplane.com.au/kauer/   +61-428-957160 (mob)
 
 GPG fingerprint: DA41 51B1 1481 16E1 F7E2 B2E9 3007 14ED 5736 F687
 Old fingerprint: B386 7819 B227 2961 8301 C5A9 2EBC 754B CD97 0156


Re: Clients get DNS timeouts because ipv6 means more queries for each lookup

2011-07-13 Thread Kevin Darcy

On 7/13/2011 2:35 AM, Jonathan Kamens wrote:

On 07/13/2011 02:13 AM, Mark Andrews wrote:

Well, all the prodding from people here prompted me to investigate
further exactly what's going on. The problem isn't what I thought it
was. It appears to be a bug in glibc, and I've filed a bug report and
found a workaround.

There is no bug in glibc.

To be blunt, that's bullshit.

If glibc makes an A query and an AAAA query, and it gets back a valid 
response to the A query and an invalid response to the AAAA query, 
then it should ignore the invalid response to the AAAA query and 
return the valid A response to the user as the IP address for the host.


Please note, furthermore, that as I explained in detail in my bug 
report and in my last message, glibc behaves differently based on the 
/order/ in which the two responses are returned by the DNS server. 
Since there's nothing that says a DNS server has to respond to two 
queries in the order in which they were received, and that would be an 
impossible requirement to impose in any case, since the queries and 
responses are sent via UDP, which doesn't guarantee ordering, it's 
perfectly clear that glibc needs to be prepared to function the same 
regardless of the order in which it receives the responses.

I agree that the order of the A/AAAA responses shouldn't matter to the 
result. The whole getaddrinfo() call should fail regardless of whether 
the failure is seen first or the valid response is seen first. Why? 
Because getaddrinfo() should, if it isn't already, be using the RFC 3484 
algorithm (and/or whatever the successor to RFC 3484 ends up being) to 
sort the addresses, and for that algorithm to work, one needs *both* the 
IPv4 address(es) *and* the IPv6 address(es) available, in order to 
compare their scopes, prefixes, etc. If one of the lookups fails, and 
this failure is presented to the RFC 3484 algorithm as NODATA for a 
particular address family, then the algorithm could make a bad selection 
of the destination address, and this can lead to other sorts of 
breakage, e.g. trying to use a tunneled connection where no tunnel 
exists.  The *safe* thing for glibc to do is to promote the failure of 
either the A lookup or the AAAA lookup to a general lookup failure, 
which prompts the user/administrator to find the source of the problem 
and fix it.
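
For context, the sort order matters because callers typically walk the
returned list in order until a connect succeeds; the canonical pattern
looks roughly like this (a Python sketch; socket.getaddrinfo is a thin
wrapper over the glibc call on Linux):

    import socket

    def connect_to(host, port):
        # Try addresses in the order getaddrinfo returned them; on glibc
        # that order comes from the RFC 3484 selection rules, so a missing
        # address family skews which destination gets tried first.
        last_err = None
        for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
                host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
            sock = None
            try:
                sock = socket.socket(family, socktype, proto)
                sock.connect(sockaddr)
                return sock
            except OSError as err:
                last_err = err
                if sock is not None:
                    sock.close()
        raise last_err or OSError("no addresses returned")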


It's rarely a good idea to mask undeniable errors as if there were no 
error at all. It leads to unpredictable behavior and really tough 
troubleshooting challenges. I think glibc is erring on the side of 
openness and transparency here, rather than trying to cover up the fact 
that something is horribly wrong.





Note your fix won't help clients that only ask for AAAA records
because it is the authoritative servers that are broken, not the
resolver library or the recursive server.

I am aware of that. It is irrelevant, because it is not the problem I 
am trying to solve. I, and 99.99% of the users in the world, are 
/not/ "only ask[ing] for AAAA records." Nobody actually trying to use 
the internet for day-to-day work is doing that right now, because to 
say that IPv6 support is not yet ubiquitous would be a laughably 
momentous understatement.

What about clients in a NAT64/DNS64 environment? They could be 
configured as IPv6-only but normally able to access the IPv4 Internet 
just fine. Even with your glibc fix in place, though, they'll 
presumably break if the authoritative nameservers are giving garbage 
responses to AAAA queries (could someone with practical experience in 
DNS64 please confirm this?).


Another possibility you're not considering is that the invoking 
application itself may make independent IPv4-specific and IPv6-specific 
getaddrinfo() lookups. Why would it do this? Why not? Maybe IPv6 
capability is something the user has to buy a separate license for, so 
the IPv6 part is a slightly separate codepath, added in a later version, 
than the base product, which is IPv4-only. When one of the getaddrinfo() 
calls returns address records and the other returns garbage, your fix 
doesn't prevent such an application from doing something unpredictable, 
possibly catastrophic. So it's really not a general solution to the problem.





- Kevin

Re: Clients get DNS timeouts because ipv6 means more queries for each lookup

2011-07-13 Thread Kevin Darcy

On 7/13/2011 1:06 PM, Kevin Darcy wrote:

On 7/13/2011 2:35 AM, Jonathan Kamens wrote:

On 07/13/2011 02:13 AM, Mark Andrews wrote:

Well, all the prodding from people here prompted me to investigate
further exactly what's going on. The problem isn't what I thought it
was. It appears to be a bug in glibc, and I've filed a bug report and
found a workaround.

There is no bug in glibc.

To be blunt, that's bullshit.

If glibc makes an A query and an AAAA query, and it gets back a valid 
response to the A query and an invalid response to the AAAA query, 
then it should ignore the invalid response to the AAAA query and 
return the valid A response to the user as the IP address for the host.


Please note, furthermore, that as I explained in detail in my bug 
report and in my last message, glibc behaves differently based on the 
/order/ in which the two responses are returned by the DNS server. 
Since there's nothing that says a DNS server has to respond to two 
queries in the order in which they were received, and that would be 
an impossible requirement to impose in any case, since the queries 
and responses are sent via UDP, which doesn't guarantee ordering, it's 
perfectly clear that glibc needs to be prepared to function the same 
regardless of the order in which it receives the responses.

I agree that the order of the A/AAAA responses shouldn't matter to the 
result. The whole getaddrinfo() call should fail regardless of whether 
the failure is seen first or the valid response is seen first. Why? 
Because getaddrinfo() should, if it isn't already, be using the RFC 
3484 algorithm (and/or whatever the successor to RFC 3484 ends up 
being) to sort the addresses, and for that algorithm to work, one 
needs *both* the IPv4 address(es) *and* the IPv6 address(es) 
available, in order to compare their scopes, prefixes, etc. If one of 
the lookups fails, and this failure is presented to the RFC 3484 
algorithm as NODATA for a particular address family, then the 
algorithm could make a bad selection of the destination address, and 
this can lead to other sorts of breakage, e.g. trying to use a 
tunneled connection where no tunnel exists.  The *safe* thing for 
glibc to do is to promote the failure of either the A lookup or the 
AAAA lookup to a general lookup failure, which prompts the 
user/administrator to find the source of the problem and fix it.


It's rarely a good idea to mask undeniable errors as if there were no 
error at all. It leads to unpredictable behavior and really tough 
troubleshooting challenges. I think glibc is erring on the side of 
openness and transparency here, rather than trying to cover up the 
fact that something is horribly wrong.





Note your fix won't help clients that only ask for AAAA records
because it is the authoritative servers that are broken, not the
resolver library or the recursive server.

I am aware of that. It is irrelevant, because it is not the problem I 
am trying to solve. I, and 99.99% of the users in the world, are 
/not/ "only ask[ing] for AAAA records." Nobody actually trying to use 
the internet for day-to-day work is doing that right now, because to 
say that IPv6 support is not yet ubiquitous would be a laughably 
momentous understatement.

What about clients in a NAT64/DNS64 environment? They could be 
configured as IPv6-only but normally able to access the IPv4 Internet 
just fine. Even with your glibc fix in place, though, they'll 
presumably break if the authoritative nameservers are giving garbage 
responses to AAAA queries (could someone with practical experience in 
DNS64 please confirm this?).


Another possibility you're not considering is that the invoking 
application itself may make independent IPv4-specific and 
IPv6-specific getaddrinfo() lookups. Why would it do this? Why not? 
Maybe IPv6 capability is something the user has to buy a separate 
license for, so the IPv6 part is a slightly separate codepath, added 
in a later version, than the base product, which is IPv4-only. When 
one of the getaddrinfo() calls returns address records and the other 
returns garbage, your fix doesn't prevent such an application from 
doing something unpredictable, possibly catastrophic. So it's really 
not a general solution to the problem.

Oh, I should also point out that this brokenness by the 
wikipedia/wikimedia nameservers *isn't* just specific to AAAA queries, 
and therefore *isn't* fixable with getaddrinfo() alone. Try doing an 
MX query of en.wikipedia.org. Or a PTR query. Or any of the other old 
(yet non-deprecated) query types (e.g. NS, TXT, HINFO). The only QTYPEs 
that are answered correctly are A, CNAME and (oddly enough) SOA. So they 
don't even have the excuse of "well, AAAA queries are kinda new, we 
haven't got around to handling them properly yet." This behavior has 
failed to conform to the standard, for as long as the standard has 
existed; it's not recent, IPv6-specific breakage.


  

Re: monitoring BIND

2011-07-13 Thread Pásztor János

Hello!

You should try collectd (http://collectd.org/) and its bind plugin 
(http://collectd.org/wiki/index.php/Plugin:BIND) You can put the 
collected data to csv or RRD on the local server or send it over the 
network. With RRDtool you can make fancy graphs. With this cgi 
(http://haroon.sis.utoronto.ca/rrd/scripts/) you could easily visualize 
the data.
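
For reference, the moving parts are roughly these two fragments: named
exposing its statistics channel over HTTP, and collectd polling it with
the bind plugin (port and ACL here are illustrative; see the plugin page
above for the full option list):

    # named.conf (BIND 9.5+): enable the XML statistics channel
    statistics-channels {
        inet 127.0.0.1 port 8053 allow { 127.0.0.1; };
    };

    # collectd.conf: poll the channel with the bind plugin
    LoadPlugin bind
    <Plugin "bind">
        URL "http://127.0.0.1:8053/"
    </Plugin>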


Regards,
János

On 2011-07-13 16:43, Karl Auer wrote:

We have some nameservers :-) that are used by quite a few thousand
people. Every now and then someone comes to us and complains that the
DNS is responding slowly. Sometimes they are right, and we find the
problem and fix it. But most of the time everything runs fine, and the
DNS is not, in fact, responding slowly when that someone comes to
complain. It turns out to be their PC, or a local network issue, or
whatever.

So we have a homegrown system in place that watches the traffic to and
from the nameservers, matches queries to answers, ignores everything
else, and notes how long it was between the question going past and the
answer going past in the opposite direction. It writes summarised
information second by second into a database so we can see exactly when
problems with response times happen, how long they happen for, and how
bad they are when they happen.

Our system has two faults (well, two that we are actually concerned
about): It only watches UDP, and it can't deal with fragmented packets.

So I was wondering if there is a better solution out there?

Regards, K.




RE: Clients get DNS timeouts because ipv6 means more queries for each lookup

2011-07-13 Thread Jonathan Kamens
I agree that the order of the A/AAAA responses shouldn't matter to the
result. The whole getaddrinfo() call should fail regardless of whether the
failure is seen first or the valid response is seen first. Why? Because
getaddrinfo() should, if it isn't already, be using the RFC 3484 algorithm
(and/or whatever the successor to RFC 3484 ends up being) to sort the
addresses, and for that algorithm to work, one needs *both* the IPv4
address(es) *and* the IPv6 address(es) available, in order to compare their
scopes, prefixes, etc.

 

RFC 3484 tells you how to sort addresses you've got.

 

If you've only got one address, then bang! It's already sorted for you. You
don't need RFC 3484 to tell you how to sort it.

 

I have to say that some of the people on this list seem completely detached
from what real users in the real world want their computers to do.

 

If I am trying to connect to a site on the internet, then I want my computer
to do its best to try to connect to the site. I don't want it to throw up
its hands and say, "Oh, I'm sorry, one of my address lookups failed, so I'm
not going to let you use the other address lookup, the one that succeeded,
because some RFC somewhere could be interpreted as implying that's a bad
idea, if I wanted to do so." Please, that's ridiculous.

 

If one of the lookups fails, and this failure is presented to the RFC 3484
algorithm as NODATA for a particular address family, then the algorithm
could make a bad selection of the destination address, and this can lead to
other sorts of breakage, e.g. trying to use a tunneled connection where no
tunnel exists.

 

If the address the client gets doesn't work, then the address doesn't work.
How is being unable to connect because the address turned out to not be
routable different from being unable to connect because the computer refused
to even try?



Another possibility you're not considering is that the invoking application
itself may make independent IPv4-specific and IPv6-specific getaddrinfo()
lookups. Why would it do this? Why not? Maybe IPv6 capability is something
the user has to buy a separate license for, so the IPv6 part is a slightly
separate codepath, added in a later version, than the base product, which is
IPv4-only. When one of the getaddrinfo() calls returns address records and
the other returns garbage, your fix doesn't prevent such an application
from doing something unpredictable, possibly catastrophic. So it's really
not a general solution to the problem.

 

I have no idea what you're talking about. If the application makes
independent IPv4 and IPv6 getaddrinfo() lookups, then the change I'm
proposing to glibc is completely irrelevant and does not impact the existing
functionality in any way. The IPv4 lookup will succeed, the IPv6 lookup will
fail, and the application is then free to decide what to do.

 

In summary, getaddrinfo() with AF_UNSPEC has a very clear meaning - "Give me
whatever addresses you can." The man page says, and I am quoting, "The value
AF_UNSPEC indicates that getaddrinfo() should return socket addresses for
any address family (either IPv4 or IPv6, for example) that can be used with
node and service." I don't see how the language could be any more clear. To
suggest that it's reasonable and correct for it to refuse to return a
successfully fetched address is simply ludicrous.
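
The behavior in question is easy to observe through any wrapper over the
same glibc call, e.g. from Python (a sketch; against a healthy resolver
this prints entries from both address families):

    import socket

    # AF_UNSPEC asks for socket addresses from any family in one call.
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            "en.wikipedia.org", 80, socket.AF_UNSPEC, socket.SOCK_STREAM):
        label = "IPv6" if family == socket.AF_INET6 else "IPv4"
        print(label, sockaddr[0])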

 

I hope and pray that people who maintain the glibc code have more common
sense about what users want and expect from their software.

 

In the meantime, it's clear that I don't belong on this mailing list, so I'm
out of here.

 

  Jonathan Kamens

 


Re: monitoring BIND

2011-07-13 Thread Kerry Thompson
On Thu, 14 Jul 2011 01:27:48 +1000, Karl Auer ka...@biplane.com.au
wrote:
 More info to my question:
 
 dig and Nagios have been suggested as possible solutions.
 
 dig (and I suspect Nagios, which someone else mentioned) can only test
 resolution times from one point in the network, or maybe several, and
 using a very small number of tests.
 
 Our current system watches ALL queries and responses to and from the
 nameservers and summarises ALL the response times, regardless of where
 the queries came from. For every second of the day we can say what the
 average, minimum, maximum, etc response times were.
 
 We're looking for something that can do that, or something similar...
 
 Regards, K.

PasTmon can do that from the server side. It listens for network traffic
like tcpdump and shovels all of the packet timings into a Postgres database
with a nice front-end for graphs and analysis. I can't remember if the DNS
plugin has filtering for different query types (e.g. A, PTR, etc.) but it
can probably be written without too much pain.

See http://pastmon.sourceforge.net/

I've used it to solve web app performance problems; it should have no
trouble dealing with DNS.


-- 
Kerry


Re: Allowing resolution of off-server CNAMEs

2011-07-13 Thread Joseph S D Yao
On Fri, Jul 08, 2011 at 10:26:16AM -0700, Chris Buxton wrote:
 On Jul 8, 2011, at 9:11 AM, Joseph S D Yao wrote:
  I'd rather that recursion controls only control recursion.
  And not forwarding - have separate forwarding controls, says I.
 
 Forwarding is a response to a recursive query. For an iterative query, even 
 if you have recursion enabled, the server won't forward the query. Therefore, 
 it is logical that it be controlled with the same settings as recursion.
 
 What problem are you trying to solve? A dangling CNAME such as you describe 
 is a normal behavior that caching resolvers are easily able to follow.


Thanks to those who responded.

The real problem is not with sub.tld.example, but with
otherzone.faraway.example which works most of the time in most of the
world.  When it fails, people do an MSW 'nslookup' targeted at my
system, and see nothing until I have described to them several times how
to get a CNAME record with MSW 'nslookup' and what it means.

Yes, not as secure.  But less time explaining why.

And I realize I have gotten sloppy about the difference between
recursive and iterative - bad me!


--
/*\
**
** Joe Yao  j...@tux.org - Joseph S. D. Yao
**
\*/


Re: Clients get DNS timeouts because ipv6 means more queries for each lookup

2011-07-13 Thread Kevin Darcy

On 7/13/2011 2:39 PM, Jonathan Kamens wrote:


I agree that the order of the A/AAAA responses shouldn't matter to the 
result. The whole getaddrinfo() call should fail regardless of whether 
the failure is seen first or the valid response is seen first. Why? 
Because getaddrinfo() should, if it isn't already, be using the RFC 
3484 algorithm (and/or whatever the successor to RFC 3484 ends up 
being) to sort the addresses, and for that algorithm to work, one 
needs *both* the IPv4 address(es) *and* the IPv6 address(es) 
available, in order to compare their scopes, prefixes, etc.


RFC 3484 tells you how to sort addresses you've got.

If you've only got one address, then bang! It's already sorted for 
you. You don't need RFC 3484 to tell you how to sort it.


No, you've got one address, and one unspecified nameserver failure. 
Garbage in, garbage out. To say that a nameserver failure is equivalent 
to NODATA is not only technically incorrect, it leads to all sorts of 
operational problems in the real world.


I have to say that some of the people on this list seem completely 
detached from what real users in the real world want their computers 
to do.


Really? Do you think I'm an academic? Do you think I sit and write 
Internet Drafts and RFCs all day? No, I'm an implementor. I deal with 
DNS operational problems and issues all day, every workday. And I can 
tell you that I don't appreciate library routines making wild-ass 
assumptions that, in the face of some questionable behavior by a 
nameserver, maybe, possibly some quantity of addresses that I've 
acquired from that dodgy nameserver are good enough for my clients to 
try and connect to. No thanks. If there's a real problem I want to know 
about it as clearly and unambiguously as possible. I can't deal 
effectively with a problem if it's being masked by some library routine 
doing something weird behind my back.


If I am trying to connect to a site on the internet, then I want my 
computer to do its best to try to connect to the site. I don't want it 
to throw up its hands and say, "Oh, I'm sorry, one of my address 
lookups failed, so I'm not going to let you use the /other/ address 
lookup, the one that succeeded, because some RFC somewhere could be 
interpreted as implying that's a bad idea, if I wanted to do so." 
Please, that's ridiculous.


No, what's more ridiculous is if users can't get to a site SOME OF THE 
TIME, because someone's DNS is broken, a moronic library routine then 
routes the traffic some unexpected way, and a whole raft of other 
variables enter the picture, without anyone realizing or paying 
attention to the dependencies and interconnectivity that is required to 
keep the client working. There is a certain threshold of brokenness 
where the infrastructure has to throw up its hands, as you put it, and 
say "nuh uh, not gonna happen," because to try to work around the 
problem based on not enough information about the topology, the 
environment, the dependencies, etc. you're likely to cause more harm 
than good by making the failure modes way more complex than necessary.


If one of the lookups fails, and this failure is presented to the 
RFC 3484 algorithm as NODATA for a particular address family, then the 
algorithm could make a bad selection of the destination address, and 
this can lead to other sorts of breakage, e.g. trying to use a 
tunneled connection where no tunnel exists.


If the address the client gets doesn't work, then the address doesn't 
work. How is being unable to connect because the address turned out to 
not be routable different from being unable to connect because the 
computer refused to even try?


Because the failure modes are substantially different and it could take 
significant man-hours to determine that the root cause of the problem is 
actually DNS brokenness rather than something else in the network 
infrastructure (routers, switches, VPN concentrators, firewalls, IPSes, 
load-balancers, etc.) or in the client or server (OS, application, 
middleware, etc.)


Have you ever actually troubleshot a difficult connectivity problem in a 
complex networking environment? Trust me, you want clear symptoms, clear 
failure modes. Not a bunch of components making dumb assumptions and/or 
trying to be helpful outside of their defined scope of functionality. 
That kind of "help" is like offering a glass of water to a drowning man. 



Another possibility you're not considering is that the invoking 
application itself may make independent IPv4-specific and 
IPv6-specific getaddrinfo() lookups. Why would it do this? Why not? 
Maybe IPv6 capability is something the user has to buy a separate 
license for, so the IPv6 part is a slightly separate codepath, added 
in a later version, than the base product, which is IPv4-only. When 
one of the getaddrinfo() calls returns address records and the other 
returns garbage, your fix doesn't prevent such an application from 
doing something unpredictable, possibly catastrophic.