DNS Redundancy

2010-10-21 Thread Martin McCormick
The normal procedure on internet-connected systems is to
set the resolv.conf file to include at least 2 domain name
servers. Example:

nameserver  139.78.100.1
nameserver  139.78.200.1

Last night, I had to take down our primary DNS for
maintenance, and lots of FreeBSD and Linux systems began having
trouble of various kinds.

While I expected the FreeBSD system I was on to hang for
a couple of seconds and then start using the second DNS, it
basically froze while some Linux boxes also began exhibiting
similar behavior.

I finally changed resolv.conf manually on the system
I was using to force the slave DNS to be first in the list, and
that helped, but losing the primary DNS was not the slight
slowdown one might expect. It was a full-blown outage.

Are we missing some other configuration directive for Unix systems
that would make the systems use the redundancy a little
more gracefully than what happened? Otherwise, why have it if
somebody has to manually intervene? The only thing we should
have lost was dynamic updates. The outage lasted 25 minutes or
so and didn't end until the primary came back online.

This is my week for asking novice questions, but I don't
often get to see what happens when the master goes away, and
what I saw wasn't pretty.

Martin McCormick WB5AGZ  Stillwater, OK 
Systems Engineer
OSU Information Technology Department Telecommunications Services Group
___
bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: DNS Redundancy

2010-10-21 Thread Stephane Bortzmeyer
On Thu, Oct 21, 2010 at 06:32:09AM -0500,
 Martin McCormick mar...@dc.cis.okstate.edu wrote 
 a message of 39 lines which said:

 Example:
 
 nameserver 139.78.100.1
 nameserver 139.78.200.1

I always add:

options timeout:1

because the default timeout is 5 seconds, much too long to allow
for a smooth fallback.

Other options could be interesting, such as rotate. See
resolv.conf(5).
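Put together, a resolv.conf along these lines fails over much faster (a
sketch; timeout, attempts and rotate are all documented in resolv.conf(5),
and the addresses are the ones from the original post):

```
nameserver 139.78.100.1
nameserver 139.78.200.1
options timeout:1 attempts:2 rotate
```

With rotate, queries are spread round-robin over both servers, so a dead
primary only delays the fraction of lookups that land on it.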

Unlike the failure of an authoritative name server, the failure of a
resolver is not really transparent for the Unix stub resolver, as you
have discovered. You may consider solutions using a redundancy at
layer 3 such as VRRP or CARP.
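A layer-3 sketch of that idea, assuming keepalived (VRRP) with the
resolver's service address as the floating VIP (interface name and
priority are illustrative):

```
vrrp_instance DNS_VIP {
    state MASTER
    interface eth0
    virtual_router_id 53
    priority 150
    virtual_ipaddress {
        139.78.100.1/32
    }
}
```

A standby resolver runs the same block with a lower priority and takes
over the address within seconds if the master stops answering VRRP.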



Re: DNS Redundancy

2010-10-21 Thread Niall O'Reilly

On 21 Oct 2010, at 12:32, Martin McCormick wrote:

   The normal procedure on internet-connected systems is to
 set the resolv.conf file to include at least 2 domain name
 servers. Example:
 
 nameserver 139.78.100.1
 nameserver 139.78.200.1
 
   Last night, I had to take down our primary DNS for
 maintenance and lots of FreeBSD and Linux systems began having
 trouble of various kinds.
 
   While I expected the FreeBSD system I was on to hang for
 a couple of seconds and then start using the second DNS, it
 basically froze while some Linux boxes also began exhibiting
 similar behavior.
 
   I finally manually changed the resolv.conf on the system
 I was using to force the slave DNS to be first in the list and
 that helped, but losing the primary DNS was not the slight
 slowdown one might expect. It was a full-blown outage.

It's a good idea to keep your authoritative name service
(for announcing DNS records for your part of the DNS) separate
from your resolver name service (for mediating name service 
to the clients on your network).

/etc/resolv.conf (or its equivalent on other platforms) specifies
where the client should look for resolver service.  The addresses
listed there should preferably not be those of the master or slave
servers for your DNS zone(s).

Without more detail, it's difficult to say exactly what chain
of cause and effect led to your full-blown outage.

It's well to bear in mind that the typical (Unix-like) client
will always step through the nameserver addresses in the order
in which they appear in /etc/resolv.conf.  If you're planning to
take one of them down for maintenance, and wish to avoid
client-side delays, you need to configure the clients in advance
(for example, by using DHCP) with a different /etc/resolv.conf.
Alternatively, you might bring up the first address in the list
on the second server.  There is no one true way.
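For the DHCP route, a single line in the server configuration is enough to
re-order the list fleet-wide ahead of maintenance (a sketch assuming ISC
dhcpd; clients pick up the change at lease renewal):

```
option domain-name-servers 139.78.200.1, 139.78.100.1;
```

The catch is lease timing: clients only renew at T1 (typically half the
lease length), so the swap has to be pushed out well before the outage.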

On the other hand, dedicated resolver servers (at least those
running BIND named) keep track of the state of the authoritative
servers for the names for which they are processing queries, and
automagically ignore any that are unreachable.  This allows my
customers (for example) to be spared delay when you take one of
your authoritative servers down.

Best regards,
Niall O'Reilly



Re: DNS Redundancy

2010-10-21 Thread Phil Mayers

On 21/10/10 12:50, Stephane Bortzmeyer wrote:


Unlike the failure of an authoritative name server, the failure of a
resolver is not really transparent for the Unix stub resolver, as you
have discovered. You may consider solutions using a redundancy at
layer 3 such as VRRP or CARP.


Yeah, we've observed this.

Our primary and secondary DNS IPs are actually virtual IPs; one is via a 
layer-4 load balancer, the other via an eBGP-injected route (for 
diversity) pointing at 4 real resolvers.


You can alleviate it with nscd on the clients, but that has its own 
problems.



Re: DNS Redundancy

2010-10-21 Thread lhecking
Stephane Bortzmeyer writes:
 On Thu, Oct 21, 2010 at 06:32:09AM -0500,
  Martin McCormick mar...@dc.cis.okstate.edu wrote 
  a message of 39 lines which said:
 
  Example:
  
  nameserver  139.78.100.1
  nameserver  139.78.200.1
 
 I always add:
 
 timeout:1
 
 because the default timeout is 5 seconds, much too important to allow
 for a smooth fallback.
 
 Other options could be interesting, such as rotate. See
 resolv.conf(5).
 
Nearly off-topic, but how does one specify such options via DHCP?





DNS Redundancy, Round 2

2010-10-21 Thread Stewart Dean
A slightly different, but allied, question: we are seeing a situation where (Red 
Hat or CentOS) servers with 2 nameservers in their resolv.conf files nearly hang 
in name resolution, but run quickly if one of the nameservers is deleted from 
resolv.conf.  Both of the referenced nameservers are on the same internal 
subnet: 10.5.0.2 and 10.5.0.3.


The two internal nameservers are running AIX 5.3 and BIND 9.2.1.

I haven't delved into this yet, but I'd welcome suggestions on where I should be 
looking.
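One place to start is to probe each listed nameserver directly from an
affected client and compare timings (a sketch; dig's +time and +tries
flags bound each probe, and the query name is illustrative):

```
for ns in 10.5.0.2 10.5.0.3; do
  echo "== $ns =="
  dig @"$ns" +time=2 +tries=1 www.example.com A | grep -i 'query time'
done
```

If one server answers slowly or not at all when queried directly, the
near-hang with two nameservers listed is just the stub resolver burning
its full timeout on that server before moving on.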



--
One must think like a hero to behave like a merely decent human being.
- May Sarton
Stewart Dean, Unix System Admin, Bard College, New York 12504
sd...@bard.edu  voice: 845-758-7475, fax: 845-758-7035



Re: DNS Redundancy

2010-10-21 Thread Stephane Bortzmeyer
On Thu, Oct 21, 2010 at 02:27:52PM +0100,
 lheck...@users.sourceforge.net lheck...@users.sourceforge.net wrote 
 a message of 35 lines which said:

  Other options could be interesting, such as rotate. See
  resolv.conf(5).
  
  Nearly off-topic, but how does one specify such options via dhcp?

It depends on the DHCP client you use. With pump, you can use
--noresolvconf. For the ISC client, see dhclient(8) and dhclient.conf(5).
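DHCP itself has no standard option for resolver tunables like timeout or
rotate, so with the ISC client one common workaround is an exit hook that
appends the options line after dhclient rewrites resolv.conf (a sketch;
the hook path and events vary by distribution):

```
# /etc/dhclient-exit-hooks -- sourced by dhclient-script after lease events
case "$reason" in
  BOUND|RENEW|REBIND|REBOOT)
    grep -q '^options' /etc/resolv.conf ||
      echo 'options timeout:1 rotate' >> /etc/resolv.conf
    ;;
esac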


Re: DNS Redundancy

2010-10-21 Thread Gordon A. Lang

We have been very successful using any-casting whereby multiple,
equivalently-configured DNS servers are placed throughout the network,
all providing DNS service on the same virtual addresses, and these
virtual addresses are host-routed (i.e. routed with a /32 netmask).

The keys to this working well are:
 1. Host routes are dynamically asserted or withdrawn based on health
of the DNS service on each server.
 2. Packet flow paths are stable across the network (for tcp based
queries).
 3. Publish two any-cast resolver addresses.


I have seen people run dynamic routing protocols on the servers (e.g.
ripv2 or ospf) combined with cron-driven health check scripts that
control the dynamic routing of the virtual address.  We have also used
load balancers to handle the server health monitoring and the dynamic
routing -- only because the load balancers happened to be convenient
-- I would not use a load balancer otherwise.  But I prefer the Cisco
IP SLA idea to both monitor the server health and control the host
routes (although I have not tested this).

The stable path requirement is easy with Cisco CEF as long as you do
not use per-packet load sharing.

It is actually counter-productive to have two resolvers configured
with this architecture, but to circumvent human nature, we publish two.

There is absolutely no functional difference between the two, and
there is no redundancy value for the second one -- they are both
hosted on each and every one of the any-cast servers.  The only
reason for the second resolver is to deter people from making
up their own second resolver -- people expect two resolvers, and
if you give them only one, they will go ahead and put something in
as the second resolver -- even if you tell them not to.  This is a
very important aspect of having the architecture succeed in our
environment.
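The cron-driven health check mentioned above can be sketched as follows
(hypothetical: the probe name, the anycast service address, and the
`ip addr del` withdrawal step are illustrative, not from the post; the
DNS query is built by hand so the check does not depend on the resolver
path it is meant to be testing):

```python
# Sketch of a cron-driven resolver health check for an anycast DNS node.
import socket
import struct
import subprocess

ANYCAST_ADDR = "10.0.0.53"          # hypothetical anycast service address
PROBE_NAME = "health.example.com"   # hypothetical record the check queries

def build_query(name, qid=0x1234):
    """Build a minimal DNS A/IN query packet by hand."""
    # Header: id, flags (RD set), 1 question, 0 answer/authority/additional.
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte.
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split("."))
    return header + qname + b"\x00" + struct.pack(">HH", 1, 1)  # A, IN

def resolver_healthy(server="127.0.0.1", timeout=2.0):
    """Send one query to the local named; healthy if any reply arrives."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.sendto(build_query(PROBE_NAME), (server, 53))
        s.recv(512)
        return True
    except OSError:
        return False
    finally:
        s.close()

def withdraw_route():
    """Drop the loopback alias so the routing daemon (OSPF, RIPv2, ...)
    stops advertising the /32 host route.  Command is illustrative."""
    subprocess.call(["ip", "addr", "del", ANYCAST_ADDR + "/32", "dev", "lo"])

def check():
    # Run from cron: withdraw the host route when named stops answering.
    if not resolver_healthy():
        withdraw_route()
```

The reverse direction (re-adding the address once named recovers) works
the same way, gated on several consecutive successful probes.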

--
Gordon A. Lang

- Original Message - 
From: Martin McCormick mar...@dc.cis.okstate.edu
To: bind-us...@isc.org
Sent: Thursday, October 21, 2010 7:32 AM
Subject: DNS Redundancy





Re: DNS Redundancy

2010-10-21 Thread Michael Sinatra

On 10/21/10 08:26, Gordon A. Lang wrote:


It is actually counter-productive to have two resolvers configured
with this architecture, but to circumvent human nature, we publish two.

There is absolutely no functional difference between the two, and
there is no redundancy value for the second one -- they are both
hosted on each and every one of the any-cast servers. The only
reason for the second resolver is to deter people from making
up their own second resolver -- people expect two resolvers, and
if you give them only one, they will go ahead and put something in
as the second resolver -- even if you tell them not to. This is a
very important aspect of having the architecture succeed in our
environment.


I mentioned this in another thread (perhaps on another list!), but there 
are reasons you might want to have two separate redundant anycast clouds 
and configure two servers in client stub resolvers.


Background: We have been doing anycast within our OSPF IGP since 1999 
for DNS.  Initially, we announced all resolver addresses from one set of 
anycast servers, and each server advertised all configured addresses (we 
had 4 back then for historical reasons).  On very rare occasions, we 
would have a weird error where a system would be unable to fork new 
processes (such as the cron script to verify health of the server) or 
the kernel would get into a weird bogged-down state where named would 
effectively stop working but the system wouldn't get taken out of 
routing. (That one turned out to be a kernel bug.) Clients within the 
anycast catchment of such a server would be stuck talking over and over 
to the same broken server.  We now have two separate sets of anycast 
servers so that stub resolvers can still fail over to a different set of 
servers as a last resort.  Having the stub resolver's own failover 
mechanism in place provides an extra layer of protection, provided you 
have separate anycast clouds.  This is now considered a best practice.


See slide 38 of Woody's presentation here:

http://www.pch.net/resources/papers/ipv4-anycast/ipv4-anycast.pdf

michael