Re: [dns-operations] DNS load-balancing/failover using an ASR 9xxx (few questions)

2014-08-15 Thread Anand Buddhdev
On 15/08/2014 00:00, Nat Morris wrote:

 BGP sessions between the ASR 9 and each DNS server in the cluster,
 ExaBGP running on them announcing their loopback/service /32 + /128
 address(es).
 
 Health check scripts on each service to probe for service ability,
 retract the announcement upon failure.

We are doing this exact same thing on many RIPE NCC DNS servers, and it
works very well. The other advantage of BGP is that as soon as you
withdraw the announcement, the router stops sending traffic to the
server. With OSPF, you have timeouts of several seconds before traffic
stops arriving at a dead server.

Regards,

Anand
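
A minimal sketch of the health-check approach discussed above, assuming
the script is started by ExaBGP as a helper process and that ExaBGP's text
API ("announce route ..." / "withdraw route ..." written to stdout) is in
use. The service prefix, probe name, resolver address and interval are
hypothetical placeholders, not values from this thread.

```python
#!/usr/bin/env python3
"""Sketch: announce a service prefix while DNS answers, withdraw when it stops."""
import socket
import struct
import sys
import time

SERVICE_PREFIX = "192.0.2.53/32"   # hypothetical service address
RESOLVER = ("127.0.0.1", 53)       # local DNS server to probe (assumption)
PROBE_NAME = "example.com"         # hypothetical name to query
INTERVAL = 2.0                     # seconds between probes (assumption)


def build_query(name, qid=0x1234):
    """Build a minimal DNS query for an A record, recursion desired."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def is_healthy(server=RESOLVER, name=PROBE_NAME, timeout=1.0):
    """Return True if the server sends back a response with a matching ID."""
    query = build_query(name)
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(query, server)
            reply, _ = sock.recvfrom(4096)
    except OSError:  # covers timeouts and ICMP port unreachable
        return False
    # Same transaction ID and the QR bit set, i.e. it is a response.
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)


def command(healthy, prefix=SERVICE_PREFIX):
    """Text command understood by ExaBGP's process API."""
    verb = "announce" if healthy else "withdraw"
    return f"{verb} route {prefix} next-hop self"


def main():
    state = False  # nothing announced yet
    while True:
        healthy = is_healthy()
        if healthy != state:           # only speak on state changes
            sys.stdout.write(command(healthy) + "\n")
            sys.stdout.flush()         # ExaBGP reads our stdout line by line
            state = healthy
        time.sleep(INTERVAL)


# Only loop when started explicitly, e.g. as "healthcheck.py run".
if __name__ == "__main__" and sys.argv[1:] == ["run"]:
    main()
```

The probe only checks that some response comes back; a stricter check
could also inspect the RCODE or require several consecutive failures
before withdrawing, to avoid flapping on a single lost packet.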
___
dns-operations mailing list
dns-operations@lists.dns-oarc.net
https://lists.dns-oarc.net/mailman/listinfo/dns-operations
dns-jobs mailing list
https://lists.dns-oarc.net/mailman/listinfo/dns-jobs


Re: [dns-operations] DNS load-balancing/failover using an ASR 9xxx (few questions)

2014-08-15 Thread Costantino Andrea (Con)
We do the same with Quagga or BIRD on Linux, running an OSPF daemon for
geo-redundancy and proximity-based load sharing of customer access to
recursive BIND resolvers.

Leaving aside the tedious details specific to our case, we have the
primary/secondary DNS IPs configured as loopback addresses and announced by
the system.
We don't have any specific monitoring to bring OSPF down on the server, since
we have lots of them (4 per POP) plus scripted and human monitoring 24x7, so
if a server has an issue the customer barely notices it before a human acts
and brings down the affected server.
We also had a power surge in a POP that took it entirely offline on the DNS
side (the network was on DC, while the problem affected AC power only for
some racks), and 30 seconds later the service was up again using the DNS
servers of another POP. Very effective given the giant failure we had.

About the timeouts: you don't need to wait if you bring down the loopbacks
instead of the OSPF daemon. After the loopbacks go down, OSPF announces that
it no longer has those IPs, and the upstream routers load-share only across
the remaining servers.
Then you can shut the daemon down.
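
A minimal sketch of that drain step, assuming the service IPs are extra
addresses on the Linux "lo" interface and that iproute2's "ip addr del" is
used to remove them. The addresses and function names are hypothetical
placeholders; the actual setup described here may bring the loopbacks down
differently.

```python
import subprocess

# Hypothetical service addresses configured on the loopback interface.
SERVICE_ADDRS = ["192.0.2.53/32", "2001:db8::53/128"]


def drain_commands(addrs=SERVICE_ADDRS, dev="lo"):
    """Build the iproute2 commands that remove each service address from dev."""
    return [["ip", "addr", "del", addr, "dev", dev] for addr in addrs]


def drain(dry_run=True):
    """Remove the service addresses so the routing daemon stops advertising them."""
    for cmd in drain_commands():
        if dry_run:
            print(" ".join(cmd))             # show what would be run
        else:
            subprocess.run(cmd, check=True)  # needs root privileges
```

Calling drain() prints the commands without touching the system; only
once the addresses are gone and traffic has moved away would the OSPF
daemon itself be stopped.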

I considered using the probe, but found that it was overkill in our case,
since a simple transient hang in the network (STP issue, mismatched cabling)
could have brought down an entire POP over a minor event. We preferred human
monitoring instead, since a 24x7 service was already there for network
alarms and could easily correlate with other causes or a real server issue.

We haven't had a single software failure in more than 7 years with four
different installations (RHEL 3, CentOS 4, 5, 6) in a very complex
environment, due to efficiency and legal constraints (we have upstream DNS
servers providing DNS poisoning for legal requirements and a shared cache
for all the anycast DNS servers).

Ciao,
A.



CONFIDENTIAL: This E-mail and any attachment are confidential and may contain 
reserved information. If you are not one of the  named recipients, please 
notify the sender immediately. Moreover, you should not disclose the contents 
to any other person, or should the information contained be used for any 
purpose or stored or copied in any form.



Re: [dns-operations] DNS load-balancing/failover using an ASR 9xxx (few questions)

2014-08-15 Thread Costantino Andrea (Con)
I forgot to mention that you should disable proxy ARP for connected
interfaces on Linux; otherwise you'll trigger a bug in ASR code (confirmed on
4.2.x) that will loop packets back and forth on the default route instead of
load-sharing to the DNS servers.

If anybody is interested, I can provide the exact sysctl to work around the
issue.



Re: [dns-operations] DNS load-balancing/failover using an ASR 9xxx (few questions)

2014-08-15 Thread Marcelo Gardini do Amaral
On Fri, Aug 15, 2014 at 09:22:02AM +0200, Anand Buddhdev wrote:
 On 15/08/2014 00:00, Nat Morris wrote:
 
  BGP sessions between the ASR 9 and each DNS server in the cluster,
  ExaBGP running on them announcing their loopback/service /32 + /128
  address(es).
  
  Health check scripts on each service to probe for service ability,
  retract the announcement upon failure.
 
 We are doing this exact same thing on many RIPE NCC DNS servers, and it
 works very well. The other advantage of BGP is that as soon as you
 withdraw the announcement, the router stops sending traffic to the
 server. With OSPF, you have timeouts of several seconds before traffic
 stops arriving at a dead server.

You can tweak OSPF timers like hello and dead interval in order to
increase the responsiveness of the health check.
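
As a rough illustration of what the timers buy you, here is a small sketch
under a simplified model: a silent neighbour is declared down when the dead
interval expires after its last received hello, so detection takes between
(dead - hello) and dead seconds. The 10 s / 40 s pair is a commonly used
default; the 1 s / 4 s pair is just an example of tighter timers, and the
function name is made up for illustration.

```python
def detection_window(hello, dead):
    """Best- and worst-case seconds until a silent neighbour is declared down."""
    if dead <= hello:
        raise ValueError("dead interval must exceed hello interval")
    # Server dies just before its next hello: the last hello is already
    # `hello` seconds old, so only (dead - hello) remains.
    # Server dies just after a hello: the full dead interval must expire.
    return (dead - hello, dead)


for hello, dead in [(10, 40), (1, 4)]:
    best, worst = detection_window(hello, dead)
    print(f"hello={hello}s dead={dead}s -> {best}-{worst}s")
# hello=10s dead=40s -> 30-40s
# hello=1s dead=4s -> 3-4s
```

This ignores SPF recalculation and route installation time, which come
on top of the detection delay.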

Cheers,

--
Marcelo Gardini


Re: [dns-operations] DNS load-balancing/failover using an ASR 9xxx (few questions)

2014-08-15 Thread Warren Kumari
On Thu, Aug 14, 2014 at 6:00 PM, Nat Morris n...@nuqe.net wrote:
 On 14 August 2014 18:48, Jake Zack jake.z...@cira.ca wrote:
 In the ASR 9xxx series with IOS XR, the “ipsla” that it has available
 doesn’t seem to do either TCP connections or UDP DNS queries.  It seems my
 only real option is to monitor for ICMP reachability and nothing else.

 Anyone have a better solution?  I’ve considered throwing a wrapper around
 BIND doing OSPF updates and such…but it seems unideal.

What seems unideal about it? It is a well-known and well-understood
technique that relies only on open and tested core features. I'd suggest
doing BGP instead of OSPF, but much of that is personal preference...


 BGP sessions between the ASR 9 and each DNS server in the cluster,
 ExaBGP running on them announcing their loopback/service /32 + /128
 address(es).

Yup, this also only uses well-known, well-understood systems - with
anything like the Cisco solution you end up with vendor lock-in - and
are subject to their whims (like what Jake described). ipsla is not
part of their core features and so changes across releases / platforms.
I'm sure they'd be happy to sell you an ACE though :-)


 Health check scripts on each service to probe for service ability,
 retract the announcement upon failure.

 --
 Nat

 https://nat.ms
 +44 7531 750292




-- 
I don't think the execution is relevant when it was obviously a bad
idea in the first place.
This is like putting rabid weasels in your pants, and later expressing
regret at having chosen those particular rabid weasels and that pair
of pants.
   ---maf


Re: [dns-operations] DNS load-balancing/failover using an ASR 9xxx (few questions)

2014-08-15 Thread Stephen Johnson (DIS)
On Thu, 2014-08-14 at 17:48 +, Jake Zack wrote:
 Anyone doing this?

 Previously I’d been using Cisco 3945’s and 3845’s running standard
 IOS…thus using Cisco IP SLA + track to do DNS queries of each server
 and add/remove them from the cluster.

 In the ASR 9xxx series with IOS XR, the “ipsla” that it has available
 doesn’t seem to do either TCP connections or UDP DNS queries.  It
 seems my only real option is to monitor for ICMP reachability and
 nothing else.

 Anyone have a better solution?  I’ve considered throwing a wrapper
 around BIND doing OSPF updates and such…but it seems unideal.

 -Jake
 DNS Administrator – CIRA (.CA TLD)

We are using a couple of small clusters of Linux servers (Scientific
Linux, a whitebox RHEL distribution) for recursive resolvers. They consist
of 2 load balancers using a CMAN/Pacemaker cluster. The load balancing
is done with the Linux kernel's IP Virtual Server (IPVS) feature. The
resolver IPs are VIPs managed by the cluster. The load balancers are
set up to replicate their connection tables to each other to add
seamless failover capabilities.

Also in the mix I run keepalived on the load balancers. Keepalived
manages the IPVS configuration in conjunction with health checks for
each of the back-end nodes. If a back-end node stops responding, the IPVS
configuration is altered to remove that node from the cluster.

And note that keepalived also implements a VRRP routing daemon for
failover between a set of routers. (We don't use VRRP in our setup.)

There are 4 back-end servers running just BIND as caching name servers
with a few of our main authoritative zones as slaves.  The load
balancers have all of the back-end servers in their configurations, but
we normally only have 2 back-end nodes servicing one of the resolver
VIPs. The other two are set to weight 0. I can alter the weights in the
load balancers to bring back-end nodes in and out of service and to move
them between resolver VIPs.
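
A minimal sketch of changing a back-end weight, assuming it is done with
ipvsadm's edit-server option (-e) directly on the load balancer. The
addresses and the helper name are hypothetical, the forwarding method is
assumed to be direct routing (-g, ipvsadm's default), and in a
keepalived-managed setup the change may instead belong in the keepalived
configuration, since a reload may overwrite manual edits.

```python
import subprocess


def set_weight_cmd(vip, real, weight, port=53, proto="udp", fwd="-g"):
    """Build an ipvsadm command that edits one real server's weight."""
    if weight < 0:
        raise ValueError("weight must be >= 0")
    service_flag = "-u" if proto == "udp" else "-t"
    return [
        "ipvsadm", "-e", service_flag, f"{vip}:{port}",
        "-r", f"{real}:{port}", fwd, "-w", str(weight),
    ]


def set_weight(vip, real, weight, dry_run=True, **kwargs):
    """Print the command, or run it when dry_run is False (needs root)."""
    cmd = set_weight_cmd(vip, real, weight, **kwargs)
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)


# Take a hypothetical back-end out of service for a hypothetical VIP.
set_weight("192.0.2.53", "10.0.0.11", 0)
```

With weight 0 the back-end stays in the table but receives no new
traffic, so setting a non-zero weight again brings it back into service.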

I've clocked a resolver cluster (1 load balancer, 2 back-end nodes, and
named caches flushed) north of 11,000 queries per second before
queries started to fail.

I've been using a similar setup (minus the keepalived) for well over 7
years without any major issues. The resolver clusters have been
running about 3 years without any major issues.

-- 
Stephen L Johnson  stephen.john...@arkansas.gov
Unix Systems Administrator / DNS Hostmaster
Department of Information Systems
State of Arkansas
501-682-4339
