Re: Notify storms

2010-01-21 Thread Todd
I agree - we are removing half the masters in a couple of weeks to help
things.  Slaves only talk to masters; there are no "slaves of slaves",
as we refer to them.  Our architecture goal has been uniformity among
the configurations, and this is part of the price we pay for that.

At a functional level, this will persist, just not with as much
traffic.  As a result of our use of CVS, and our inability to control
notifies, whenever we push out big updates, the first master takes all
the traffic, while the others sit there unused.  We've experimented
with breaking up the notifies by pushing updates in chunks to various
masters, but that really breaks things, both process-wise and
logically.

What we really need is some way to say "there is an update; any one of
these servers has an acceptable version of it" instead of "hey, there's
an update", "hey, there's an update", "hey, there's an update" and
having each slave go to each of the masters.  While it seems trivial
operationally to handle these loads, and we're not concerned about
network bandwidth, a single master can only manage so many transactions
at once.  I doubt we're even in the top 50% of deployments,
zone-count-wise, so I am confident that our number of zones isn't an
issue.  But I suspect that 80+ slaves is a little out there.  With 80
slaves and 1800 zones, each master sends out 144,000 notifies (for
major changes/a master reload), which triggers 144,000 SOA queries back
to the master very quickly.  That is bound to cause delays.

One option we've considered is making our MASTERS and NS records point
to an anycast IP/load balancer so that multiple masters can answer for
the same notify.  Another option would be to stop all notifies
altogether, then figure out a way to trigger them manually (generating
notifies via a perl script/something clever) so we can control where
the notifies come from.  When all the EU DNS servers get notifies first
from an NA master, they grab the data from there, so being able to
control notifies would be nice sometimes.
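
That second option might look something like this - a rough sketch,
assuming BIND 9's global "notify" option and the "rndc notify" command
(which resends NOTIFYs for a named zone; check that your version
supports it), with the changed-zone list as a placeholder:

    // named.conf on the masters: suppress automatic NOTIFYs
    options {
        notify no;    // slaves fall back to refresh timers unless
                      // we trigger NOTIFYs ourselves
    };

    # then, from one chosen master, notify only the changed zones
    while read zone; do
        rndc notify "$zone"
    done < changed-zones.txt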

Thankfully we're mid-rearchitecture, and this will (hopefully) be torn
out soon, but until it is we need to make sure that our users can
manage their changes in a reasonable manner.  A for loop doing rndc
retransfer for changed zones, which seems to bypass all the
congestion, is a short-term fix until we can figure out how to make
things a little smoother.
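
That loop is roughly the following (a sketch; the zone and slave lists
are placeholders, and it assumes key-based ssh from the admin host to
each slave):

    # "rndc retransfer" forces an immediate transfer from the master,
    # skipping the queued SOA serial checks entirely
    for slave in $(cat slaves.txt); do
        while read zone; do
            ssh "$slave" rndc retransfer "$zone"
        done < changed-zones.txt
    done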

Apologies for the wall of text - this is a frequent discussion with
very little in the way of conclusion around here :)

Todd.




On Wed, Jan 20, 2010 at 10:33 PM, Joseph S D Yao j...@tux.org wrote:
 On Wed, Jan 20, 2010 at 03:52:33PM -0500, Todd wrote:
  serial-query-rate

 While this appears to be helping in the lab, it's still taking between
 2 and 3 minutes for each slave to even finish receiving the NOTIFYs
 from the master.  They then start hitting the master(s) with SOA
 queries, which seems to take a really long time.


 Your NOTIFY tree sounds like it's many-to-many.  Maybe you should be
 using a sparser tree.


 --
 /*\
 **
 ** Joe Yao                              j...@tux.org - Joseph S. D. Yao
 **
 \*/



Re: Notify storms

2010-01-20 Thread Matthew Pounsett

On 2010/01/20, at 13:03, Dave Sparro wrote:

  We would like to make this better.
  Can anyone help with ideas on this?  Are we missing something obvious?

 In that situation I'd consider using CVS on all of the servers to
 maintain the DNS data.  Just make all of the servers masters, and
 forget about slaves.

Agreed ... that's definitely one solution.  With your data already in a version 
control system, and that many name servers, you might benefit from replacing 
zone transfers with a configuration management tool (cfengine, bcfg2, etc.) 
which can take care of noticing that there's new data in the version control 
system, getting it onto the slaves, and then telling them to reload or reconfig 
as appropriate (depending on whether it's zone files or named.conf that 
changed).
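
A minimal pull-based sketch of that idea (a cron job on each server;
the paths, CVS layout, and change test are illustrative only):

    #!/bin/sh
    # runs from cron on every name server: update the working copy,
    # then reconfig if named.conf changed, reload if zone data changed
    cd /var/named || exit 1
    updates=$(cvs -q update -dP)
    echo "$updates" | grep -q 'named\.conf' && rndc reconfig
    [ -n "$updates" ] && rndc reload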


Another option if you want to stick with the master/slave approach is to tier 
your slaves.   Reduce the masters to just two or three, and then assign 10 or 
so of the slaves to be intermediate masters.  The intermediates slave from the 
real masters, and then every other server slaves from, at most, two or three of 
the intermediates each.  If you group these appropriately, then you can get it 
down to a maximum of 10 or so slaves talking to any one upstream master, with a 
nice mesh to maintain redundancy.  How you divide them up is up to you ... 
regionally works well though.
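
In named.conf terms, the tiering might look like this (addresses are
documentation placeholders):

    // on an intermediate master: slave the zone from the real
    // masters, and pass NOTIFYs on to its assigned leaf servers
    zone "example.com" {
        type slave;
        masters { 192.0.2.1; 192.0.2.2; };      // the 2-3 real masters
        file "slaves/example.com";
        also-notify { 192.0.2.21; 192.0.2.22; };
    };

    // on a leaf server: slave from two nearby intermediates
    zone "example.com" {
        type slave;
        masters { 192.0.2.11; 192.0.2.12; };
        file "slaves/example.com";
    };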

Matt




Re: Notify storms

2010-01-20 Thread Todd
 serial-query-rate


While this appears to be helping in the lab, it's still taking between
2 and 3 minutes for each slave to even finish receiving the NOTIFYs
from the master.  They then start hitting the master(s) with SOA
queries, which seems to take a really long time.

We're going to keep tuning, but it looks like we've reached some sort
of tipping point where inefficiencies in our methodology, architecture,
and the underlying protocol might be combining to make for less than
ideal conditions for fast changes.

Thanks for this tip ... big 'ah-ha' moment for us.

Cheers,

Todd.


Re: Notify storms

2010-01-20 Thread Joseph S D Yao
On Wed, Jan 20, 2010 at 03:52:33PM -0500, Todd wrote:
  serial-query-rate
 
 While this appears to be helping in the lab, it's still taking between
 2 and 3 minutes for each slave to even finish receiving the NOTIFYs
 from the master.  They then start hitting the master(s) with SOA
 queries, which seems to take a really long time.


Your NOTIFY tree sounds like it's many-to-many.  Maybe you should be
using a sparser tree.


--
/*\
**
** Joe Yao  j...@tux.org - Joseph S. D. Yao
**
\*/


Notify storms

2010-01-18 Thread Todd
Good day all,

We've run into a problem with our DNS servers.  The way we update our
masters is via a CVS checkout and a reload of the modified zones.
Sometimes, though, we need to reload the whole config for big
changes, etc.  When that happens, all 6 masters (I know, we're getting
rid of some) send notifies to all 80+ (I know, we're getting rid of
some) slaves for all 1800 zones.  This causes all the slaves to verify
all 1800 zones on 6 masters, which then delays the changes we made
from actually getting to the slaves.  Right now it's about 2.5 hours
for all slaves to do all zones.

We would like to make this better.  We're trying to figure out what
mechanism might be limiting the rate at which the slave does SOA
checks against the master so it can perform that step quicker.  We
have looked at the zone transfer limits on the master/slave, but that
is related to the transfer mechanism, not the SOA query.
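
The limits in question are presumably these named.conf options (the
values shown are the usual defaults); all of them throttle the
AXFR/IXFR stage, not the SOA queries that precede it:

    options {
        transfers-in     10;   // total concurrent inbound transfers
        transfers-per-ns  2;   // concurrent transfers from one master
        transfers-out    10;   // total concurrent outbound transfers
    };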

Can anyone help with ideas on this?  Are we missing something obvious?

Cheers,

Todd.


Re: Notify storms

2010-01-18 Thread Bryan Irvine
On Mon, Jan 18, 2010 at 1:27 PM, Todd canada...@gmail.com wrote:
 Good day all,

 We've run into a problem with our DNS servers.  The way we update our
 masters is via a CVS checkout and a reload of the modified zones.
 Sometimes, though, we need to reload the whole config for big
 changes, etc.  When that happens, all 6 masters (I know, we're getting
 rid of some) send notifies to all 80+ (I know, we're getting rid of
 some) slaves for all 1800 zones.  This causes all the slaves to verify
 all 1800 zones on 6 masters, which then delays the changes we made
 from actually getting to the slaves.  Right now it's about 2.5 hours
 for all slaves to do all zones.

 We would like to make this better.  We're trying to figure out what
 mechanism might be limiting the rate at which the slave does SOA
 checks against the master so it can perform that step quicker.  We
 have looked at the zone transfer limits on the master/slave, but that
 is related to the transfer mechanism, not the SOA query.

 Can anyone help with ideas on this?  Are we missing something obvious?

Might not be what you are looking for, but it sounds like some of the
ideas presented at infrastructures.org might help.

-B


Re: Notify storms

2010-01-18 Thread Mark Andrews

In message 91aa34af1001181327q7f5de882vf47052ed39d87...@mail.gmail.com, Todd 
writes:
 Good day all,
 
 We've run into a problem with our DNS servers.  The way we update our
 masters is via a CVS checkout and a reload of the modified zones.
 Sometimes, though, we need to reload the whole config for big
 changes, etc.  When that happens, all 6 masters (I know, we're getting
 rid of some) send notifies to all 80+ (I know, we're getting rid of
 some) slaves for all 1800 zones.  This causes all the slaves to verify
 all 1800 zones on 6 masters, which then delays the changes we made
 from actually getting to the slaves.  Right now it's about 2.5 hours
 for all slaves to do all zones.
 
 We would like to make this better.  We're trying to figure out what
 mechanism might be limiting the rate at which the slave does SOA
 checks against the master so it can perform that step quicker.  We
 have looked at the zone transfer limits on the master/slave, but that
 is related to the transfer mechanism, not the SOA query.
 
 Can anyone help with ideas on this?  Are we missing something obvious?

serial-query-rate
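
That is an options-block setting in named.conf; for example (the value
is illustrative, not a recommendation):

    options {
        serial-query-rate 100;  // cap on SOA (serial) queries sent
                                // per second while refreshing zones;
                                // the BIND 9 default is 20
    };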

 
 Cheers,
 
 Todd.
-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: ma...@isc.org