Re: High recursive client counts
Our public DNS servers are on site as well. I use forwarders (as opposed to slaves) from our resolvers to our public DNS servers for our internal domains, and the resolvers still responded for internal domains, even when the recursive count was high and external domains weren't responding.

On Thu, Mar 27, 2014 at 5:26 PM, Mark Andrews ma...@isc.org wrote:
> In message 53349e66.8050...@ksu.edu, Lawrence K. Chen, P.Eng. writes:
>> [...]
>> Though another time when the Procera had stopped passing any traffic, the counts did get really high before they stopped working. Need to work on figuring out how to have it resolve local domains when Internet connection is down.
>
> Slaving the local zones is the simplest solution.
>
> --
> Mark Andrews, ISC
> 1 Seymour St., Dundas Valley, NSW 2117, Australia
> PHONE: +61 2 9871 4742 INTERNET: ma...@isc.org

--
Jason K. Brandt
Systems Administrator
Bradley University
(309) 677-2958

___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
Re: High recursive client counts
Are you using logs on the BIND machine(s)?

Eliezer

On 03/25/2014 04:31 PM, Jason Brandt wrote:
> We recently migrated to BIND for our internal resolvers, and since the migration, we are experiencing periods of high recursive client counts, which will at times cause the BIND server to quit responding.
> [...]
> What are possible causes of high recursive client count? What can be done to prevent this or tune around it?
Re: High recursive client counts
On 03/26/14 04:02, Sam Wilson wrote:
> In article mailman.2530.1395774135.20661.bind-us...@lists.isc.org, Jason Brandt jbra...@fsmail.bradley.edu wrote:
>> For now, I've disabled DNS inspection on our firewall, as it is an ancient Cisco firewall services module, and that seems to have stabilized things, but it's only been 30 minutes or so. Until I get a few days in, I'll keep researching.
>
> We used to run DNS inspection on our FWSMs. We didn't notice any issues with DNS resolution per se, but we did find that turning it off dropped the FWSM CPU from ~70% to less than 30%. We're not aware of any issues that using DNS inspection might have caused.
>
> Sam

I had to get our DNS servers exempted from our Procera, as it was interfering with DNSSEC. The security analyst said it considered some of the large encrypted UDP packets to be P2P.

So, every few days (less during busy times), a recursive caching query server would stop answering, and restarting it would make it work again. It got to the point where I had our monitoring system restart BIND as needed.

Eventually, my manager asked about all the strange notifications, and he then pushed it up to the CISO to get the analyst to make the change to stop interfering with DNS.

They had done a test a few months earlier, and said we didn't complain then. I went back through the logs and found that it had been interfering then, but the weekend test wasn't enough to cause any servers to stop responding.

I didn't think to see what the client counts were. Though another time, when the Procera had stopped passing any traffic, the counts did get really high before the servers stopped working.

Need to work on figuring out how to have it resolve local domains when the Internet connection is down.

--
Who: Lawrence K. Chen, P.Eng. - W0LKC - Sr. Unix Systems Administrator
For: Enterprise Server Technologies (EST) -- SafeZone Ally
Re: High recursive client counts
In article mailman.2530.1395774135.20661.bind-us...@lists.isc.org, Jason Brandt jbra...@fsmail.bradley.edu wrote:
> For now, I've disabled DNS inspection on our firewall, as it is an ancient Cisco firewall services module, and that seems to have stabilized things, but it's only been 30 minutes or so. Until I get a few days in, I'll keep researching.

We used to run DNS inspection on our FWSMs. We didn't notice any issues with DNS resolution per se, but we did find that turning it off dropped the FWSM CPU from ~70% to less than 30%. We're not aware of any issues that using DNS inspection might have caused.

Sam

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: High recursive client counts
The code on our FWSMs isn't the latest release, so that could be part of the issue, but it's been about 16 hours now since I shut it off, and so far so good. I would say, though, that with the other load on our firewalls it's highly possible that they were being overloaded. Unfortunately our MRTG isn't set up to track firewall CPU, so I can't say for sure.

Thanks,
Jason

On Wed, Mar 26, 2014 at 4:02 AM, Sam Wilson sam.wil...@ed.ac.uk wrote:
> We used to run DNS inspection on our FWSMs. We didn't notice any issues with DNS resolution per se, but we did find that turning it off dropped the FWSM CPU from ~70% to less than 30%. We're not aware of any issues that using DNS inspection might have caused.
>
> Sam

--
Jason K. Brandt
Systems Administrator
Re: High recursive client counts
I had it set as:

    policy-map global_policy
     class inspection_default
      inspect dns maximum-length 4096

which is what Cisco recommends. EDNS tests worked fine, but the BIND servers would still get backed up.

On Wed, Mar 26, 2014 at 7:35 AM, Thom, Paul E paul.t...@ssc-spc.gc.ca wrote:
> Do you have the FWSM DNS inspection configured to support EDNS? Not sure if I have seen ASA / PIX code causing that problem when EDNS support was not configured on the firewalls, but it's something to look at.
>
> From: bind-users-bounces+paul.thom=dfo-mpo.gc...@lists.isc.org On Behalf Of Jason Brandt
> Sent: March-26-14 9:09 AM
> To: Sam Wilson
> Cc: comp-protocols-dns-b...@isc.org
> Subject: Re: High recursive client counts
>
>> The code on our FWSMs isn't the latest release, so that could be part of the issue, but it's been about 16 hours now since I shut it off, and so far so good.
>> [...]

--
Jason K. Brandt
Systems Administrator
Re: Re: High recursive client counts
We don't do any NAT at the firewall level; they're all public IPs.

Thanks,
Jason

On Wed, Mar 26, 2014 at 7:51 AM, Timothe Litt l...@acm.org wrote:
> DNS inspection doesn't do anything useful; bind does enough validity checking. UDP inspection suffices to let return packets thru.
>
> Another thing to beware of is NAT - if you do static NAT translation for your nameservers, be sure to specify no-payload, e.g.
>
>     ip nat inside source static tcp/udp 10.0.0.1 53 16.123.213.11 53 extendable no-payload
>
> Otherwise, the router will try to be 'helpful' by modifying the payload - which breaks quite a few things, and not necessarily in obvious ways.
>
> Timothe Litt
> ACM Distinguished Engineer
> --
> This communication may not represent the ACM or my employer's views, if any, on the matters discussed.
>
> On 26-Mar-14 05:02, Sam Wilson wrote:
>> We used to run DNS inspection on our FWSMs. We didn't notice any issues with DNS resolution per se, but we did find that turning it off dropped the FWSM CPU from ~70% to less than 30%.
>> [...]

--
Jason K. Brandt
Systems Administrator
Re: High recursive client counts
In article mailman.2540.1395835774.20661.bind-us...@lists.isc.org, Jason Brandt jbra...@fsmail.bradley.edu wrote:
> The code on our FWSMs isn't the latest release, so that could be part of the issue, but it's been about 16 hours now since I shut it off, and so far so good. I would say though with the other load on our firewalls, it's highly possible that they were being overloaded. Unfortunately our MRTG isn't setup to track firewall CPU, so I can't say for sure.

Logging into your FWSM and doing 'show cpu usage' when things are going badly might be an option, but if you've got MRTG monitoring the 6500 that the FWSM is in you could also have a look at the traffic on the virtual ethernets that connect to the FWSM. Whilst they don't show up on 'show int', they and a 6 Gbps portchannel are visible to SNMP and in 'show firewall module X traffic' (or 'show firewall switch X module Y traffic' in a VSS setup).

Sam

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
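For anyone who wants to script the 'show cpu usage' check Sam describes, here is a minimal Python sketch. It assumes the FWSM prints a single line shaped like `CPU utilization for 5 seconds = 12%; 1 minute: 10%; 5 minutes: 9%`, which is the shape the regex in the Perl script elsewhere in this thread expects. The function names, the sample line and the default threshold are illustrative assumptions, not taken from Cisco documentation.

```python
import re

# Assumed output shape of 'show cpu usage' on the FWSM, e.g.
#   CPU utilization for 5 seconds = 12%; 1 minute: 10%; 5 minutes: 9%
CPU_LINE = re.compile(
    r"5 seconds\s*=\s*(\d+)%;\s*1 minute:\s*(\d+)%;\s*5 minutes:\s*(\d+)%"
)


def parse_cpu_usage(text):
    """Return the three CPU figures as ints, or None if no line matches."""
    match = CPU_LINE.search(text)
    if match is None:
        return None
    five_sec, one_min, five_min = (int(g) for g in match.groups())
    return {"fiveSec": five_sec, "oneMin": one_min, "fiveMin": five_min}


def cpu_too_high(usage, threshold=60):
    """True when the 5-second figure reaches the (hypothetical) threshold."""
    return usage is not None and usage["fiveSec"] >= threshold


if __name__ == "__main__":
    # Made-up sample line; in practice this would be captured from the device.
    sample = "CPU utilization for 5 seconds = 72%; 1 minute: 65%; 5 minutes: 40%"
    usage = parse_cpu_usage(sample)
    print(usage, cpu_too_high(usage))
```

Keeping the parsing separate from the login step means the same function can be fed text captured by whatever means is already in place (an SSH session, a saved log, and so on).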
RE: High recursive client counts
[...]
            ,
            -version => 'snmpv1',
            -port    => 162
        );
        if (!defined($session)) {
            printf("ERROR: %s.\n", $error);
            exit 1;
        }
        my $svSvcName = '1.3.6.1.4.1.77.1.2.3.1.1';
        my $message   = "FWSM CPU TOO HIGH $cpu%";
        my @oids      = ($svSvcName, OCTET_STRING, $message);
        #my @oids;
        my $result = $session->trap(
            -agentaddr   => $monitor,
            -varbindlist => \@oids
            #-varbindlist => [$svSvcName, OCTET_STRING, $message]
        );
        if (!defined($result)) {
            printf("ERROR: %s.\n", $session->error);
            $session->close;
            exit 1;
        }
        $session->close;
        print "Sent Trap \"$message\" to $host\n";
    } #end foreach
} #end sub

-----Original Message-----
From: bind-users-bounces+cc3283=att@lists.isc.org [mailto:bind-users-bounces+cc3283=att@lists.isc.org] On Behalf Of Sam Wilson
Sent: Wednesday, March 26, 2014 1:29 PM
To: comp-protocols-dns-b...@isc.org
Subject: Re: High recursive client counts

> Logging into your FWSM and doing 'show cpu usage' when things are going badly might be an option, but if you've got MRTG monitoring the 6500 that the FWSM is in you could also have a look at the traffic on the virtual ethernets that connect to the FWSM.
> [...]
Re: High recursive client counts
Thanks guys. I appreciate the input. I don't want to derail the list much though, as this is supposed to be more BIND than Cisco :)

At this point my BIND installation seems to be stable, so we'll call it case closed. We do plan on replacing our firewalls in the near future, so hopefully we won't need to put much more effort into it. But again, I appreciate all the help and suggestions; it definitely pushed me in the right direction for finding the problem.

Jason

On Wed, Mar 26, 2014 at 12:56 PM, CARTWRIGHT, CORY C cc3...@att.com wrote:
> Here is a script I wrote to log and send traps. I'm sure you'll have to make a lot of changes but hopefully it can help you get started monitoring the FWSM. You can use this as a template to expand upon.
>
>     #!/usr/bin/perl
>     use strict;
>     use Expect;
>     use Net::Telnet;
>     use Data::Dumper;
>     use POSIX qw(tzset);
>     use lib qw( /usr/local/rrdtool-1.2.13/lib/perl );
>     use RRDs;
>     use File::Copy;
>     use Net::SNMP qw(:asn1);
>
>     ## quick fix for gathering codec data
>     ## not very robust !!!
>     ## author: Cory Cartwright corycartwri...@sbcglobal.net
>     ##
>     ## grab cisco FWSM cpu information for RRD graphing and SNMP trap generation
>     ##
>     $ENV{TZ} = 'EDT';
>     POSIX::tzset();
>
>     my $createRRD = shift || 'false';
>     my $host      = "MY6500|7600 host";
>     my $user      = "router username";
>     my $pass      = "router passwd";
>     my $fwUser    = "FWSM username";
>     my $fwPasswd  = "FWSM password";
>     my $comunity  = "FWSM community string";
>     my $monitor   = 'trap monitor IP';    # source that set and sent the trap
>     my @trapCatchers = qw(array of trap catchers);
>     my $filename  = "/var/voip/fwsm_logger.txt";    # dump file
>     my $DBfile    = '/var/voip/codecDump.csv';
>     my $trapThreshold = '60';    # five sec threshold (%) before we send a trap
>     my $procThreshold = '30';    # threshold (%) before we capture sh proc
>     my %meas_hash = (
>         'fiveSec' => 'fiveSec',
>         'oneMin'  => 'oneMin',
>         'fiveMin' => 'fiveMin',
>     );
>     my $rrd = '/usr/voip/bin/fwcpuRRD.rrd';
>     if (! -e $rrd) {
>         $createRRD = 'true';
>     }
>
>     my $hashRef = doExec();
>     if ($hashRef->{'fiveSec'} >= $trapThreshold) {
>         # send trap
>         print "Sending trap\n";
>         sendTrap($hashRef->{'fiveSec'});
>     }
>     createRRD($rrd, \%meas_hash) if ($createRRD eq 'true');
>     updateRRD($rrd, \%meas_hash, $hashRef);
>     print "struct\n" . Dumper(%meas_hash);
>     print "data\n" . Dumper($hashRef);
>     copy($rrd, "/var/www/voipdata/fwcpuRRD.rrd");
>
>     # Log in to the switch, hop to the FWSM, and read 'sh cpu'.
>     sub doExec {
>         my $exp = new Expect;
>         #$exp->log_stdout(1);
>         $exp->log_file($filename);
>         my $command = "ssh -l $fwUser $host";
>         $exp->spawn($command) or die "Could not spawn $command $!";
>         my $string = qr/passwd/;
>         my $return = $exp->expect(3, $string);
>         $exp->send("$pass\n");
>         $return = $exp->expect(3, '7604-nh1');
>         $exp->send("session slot 3 pro 1\n");
>         $return = $exp->expect(3, 'Password:');
>         $exp->send("x1c2v3\n");
>         $return = $exp->expect(3, 'sipsfw');
>         $exp->send("enable\n");
>         $return = $exp->expect(3, $string);
>         $exp->send("$fwPasswd\n");
>         $return = $exp->expect(3, 'sipsfw#');
>         $exp->send("sh cpu\n");
>         $exp->expect(2);
>         my $cpu = $exp->before();
>         my %cpu = ();
>         if ($cpu =~ /\d\sseconds\s=\s(\d+)\%\;\s\d\sminute\:\s(\d+)\%\;\s\d\sminutes\:\s(\d+)\%/g) {
>             $cpu{'fiveSec'} = $1;
>             $cpu{'oneMin'}  = $2;
>             $cpu{'fiveMin'} = $3;
>             print Dumper(%cpu);
>         }
>         # Above the capture threshold: log extra diagnostics from the FWSM.
>         if ($cpu{'fiveSec'} >= $procThreshold) {
>             my $timestamp = "\nBEGIN: TIME: " . time . " !! " . localtime(time)
>                           . "\n### CPU 5 sec " . $cpu{'fiveSec'} . "\n";
>             $exp->print_log_file($timestamp);
>             $exp->send("no pager\n");
>             $exp->send("sh proc\n");
>             $exp->send("sh conn\n");
>             $exp->send("sh resource usage\n");
>             $exp->expect(3, 'sipsfw#');
>         }
>         $exp->send("exit\n");    # exit enable
>         $exp->expect(1);
>         $exp->send("exit\n");    # exit fw
>         $exp->expect(1);
>         $exp->send("exit\n");    # exit switch
>         $exp->expect(1);
>         $exp->print_log_file("\nEND\n");
>         $exp->soft_close();
>         return (\%cpu);
>     } #end doExec
>
>     sub updateRRD {
>         my ($rrd, $meas_hashRef, $dataHashRef) = @_;
>         my $epoc = time;
>         my $data_string = '';
>         foreach my $cust (sort keys %$meas_hashRef) {
>             my $data = $$dataHashRef{$$meas_hashRef{$cust}} || 0;
>             print "Cust $cust: $data \n";
>             $data_string = $data_string . "$data:";
>         }
>         $data_string =~ s/:$//g;
>         print "rrdtool update $rrd $epoc:$data_string\n";
>         RRDs::updatev $rrd, $epoc . ":" . $data_string;
>         if (my $ERROR = RRDs::error) {
>             warn "$0: unable to update $rrd : $ERROR";
>         }
>     } #end sub
>
>     sub createRRD {
>         my $starttime = time;
>         my $step = (5 * 60);
>         my ($rrd, $meas_hashRef) = @_;
>         print Dumper($meas_hashRef);
>         print "In createRRD: ($starttime,$rrd,$step,$meas_hashRef)\n";
>         my $DS_string = "$rrd --start $starttime --step $step ";
>         foreach (sort keys %{$meas_hashRef}) {
>             print "Key: $_\n";
>             $DS_string = $DS_string . "DS:$_:GAUGE:$step:U:U ";
>     [...]
Re: High recursive client counts
This got me to take a look at rndc recursing on one of our servers. It is disappointing that queries for the same FQDN/type/class from the same client (different source port and query ID though) are handled individually rather than being merged somehow. Is this because of the ID or the source port, both, or something else?

On Wed, Mar 26, 2014 at 2:05 PM, Jason Brandt jbra...@fsmail.bradley.edu wrote:
> Thanks guys. I appreciate the input. I don't want to derail the list much though, as this is supposed to be more BIND than Cisco :)
> [...]
>
> On Wed, Mar 26, 2014 at 12:56 PM, CARTWRIGHT, CORY C cc3...@att.com wrote:
>> Here is a script I wrote to log and sent traps.
>> [...]
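To put a number on how often the same client has the same question outstanding more than once, the dump written by rndc recursing can be tallied. The Python sketch below is only that, a sketch: it assumes each outstanding query appears on a line shaped like `; client 192.0.2.10#53412: id 4711 'www.example.com/A/IN' requesttime 12`, which may differ between BIND releases, and the function name and sample addresses are made up for illustration.

```python
import re
from collections import Counter

# Assumed dump line shape (not verified against every BIND release):
#   ; client 192.0.2.10#53412: id 4711 'www.example.com/A/IN' requesttime 12
ENTRY = re.compile(r"client\s+(\S+?)#(\d+):\s+id\s+(\d+)\s+'([^']+)'")


def duplicate_questions(dump_text):
    """Map (client address, 'name/type/class') to its count, for counts > 1.

    Source port and query ID are parsed but deliberately ignored, so two
    entries that differ only in those fields land in the same bucket.
    """
    seen = Counter()
    for line in dump_text.splitlines():
        match = ENTRY.search(line)
        if match:
            address, _port, _query_id, question = match.groups()
            seen[(address, question.lower())] += 1
    return {key: count for key, count in seen.items() if count > 1}


if __name__ == "__main__":
    sample = (
        "; client 192.0.2.10#53412: id 1 'www.example.com/A/IN' requesttime 12\n"
        "; client 192.0.2.10#53413: id 2 'www.example.com/A/IN' requesttime 12\n"
        "; client 192.0.2.11#40000: id 3 'www.example.com/A/IN' requesttime 11\n"
    )
    for (address, question), count in duplicate_questions(sample).items():
        print(address, question, count)
```

This only measures the duplication; it says nothing about why the server treats the entries separately.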
High recursive client counts
We recently migrated to BIND for our internal resolvers, and since the migration, we are experiencing periods of high recursive client counts, which will at times cause the BIND server to quit responding. As a workaround, I've been able to point the BIND server to a forwarder, bypassing the root hints, to restore stability, but this morning even with the forwarder, our count spiked.

We are using Ubuntu 12.04 LTS, BIND version 9.8.1-P1. The server is configured strictly as a resolver, and is not authoritative for any domains. We have approximately 15-20k client devices on campus. Our average recursive client count is between 10 and 50. When the spikes occur, counts will get upwards of 3-4k (this morning: recursive clients: 2358/9900/1).

What are possible causes of high recursive client count? What can be done to prevent this or tune around it? Obviously raising the max clients doesn't solve the problem, and the forwarder seemed to help, but apparently is still susceptible to the issue.

Any suggestions would be greatly appreciated.

--
Jason K. Brandt
Systems Administrator
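The "recursive clients" figure quoted above comes from rndc status, and it is easy to track over time. A minimal Python sketch follows, assuming the three slash-separated numbers are current / soft limit / hard limit (worth checking against the documentation for your BIND version). The sample text, the function names and the "ten times the usual ceiling" rule are illustrative assumptions only.

```python
import re

# Assumed meaning of the three fields: current / soft limit / hard limit.
RECURSIVE_CLIENTS = re.compile(r"recursive clients:\s*(\d+)/(\d+)/(\d+)")


def parse_recursive_clients(status_text):
    """Extract (current, soft_limit, hard_limit) from 'rndc status' output."""
    match = RECURSIVE_CLIENTS.search(status_text)
    if match is None:
        raise ValueError("no 'recursive clients' line found")
    return tuple(int(g) for g in match.groups())


def is_spiking(current, baseline_max=50, factor=10):
    """Flag counts far above the usual ceiling (hypothetical rule of thumb)."""
    return current > baseline_max * factor


if __name__ == "__main__":
    # Made-up sample; in practice feed in the real output of 'rndc status'.
    sample = "tcp clients: 0/100\nrecursive clients: 2358/9900/10000\n"
    current, soft, hard = parse_recursive_clients(sample)
    print(current, soft, hard, is_spiking(current))
```

Run from cron or a monitoring agent, a check like this gives a timestamped record of when the count leaves its normal 10 to 50 range, which can then be lined up against firewall or network events.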
Re: High recursive client counts
Hi Jason,

I've experienced similar things in the past on 9.8. Since then we've moved to the latest 9.9, but I don't think this is at all version specific (that said, you could obviously try upgrading). I don't have an exact solution for you, but here are some things to check and some personal experiences which might help.

Are the servers in question VM or bare metal? Several years back we made a big push to virtualize everything, and after migrating recursive DNS it worked great for a while...as sites grew we hit a tipping point where VM-based resolvers seemed to introduce additional query latency. These servers were running far below BIND's capabilities, not taxing virtual resources, optimized per all available BIND/OS/virtualization knobs, and using enterprise (read: not just the latest free bits slapped together and expected to work) network, server and hypervisor tech. I spent several months trying to improve the situation and find a real root cause, but on a whim I set up an identical cluster on bare metal...no more problems. I didn't have time to dig further, so we avoid virtualization on busy resolvers (for now at least).

As your client count has grown...are there any bottlenecks on your network that might be unaccounted for? Beyond bandwidth I'm thinking of things like resource-constrained firewalls (are the resolvers in a DMZ?) which could cause queries to be dropped/timed out/retried, etc. I've seen issues where overworked NetOps teams got behind in capacity planning/upgrades, and as clients/#DMZs grew, firewalls couldn't keep up and created all sorts of issues not related to BIND itself.

When the recursive client count backs up, you know more queries than usual are taking longer than expected to get answers...if this is not related to BIND itself, your servers, or the network...a bit of spelunking is in order. Capture some packets with tcpdump, and take a look at rndc recursing output. Take a look at the queries causing delays, dig them manually from various locations, and try to find a common theme. If there is no common theme to the query destinations, then look even closer at your network. :-)

hth

-----Original Message-----
From: Jason Brandt jbra...@fsmail.bradley.edu
Date: Tuesday, March 25, 2014 at 10:31 AM
To: bind-users@lists.isc.org bind-users@lists.isc.org
Subject: High recursive client counts

> We recently migrated to BIND for our internal resolvers, and since the migration, we are experiencing periods of high recursive client counts, which will at times cause the BIND server to quit responding.
> [...]
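The "find a common theme" step can be partly automated by grouping the outstanding queries in the rndc recursing dump by domain. This Python sketch assumes dump lines shaped like `; client 192.0.2.10#53412: id 4711 'www.example.com/A/IN' requesttime 12`; that shape, the two-label grouping rule and every name in the code are assumptions for illustration, so adjust the regex to whatever your BIND version actually writes.

```python
import re
from collections import Counter

# Assumed line shape in the 'rndc recursing' dump file:
#   ; client 192.0.2.10#53412: id 4711 'www.example.com/A/IN' requesttime 12
QUERY = re.compile(r"client\s+\S+?#\d+:.*?'([^'/]+)/([^'/]+)/([^'/]+)'")


def parent_domain(qname, labels=2):
    """Crude grouping key: the last `labels` labels of the query name."""
    parts = qname.rstrip(".").lower().split(".")
    return ".".join(parts[-labels:])


def tally_domains(dump_text):
    """Count outstanding recursive queries per parent domain."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = QUERY.search(line)
        if match:
            counts[parent_domain(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "; client 192.0.2.10#53412: id 1 'www.example.com/A/IN' requesttime 12\n"
        "; client 192.0.2.11#40000: id 2 'mail.example.com/MX/IN' requesttime 3\n"
        "; client 192.0.2.10#53413: id 3 'a.b.example.net/A/IN' requesttime 9\n"
    )
    for domain, count in tally_domains(sample).most_common():
        print(count, domain)
```

If one or two domains dominate the tally, the delay probably sits with their authoritative servers; if the spread looks like ordinary traffic, that points back at the local network path, as suggested above.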
Re: High recursive client counts
Mike, I appreciate your insight here. We are indeed on virtual systems, using enterprise grade hardware as well. I will be doing more investigation today, to see if I can duplicate the behavior, which I have been able to do recently. Your VM vs Physical point is the thing that got me head scratching. As I stated, this is a new system, replacing our old resolvers; however, even though I've had 2 different types of software doing resolution on our old servers, they were actual physical machines. Load in VMWare monitoring shows what you'd normally expect, that the system isn't being taxed heavily, network usage is fairly low. To us, it seems like an application configuration issue. I could definitely see it being a VM issues of some sort too though, with the strange way it's behaving. I'll keep digging and debugging, to see if I can come up with more detail and correlate results to try and come up with a common theme/cause. Thank you for your help. On Tue, Mar 25, 2014 at 10:52 AM, Mike Hoskins (michoski) micho...@cisco.com wrote: Hi Jason, I've experienced similar things in the past on 9.8. Since then we've moved to the latest 9.9, but don't think this is at all version specific (that said, you could obviously try upgrading). I don't have an exact solution for you, but some ideas of things to check and personal experiences which might help you. Are the servers in question VM or bare metal? Several years back we made a big push to virtualize everything, and after migrating recursive DNS it worked great for awhile...as sites grew we hit a tipping point where VM-based resolvers seemed to introduce additional query latency. These servers were running far below BIND's capabilities, not taxing virtual resources, optimized per all available BIND/OS/virtualization knobs, and using enterprise (read: not just the latest free bits slapped together and expected to work) network, server and hypervisor tech. 
I spent several months trying to improve the situation and find a real root cause, but on a whim I set up an identical cluster on bare metal...no more problems. I didn't have time to dig further, so we avoid virtualization on busy resolvers (for now at least).

As your client count has grown...are there any bottlenecks on your network that might be unaccounted for? Beyond bandwidth, I'm thinking of things like resource-constrained firewalls (are the resolvers in a DMZ?) which could cause queries to be dropped/timed out/retried, etc. I've seen issues where overworked NetOps teams got behind in capacity planning/upgrades, and as clients/#DMZs grew, firewalls couldn't keep up and created all sorts of issues not related to BIND itself.

When the recursive client count backs up, you know more queries than usual are taking longer than expected to get answers...if this is not related to BIND itself, your servers, or the network...a bit of spelunking is in order. Capture some packets with tcpdump, and take a look at rndc recursing output. Take a look at the queries causing delays, dig them manually from various locations, and try to find a common theme. If there is no common theme to the query destinations, then look even closer at your network. :-)

hth

-Original Message-
From: Jason Brandt jbra...@fsmail.bradley.edu
Date: Tuesday, March 25, 2014 at 10:31 AM
To: bind-users@lists.isc.org bind-users@lists.isc.org
Subject: High recursive client counts

We recently migrated to BIND for our internal resolvers, and since the migration, we are experiencing periods of high recursive client counts, which will at times cause the BIND server to quit responding. As a workaround, I've been able to point the BIND server to a forwarder, bypassing the root hints, to restore stability, but this morning even with the forwarder, our count spiked.

We are using Ubuntu 12.04 LTS, BIND version 9.8.1-P1. The server is configured strictly as a resolver, and is not authoritative for any domains.
We have approximately 15-20k client devices on campus. Our average recursive client count is between 10 and 50. When the spikes occur, counts will get upwards of 3-4k (this morning: recursive clients: 2358/9900/1).

What are possible causes of high recursive client count? What can be done to prevent this or tune around it? Obviously raising the max clients doesn't solve the problem, and the forwarder seemed to help, but apparently is still susceptible to the issue. Any suggestions would be greatly appreciated.

-- Jason K. Brandt Systems Administrator
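For readers who want to watch the count quoted above over time, here is a minimal Python sketch that pulls the "recursive clients:" line out of `rndc status` output. It assumes the three fields are the current count, the soft limit, and the hard limit, in that order; the names `RecursiveClients`, `parse_recursive_clients`, `is_backed_up`, and `rndc_status` are hypothetical, and the "ten times the normal maximum of 50" alarm rule is only an illustration built from the baseline mentioned in the thread, not a recommended setting.

```python
import re
import subprocess
from typing import NamedTuple, Optional


class RecursiveClients(NamedTuple):
    # Assumed field order: current count / soft limit / hard limit.
    current: int
    soft_limit: int
    hard_limit: int


_PATTERN = re.compile(r"recursive clients:\s*(\d+)/(\d+)/(\d+)")


def parse_recursive_clients(status_text: str) -> Optional[RecursiveClients]:
    """Return the three counters from status text, or None if the line is absent."""
    match = _PATTERN.search(status_text)
    if match is None:
        return None
    return RecursiveClients(*(int(group) for group in match.groups()))


def is_backed_up(counts: RecursiveClients, baseline_max: int = 50, factor: int = 10) -> bool:
    """Hypothetical alarm rule: current count exceeds factor times the usual maximum."""
    return counts.current > baseline_max * factor


def rndc_status() -> str:
    """Run `rndc status` and return its output (needs rndc access on the resolver)."""
    result = subprocess.run(["rndc", "status"], capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    counts = parse_recursive_clients(rndc_status())
    if counts is not None and is_backed_up(counts):
        print("recursive clients backing up:", counts.current)
```

Run from cron or a monitoring check, something like this could log when the spike starts, which may help line it up with packet captures or firewall events.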
Re: High recursive client counts
Cathy,

Thank you for your comments. I will continue to investigate; it helps to have avenues to look down, though. As far as build version, we are aware that we aren't at the current stable release. However, we've tried to stick to the distro release as much as possible, to help streamline patching. But if this continues to be an issue, it's something we will definitely consider.

The thing that's strange to me is that we can mostly alleviate the symptoms by using a forwarder. Currently I'm using an internal Windows 2003 server in the same subnet, on the same switch, to forward through; however, I was previously using 8.8.8.8, and it was behaving well too. It seems to happen worst when simply using the root hints.

The rndc recursing output doesn't seem to be much help. The queries are all over, including google, adobe, amazon, microsoft, etc., as a combination of A//PTR/TXT records, from a variety of different clients on different subnets and in different firewall zones. At a glance, I don't see any correlation.

Again, I'll keep investigating, and appreciate all the input!

Jason

On Tue, Mar 25, 2014 at 12:34 PM, Cathy Almond cat...@isc.org wrote:

Packet tracing and/or looking at rndc recursing is good - then you'll see which client queries are waiting for answers from authoritative servers.

Depending on what you've upgraded from, this might be a problem with whether or not your infrastructure can handle EDNS0 and large packet sizes. Newer versions of BIND set the DO bit by default on the iterative queries, so perhaps some servers are sending back larger responses than you were receiving before. It's worth checking that your network infrastructure can handle both EDNS0 and large UDP packet sizes (and DNS queries via TCP of course too).
See https://www.dns-oarc.net/oarc/services/replysizetest

I should also comment that the distro BIND 9.8 that you're using isn't the current ISC version, so you're missing out on recent fixes - you might be better off with a self-build of 9.8.7-W1 or 9.8.5-W1: http://www.isc.org/downloads/

These also might be helpful:
https://kb.isc.org/article/AA-00771/46/Which-version-of-BIND-do-I-want-to-download-and-install.html
https://kb.isc.org/article/AA-00768/46/Getting-started-with-BIND-how-to-build-and-run-named-with-a-basic-recursive-configuration.html

HTH
Cathy

-- Jason K. Brandt Systems Administrator
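To make the EDNS0 point above concrete, here is a rough Python sketch (standard library only) of what a query with an OPT record and the DO bit looks like on the wire, so it can be sent through a suspect firewall and the reply checked for truncation. The function names (`encode_name`, `build_edns_query`, `is_truncated`, `send_udp`), the 4096-byte advertised payload size, and the fixed query ID are assumptions for illustration; this is not how BIND itself builds queries, and it is no substitute for the reply size test linked above.

```python
import socket
import struct

TYPE_A = 1
TYPE_OPT = 41      # EDNS0 pseudo-record type
CLASS_IN = 1
FLAG_RD = 0x0100   # recursion desired
FLAG_TC = 0x0200   # truncated response
EDNS_DO = 0x8000   # "DNSSEC OK" bit inside the OPT TTL field


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_edns_query(name: str, qtype: int = TYPE_A, payload_size: int = 4096,
                     dnssec_ok: bool = True, query_id: int = 0x1234) -> bytes:
    """Build one question plus an OPT record in the additional section."""
    header = struct.pack("!HHHHHH", query_id, FLAG_RD, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", qtype, CLASS_IN)
    ttl = EDNS_DO if dnssec_ok else 0
    # OPT: root name, type 41, class = advertised UDP size, TTL = flags, RDLEN 0
    opt = b"\x00" + struct.pack("!HHIH", TYPE_OPT, payload_size, ttl, 0)
    return header + question + opt


def is_truncated(response: bytes) -> bool:
    """True if the TC bit is set, meaning the answer did not fit in the UDP reply."""
    (flags,) = struct.unpack("!H", response[2:4])
    return bool(flags & FLAG_TC)


def send_udp(packet: bytes, server: str, timeout: float = 3.0) -> bytes:
    """Send the query to port 53 over UDP; raises socket.timeout if nothing returns."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, 53))
        data, _ = sock.recvfrom(65535)
        return data
```

If a query built this way times out through the firewall while the same query with `dnssec_ok=False` and a small `payload_size` gets answered, that would point toward the middlebox rather than BIND.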
Re: High recursive client counts
Mark,

That's a very good question, and something we had thought of as a possibility as well. I hadn't seen any good information in relation to entropy, so I'll check into your link. We had noticed that on other things as well, due to the virtual environment, but nothing that caused performance issues. I'm not sure how BIND uses random data, but I know it is a requirement. Perhaps someone else knows? From what I saw, it seemed to be used primarily for signing zones.

For now, I've disabled DNS inspection on our firewall, as it is an ancient Cisco firewall services module, and that seems to have stabilized things, but it's only been 30 minutes or so. Until I get a few days in, I'll keep researching.

Again, thanks all. Your input and help are greatly appreciated.

On Tue, Mar 25, 2014 at 1:31 PM, Mark Elkins m...@posix.co.za wrote:

This might be a dumb answer, but as the machine is part of a virtual server, perhaps you have simply run out of entropy? I know it's a resolver... but isn't BIND perhaps using entropy to randomly talk on different ports to get answers?

What about installing the 'haveged' package, www.irisa.fr/caps/projects/hipsor

I don't see this doing any harm. I've personally found that not doing this on virtual machines just makes them 'choke up'.

-- Mark J Elkins, Cisco CCIE
Posix Systems - (South) Africa
m...@posix.co.za
Tel: +27 12 807 0590 Cell: +27 82 601 0496

-- Jason K. Brandt Systems Administrator
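As a quick way to test the entropy theory before installing anything, the kernel's own estimate can be read directly. The sketch below is in Python and assumes a Linux guest that exposes /proc/sys/kernel/random/entropy_avail; the function names and the threshold of 200 bits are hypothetical, chosen only for illustration, and a low reading would suggest rather than prove that the VM is starved.

```python
from pathlib import Path

# Linux-specific path; assumed to exist on the resolver VM.
ENTROPY_PATH = "/proc/sys/kernel/random/entropy_avail"


def read_entropy_bits(path: str = ENTROPY_PATH) -> int:
    """Return the kernel's current entropy estimate, in bits."""
    return int(Path(path).read_text().strip())


def looks_starved(bits: int, threshold: int = 200) -> bool:
    """Hypothetical rule of thumb: flag the pool when it drops below threshold."""
    return bits < threshold


if __name__ == "__main__":
    bits = read_entropy_bits()
    print(f"entropy_avail={bits} starved={looks_starved(bits)}")
```

Sampling this during one of the spikes, and again after installing haveged, would show whether the pool level moves with the problem at all.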