Re: Need to improve named performance
Hi there,

On Sun, 11 Nov 2012, Ed LaFrance wrote:

> Running BIND 9.3.6-P1-RedHat-9.3.6-16.P1.el5 ...

Somebody already said upgrade. Generally that's the first thing to do in a case like this (before asking on mailing lists :).

> The issue is that named is not keeping up with rdns requests. The nameserver is only doing rdns, and it's the only public process on the server (no webhosting, monitoring, etc). When I check the router above this server I'll see 200 - 500 legitimate connections to this server at any given time. ...

I'm not convinced that BIND is the problem. What does 'top' tell you?

Are you running netfilter/iptables on the box? Might be ip_conntrack. I once had an issue with a lot of dropped TCP connections, each of which was hanging around for five days (the default). They filled the connection tracking table. The default is too long, ridiculously so. After I reduced it to something more reasonable the problem went away.

--
73,
Ged.
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
Re: Lots of RSA_verify failed after upgrade to 9.7.7
> But not for 9.7, since 9.7 has been EOL since November 2012. Correct?

Yes, that's correct.

If you're stuck on 9.7 for the time being, you can silence the RSA_verify warnings with the change I mentioned in http://www.mail-archive.com/bind-users@lists.isc.org/msg14747.html

(It's not the fix we used for the maintenance release, but it'll serve.)

--
Evan Hunt -- e...@isc.org
Internet Systems Consortium, Inc.
Re: Need to improve named performance
On 11/10/2012 1:39 PM, Ed LaFrance wrote:

> Hello all -
>
> First post to this list, hope I'm in the right place.
>
> Running BIND 9.3.6-P1-RedHat-9.3.6-16.P1.el5 on a quadcore xeon server (3Ghz) with 2GB RAM. Named is being used only for rDNS queries against our address space.
>
> The issue is that named is not keeping up with rdns requests. The nameserver is only doing rdns, and it's the only public process on the server (no webhosting, monitoring, etc). When I check the router above this server I'll see 200 - 500 legitimate connections to this server at any given time.
>
> This is what's happening: named is not keeping up with the requests, so the network receive queue fills up - I can see this with netstat:
>
>     netstat -tulpn | grep :53
>     Proto Recv-Q Send-Q Local Address       Foreign Address  PID/Program name
>     ...
>     udp   110048 0      xxx.xxx.xxx.xxx:53  0.0.0.0:*        3918/named
>     udp   110048 0      xxx.xxx.xxx.xxx:53  0.0.0.0:*        3918/named
>
> (two different IPs are on this machine to handle rDNS requests)
>
> Once the queue gets near the max value set by sysctl, udp packets start to drop - this can also be seen in netstat:
>
>     netstat -su
>     ...
>     Udp:
>         5157567 packets received
>         9761 packets to unknown port received.
>         1164232 packet receive errors
>         5157554 packets sent
>
> The errors apparently correspond to drops; they only increase when the queue is full. Of course by this point dns queries are timing out.
>
> I've tried increasing the queue size with sysctl using this command:
>
>     sysctl -w net.core.rmem_max=1048576 net.core.rmem_default=10485
>
> then restarting named; that did eliminate the drops, but the queue grows gigantic and I get pretty much 100% dns lookup timeouts at that point.
>
> The server loading is about 2.0 - busy, but not overwhelmed. I can run a shell or even a gui session on it with ease, so it's by no means maxed out. Here's the first slice of top output:
>
>     top - 09:13:38 up 18:40, 1 user, load average: 2.09, 2.05, 2.00
>     Tasks: 175 total, 1 running, 174 sleeping, 0 stopped, 0 zombie
>     Cpu(s): 0.2%us, 0.2%sy, 0.0%ni, 74.8%id, 24.7%wa, 0.0%hi, 0.2%si, 0.0%st
>     Mem: 2074984k total, 1743584k used, 331400k free, 166588k buffers
>     Swap: 4128760k total, 28k used, 4128732k free, 1270032k cached
>
>       PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
>      4509 named 24  0 71004 4580 2036 S  1.3  0.2 0:46.74 named
>      6877 root  15  0  2428 1064  788 R  0.7  0.1 0:00.04 top
>       467 root  10 -5     0    0    0 D  0.3  0.0 2:59.13 kjournald
>      2460 root  18  0  1816  584  484 D  0.3  0.0 3:30.35 syslogd
>         1 root  15  0  2160  644  556 S  0.0  0.0 0:01.08 init
>
> The bottom line is: I need to improve named performance. Tcpdump only shows about 20 requests per second on average, I would estimate. This should be handled easily, but instead it's gagging on it and the requests are stacking up.
>
> If you have any ideas, I welcome your input. Here's named.conf, it's pretty basic for the global config, the data for each zone is stored separately elsewhere:
>
>     options {
>         directory /var;
>         auth-nxdomain no;
>         pid-file /var/run/named/named.pid;
>         allow-recursion { localnets; };
>         allow-transfer { none; };
>     };
>     key rndc-key {
>         algorithm hmac-md5;
>         secret xx;
>     };
>     controls {
>         inet 127.0.0.1 port 953 allow { 127.0.0.1; } keys { rndc-key; };
>     };
>     zone . { type hint; file named.root; };
>     zone 0.0.127.IN-ADDR.ARPA { type master; file localhost.rev; };

I wouldn't expect a nameserver process on Linux, hosting only a few reverse zones and doing nothing else, to be 71 megabytes in size; I just checked one of ours, serving *all* of our internal zone data, forward and reverse authoritative, plus some cached data for a significant number of zones delegated to business partners, and it's less than 100 Mb in size.

Verify from your query logs, or by dumping cache, that it's *only* doing what it is supposed to do, and no more. If you've got a bunch of data in your cache, or a bunch of queries, that's unrelated to serving your reverse DNS, then that's probably the root cause of your problem. Consider turning off recursion, or severely limiting it, in order to enforce that the nameserver is only serving its intended purpose. 2Gb of memory is a little lean for a nameserver serving a *generic* Internet-name-lookup role...

I guess another possibility is that you've gone crazy with your reverse zones (e.g. using $GENERATE willy-nilly), and thus are using up way more memory than you really need, to serve your reverse-resolution needs.

- Kevin
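To put a number on the drops Ed reports, the Udp counters from `netstat -su` can be turned into a loss fraction. The following is only a sketch: the helper names are made up for illustration, and it assumes that "packets received" and "packet receive errors" are counted separately, so that their sum approximates the datagrams that reached a bound socket.

```python
import re


def udp_counters(netstat_su_text):
    """Collect the counters listed under the "Udp:" header of `netstat -su`."""
    counters = {}
    in_udp = False
    for line in netstat_su_text.splitlines():
        stripped = line.strip()
        if stripped.endswith(":"):
            # Section headers look like "Udp:" or "UdpLite:".
            in_udp = (stripped == "Udp:")
            continue
        if not in_udp:
            continue
        m = re.match(r"(\d+)\s+(.*)", stripped)
        if m:
            # Key by the description, minus any trailing period.
            counters[m.group(2).rstrip(".")] = int(m.group(1))
    return counters


def receive_error_fraction(counters):
    """Errors as a share of (received + errors); see the assumption above."""
    received = counters["packets received"]
    errors = counters["packet receive errors"]
    return errors / (received + errors)


# The figures quoted in Ed's message.
SAMPLE = """\
Udp:
    5157567 packets received
    9761 packets to unknown port received.
    1164232 packet receive errors
    5157554 packets sent
"""

if __name__ == "__main__":
    c = udp_counters(SAMPLE)
    print("receive errors: %.1f%% of datagrams" % (100 * receive_error_fraction(c)))
```

On the quoted figures this comes to a bit over 18%. Because the counters are cumulative since boot, sampling twice and comparing the difference gives a better picture of the current drop rate.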
Re: Need to improve named performance
* Ed LaFrance:

> Running BIND 9.3.6-P1-RedHat-9.3.6-16.P1.el5 on a quadcore xeon server (3Ghz) with 2GB RAM. Named is being used only for rDNS queries against our address space.

You should really upgrade to the latest version on that branch (likely bind-9.3.6-20.P1.el5_8.5).

> The bottom line is: I need to improve named performance. Tcpdump only shows about 20 requests per second on average, I would estimate. This should be handled easily, but instead it's gagging on it and the requests are stacking up.

Something is stalling the named process. Try to run

    strace -T -f -p 4509

(4509 is the PID for the named process) and see where named spends its time. The top output you quoted suggests that the process is not spinning in user space.
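Reading `strace -T` output by eye gets tedious, so a small script can add up the time spent per system call. This is a rough sketch, not part of anyone's posted method: it assumes the usual strace layout, where `-T` appends the elapsed time as `<seconds>` at the end of each completed call and `-f` prefixes lines with `[pid N]`. The sample lines are illustrative, with the string arguments shortened to `"..."`.

```python
import re
from collections import defaultdict

LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+)?"                                  # optional "[pid N]" prefix from -f
    r"(?:<\.\.\. (?P<resumed>\w+) resumed>|(?P<name>\w+)\()"   # resumed call, or "name("
    r".*<(?P<secs>\d+\.\d+)>\s*$"                              # trailing "<seconds>" from -T
)


def time_per_syscall(lines):
    """Sum the -T timings per syscall name; lines without a timing are skipped."""
    totals = defaultdict(float)
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            totals[m.group("resumed") or m.group("name")] += float(m.group("secs"))
    return dict(totals)


# Illustrative input only; the "<unfinished ...>" line carries no timing.
SAMPLE = """\
[pid  8351] stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 <0.000109>
[pid  8351] send(3, "..."..., 107, MSG_NOSIGNAL) = 107 <0.015232>
[pid  8353] send(3, "..."..., 103, MSG_NOSIGNAL <unfinished ...>
[pid  8353] <... send resumed> ) = 103 <0.015034>
[pid  8352] <... sendmsg resumed> ) = 56 <0.000104>
"""

if __name__ == "__main__":
    totals = time_per_syscall(SAMPLE.splitlines())
    for name, secs in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        print("%-10s %.6f s" % (name, secs))
```

To use it on a live process, save the trace with something like `strace -T -f -p PID -o trace.txt` and pass the file's lines to `time_per_syscall`. The call at the top of the sorted list is where the process spends most of its wall-clock time inside the kernel.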
Re: Need to improve named performance
Hello -

Thanks for chiming in. Named is PID 8349 in my case. Here's a snippet of the output from strace:

    [pid 8351] time( <unfinished ...>
    [pid 8352] <... sendmsg resumed> ) = 56 <0.000104>
    [pid 8352] recvmsg(515, {msg_name(16)={sa_family=AF_INET, sin_port=htons(38385), sin_addr=inet_addr(205.188.158.143)}, msg_iov(1)=[{Q\0\0\0\1\0\0\0\0\0\1\003157\003161\00272\00264\7in-ad..., 4096}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=0}, 0) = 55 <0.000031>
    [pid 8351] <... time resumed> NULL) = 1352668045 <0.000353>
    [pid 8352] futex(0x9b6aecc, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
    [pid 8351] stat64(/etc/localtime, {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 <0.000109>
    [pid 8351] stat64(/etc/localtime, {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 <0.000086>
    [pid 8351] stat64(/etc/localtime, {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 <0.000084>
    [pid 8351] send(3, <30>Nov 11 13:07:25 named[8349]:..., 107, MSG_NOSIGNAL) = 107 <0.015232>
    [pid 8351] futex(0x9b6aecc, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid 8353] <... futex resumed> ) = 0 <0.052813>
    [pid 8351] <... futex resumed> ) = 1 <0.000125>
    [pid 8353] time(NULL) = 1352668045 <0.000020>
    [pid 8353] stat64(/etc/localtime, {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 <0.000025>
    [pid 8353] stat64(/etc/localtime, {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 <0.000022>
    [pid 8351] sendmsg(513, {msg_name(16)={sa_family=AF_INET, sin_port=htons(38162), sin_addr=inet_addr(205.188.158.207)}, msg_iov(1)=[{@%\204\0\0\1\0\1\0\2\0\1\003249\00221\003140\003204\7in-a..., 138}], msg_controllen=0, msg_flags=0}, 0 <unfinished ...>
    [pid 8353] stat64(/etc/localtime, <unfinished ...>
    [pid 8351] <... sendmsg resumed> ) = 138 <0.000048>
    [pid 8353] <... stat64 resumed> {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 <0.000041>
    [pid 8351] recvmsg(513, <unfinished ...>
    [pid 8353] send(3, <30>Nov 11 13:07:25 named[8349]:..., 103, MSG_NOSIGNAL <unfinished ...>
    [pid 8351] <... recvmsg resumed> {msg_name(16)={sa_family=AF_INET, sin_port=htons(53507), sin_addr=inet_addr(205.188.158.206)}, msg_iov(1)=[{\244\273\0\0\0\1\0\0\0\0\0\1\003246\003161\00272\00264\7in-ad..., 4096}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=0}, 0) = 55 <0.000086>
    [pid 8351] futex(0x9b6aecc, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
    [pid 8353] <... send resumed> ) = 103 <0.015034>
    [pid 8353] futex(0x9b6aecc, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000025>
    [pid 8350] <... futex resumed> ) = 0 <0.051772>
    [pid 8350] time( <unfinished ...>
    [pid 8353] sendmsg(513, {msg_name(16)={sa_family=AF_INET, sin_port=htons(60702), sin_addr=inet_addr(64.12.139.17)}, msg_iov(1)=[{\343F\204\0\0\1\0\1\0\2\0\1\003251\003160\00272\00264\7in-ad..., 151}], msg_controllen=0, msg_flags=0}, 0 <unfinished ...>
    [pid 8350] <... time resumed> NULL) = 1352668045 <0.000210>
    [pid 8353] <... sendmsg resumed> ) = 151 <0.000084>
    [pid 8350] stat64(/etc/localtime, <unfinished ...>
    [pid 8353] recvmsg(513, <unfinished ...>
    [pid 8350] <... stat64 resumed> {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 <0.000085>
    [pid 8353] <... recvmsg resumed> {msg_name(16)={sa_family=AF_INET, sin_port=htons(3794), sin_addr=inet_addr(64.12.139.19)}, msg_iov(1)=[{|\354\0\0\0\1\0\0\0\0\0\1\00230\003160\00272\00264\7in-add..., 4096}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=0}, 0) = 54 <0.000150>
    [pid 8350] stat64(/etc/localtime, <unfinished ...>
    [pid 8353] futex(0x9b6aecc, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
    [pid 8350] <... stat64 resumed> {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 <0.000076>
    [pid 8350] stat64(/etc/localtime, {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0 <0.000029>
    [pid 8350] send(3, <30>Nov 11 13:07:25 named[8349]:..., 102, MSG_NOSIGNAL <unfinished ...>

On 11/11/2012 1:46 PM, Florian Weimer wrote:

> * Ed LaFrance:
>
>> Running BIND 9.3.6-P1-RedHat-9.3.6-16.P1.el5 on a quadcore xeon server (3Ghz) with 2GB RAM. Named is being used only for rDNS queries against our address space.
>
> You should really upgrade to the latest version on that branch (likely bind-9.3.6-20.P1.el5_8.5).
>
>> The bottom line is: I need to improve named performance. Tcpdump only shows about 20 requests per second on average, I would estimate. This should be handled easily, but instead it's gagging on it and the requests are stacking up.
>
> Something is stalling the named process. Try to run strace -T -f -p 4509 (4509 is the PID for the named process) and see where named spends its time. The top output you quoted suggests that the process is not spinning in user space.

--
(800) 362-7579 ext 1
+---+
+ Colocation Dedicated Servers IPv4 IPv6 Transit +
+---+
Connex Internet Services, Inc.   direct: (916) 265-1568
11230 Gold Express Dr #310-313
Re: bind-users Digest, Vol 1361, Issue 2
Did not get your post for some reason. I am running iptables with a simple firewall setup. No idea on ip_conntrack. How do I check it and, if it turns out to be the problem, what setting should I try and how do I change it?

Thanks!
Ed

--
Message: 1
Date: Sun, 11 Nov 2012 12:41:53 +0000 (GMT)
From: G.W. Haywood b...@jubileegroup.co.uk
To: bind-users@lists.isc.org
Subject: Re: Need to improve named performance
Message-ID: pine.lnx.4.64.121236160.19...@mail5.jubileegroup.co.uk
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed

Hi there,

On Sun, 11 Nov 2012, Ed LaFrance wrote:

> Running BIND 9.3.6-P1-RedHat-9.3.6-16.P1.el5 ...

Somebody already said upgrade. Generally that's the first thing to do in a case like this (before asking on mailing lists :).

> The issue is that named is not keeping up with rdns requests. The nameserver is only doing rdns, and it's the only public process on the server (no webhosting, monitoring, etc). When I check the router above this server I'll see 200 - 500 legitimate connections to this server at any given time. ...

I'm not convinced that BIND is the problem. What does 'top' tell you?

Are you running netfilter/iptables on the box? Might be ip_conntrack. I once had an issue with a lot of dropped TCP connections, each of which was hanging around for five days (the default). They filled the connection tracking table. The default is too long, ridiculously so. After I reduced it to something more reasonable the problem went away.

--
73,
Ged.

--
(800) 362-7579 ext 1
+---+
+ Colocation Dedicated Servers IPv4 IPv6 Transit +
+---+
Connex Internet Services, Inc.   direct: (916) 265-1568
11230 Gold Express Dr #310-313   fax: (916) 880-5663
Gold River, CA 95670             http://connexinternet.com
+---+
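On the "how do I check" question, one way to look at connection tracking is to compare the current entry count with the table limit. The sketch below makes assumptions that are not stated in the thread: that the old ip_conntrack module on a 2.6.18-era kernel exposes `ip_conntrack_count` and `ip_conntrack_max` under `/proc/sys/net/ipv4/netfilter` (newer kernels use `nf_conntrack_*` names under `/proc/sys/net/netfilter`), and that the established-TCP timeout lives in `ip_conntrack_tcp_timeout_established` in the same directory. Check which files actually exist on the box before relying on any of it.

```python
from pathlib import Path

# Assumed location for the old ip_conntrack module; adjust for your kernel.
PROC_DIR = Path("/proc/sys/net/ipv4/netfilter")


def read_int(path):
    """Read a single integer from a /proc/sys style file."""
    return int(Path(path).read_text().split()[0])


def table_fill(count, limit):
    """Fraction of the connection tracking table currently in use."""
    return count / limit


def seconds_to_days(seconds):
    """Convert a timeout in seconds to days."""
    return seconds / 86400.0


def conntrack_usage(proc_dir=PROC_DIR, prefix="ip_conntrack"):
    """Return (count, limit, fill fraction) read from the assumed proc files."""
    proc_dir = Path(proc_dir)
    count = read_int(proc_dir / (prefix + "_count"))
    limit = read_int(proc_dir / (prefix + "_max"))
    return count, limit, table_fill(count, limit)


if __name__ == "__main__":
    if (PROC_DIR / "ip_conntrack_count").exists():
        count, limit, fill = conntrack_usage()
        print("conntrack entries: %d of %d (%.0f%% full)" % (count, limit, 100 * fill))
        timeout = read_int(PROC_DIR / "ip_conntrack_tcp_timeout_established")
        print("established TCP timeout: %.1f days" % seconds_to_days(timeout))
    else:
        print("ip_conntrack proc files not found; the module may not be loaded")
```

A timeout of 432000 seconds works out to the five days Ged mentions. If the table turns out to be close to full, the usual remedies are lowering that timeout or raising the table limit with `sysctl -w`. The exact key names depend on the kernel, so treat the ones above as a starting point.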
Re: bind-users Digest, Vol 1361, Issue 2
Hi Kevin -

Well, for some reason your message and someone else's never got back to me; I saw them in the digest instead.

I've got about 30 class C zones on this server and it's only handling rDNS for them; I figure there's a couple thousand actual PTR records. I did log queries for a while and they were all legit PTR lookups.

Here's everything in named.conf except the zones themselves:

    options {
        directory /var;
        auth-nxdomain no;
        pid-file /var/run/named/named.pid;
        allow-recursion { localnets; };
        allow-transfer { none; };
    };
    key rndc-key {
        algorithm hmac-md5;
        secret CeMgS23y0oWE20nyv0x40Q==;
    };
    controls {
        inet 127.0.0.1 port 953 allow { 127.0.0.1; } keys { rndc-key; };
    };
    zone . { type hint; file named.root; };
    zone 0.0.127.IN-ADDR.ARPA { type master; file localhost.rev; };

Here's a couple of zones, they are all pretty much the same:

    acl common-allow-transfer { };
    zone 22.140.204.IN-ADDR.ARPA {
        type master;
        file 2/22.140.204.IN-ADDR.ARPA;
        allow-transfer { common-allow-transfer; };
        notify yes;
    };
    zone 3.245.173.IN-ADDR.ARPA {
        type master;
        file 3/3.245.173.IN-ADDR.ARPA;
        allow-transfer { 69.89.64.5; 65.97.49.34; common-allow-transfer; };
        notify yes;
    };
    zone 92.119.199.IN-ADDR.ARPA {
        type master;
        file 9/92.119.199.IN-ADDR.ARPA;
        allow-transfer { 75.98.129.21/32; 75.98.129.24/32; common-allow-transfer; };
        notify yes;
    };
    ...etc

Thanks,
Ed

On 11/11/2012 1:57 PM, bind-users-requ...@lists.isc.org wrote:

> I wouldn't expect a nameserver process on Linux, hosting only a few reverse zones and doing nothing else, to be 71 megabytes in size; I just checked one of ours, serving *all* of our internal zone data, forward and reverse authoritative, plus some cached data for a significant number of zones delegated to business partners, and it's less than 100 Mb in size.
>
> Verify from your query logs, or by dumping cache, that it's *only* doing what it is supposed to do, and no more. If you've got a bunch of data in your cache, or a bunch of queries, that's unrelated to serving your reverse DNS, then that's probably the root cause of your problem. Consider turning off recursion, or severely limiting it, in order to enforce that the nameserver is only serving its intended purpose. 2Gb of memory is a little lean for a nameserver serving a *generic* Internet-name-lookup role...
>
> I guess another possibility is that you've gone crazy with your reverse zones (e.g. using $GENERATE willy-nilly), and thus are using up way more memory than you really need, to serve your reverse-resolution needs.
>
> - Kevin
Re: Need to improve named performance
* Ed LaFrance:

> Thanks for chiming in. Named is PID 8349 in my case. Here's a snippet of the output from strace:
>
>     [pid 8351] send(3, <30>Nov 11 13:07:25 named[8349]:..., 107, MSG_NOSIGNAL) = 107 <0.015232>
>     [pid 8353] send(3, <30>Nov 11 13:07:25 named[8349]:..., 103,
>     [pid 8353] <... send resumed> ) = 103 <0.015034>

This looks like syslog logging is the culprit: each syslog message takes 15 ms to complete. There could be several causes:

- syslogd is logging synchronously to disk (doing an fsync after each message),
- something else in the system is producing an extremely large number of messages (syslogd is single-threaded), or
- there is a request loop where writing out the syslog message for each reverse DNS request itself requires a reverse DNS lookup.

You should also check if named is expected to log this many messages in the first place. You can pass -s 200 to strace to see more of the logging message, so this should help to identify what's going on.

I don't think this has got anything to do with the particular BIND version you use.
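The 15 ms figure puts a ceiling on throughput that is easy to work out. The sketch below assumes the syslog sends are effectively serialized, which the futex waits around each send() in the trace hint at but do not prove. The messages-per-query value is purely hypothetical, since the thread does not say how many lines named logs per request.

```python
def max_messages_per_second(seconds_per_message):
    """Upper bound on log messages per second if sends happen one at a time."""
    return 1.0 / seconds_per_message


def max_queries_per_second(seconds_per_message, messages_per_query):
    """Upper bound on queries per second when every query logs N messages."""
    return max_messages_per_second(seconds_per_message) / messages_per_query


if __name__ == "__main__":
    per_message = 0.015  # roughly the send() latency seen in the strace output
    print("max log messages/s: %.1f" % max_messages_per_second(per_message))
    for n in (1, 2, 3):  # hypothetical messages logged per query
        print("%d message(s) per query -> at most %.1f queries/s"
              % (n, max_queries_per_second(per_message, n)))
```

At 15 ms per message a serialized logger tops out at about 67 messages per second, so a few log lines per query would already bring the ceiling down towards the roughly 20 requests per second Ed estimated from tcpdump. That fits with the suggestion to check how much named is logging before looking at anything else.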