Re: Need to improve named performance

2012-11-11 Thread G.W. Haywood

Hi there,

On Sun, 11 Nov 2012, Ed LaFrance wrote:


Running BIND 9.3.6-P1-RedHat-9.3.6-16.P1.el5 ...


Somebody already said upgrade.  Generally that's the first thing to do
in a case like this (before asking on mailing lists:).


The issue is that named is not keeping up with rdns requests. The
nameserver is only doing rdns, and it's the only public process on the
server (no webhosting, monitoring, etc).

When I check the router above this server I'll see 200 - 500 legitimate
connections to this server at any given time. ...


I'm not convinced that BIND is the problem.  What does 'top' tell you?

Are you running netfilter/iptables on the box?  Might be ip_conntrack.
I once had an issue with a lot of dropped TCP connections, each of
which was hanging around for five days (the default).  They filled the
connection tracking table.  The default is too long, ridiculously so.
After I reduced it to something more reasonable the problem went away.
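
A minimal sketch of how one might check whether the connection tracking table is close to full. It assumes the counters are exposed under /proc/sys/net/ipv4/netfilter (older ip_conntrack kernels) or /proc/sys/net/netfilter (newer nf_conntrack kernels); the function name and the directory list are illustrative and not part of any existing tool.

```python
from pathlib import Path

# Assumed locations of the conntrack counters; which one exists (if any)
# depends on the kernel and on whether connection tracking is loaded.
CANDIDATE_DIRS = (
    "/proc/sys/net/ipv4/netfilter",   # older kernels: ip_conntrack_*
    "/proc/sys/net/netfilter",        # newer kernels: nf_conntrack_*
)
PREFIXES = ("ip_conntrack", "nf_conntrack")


def conntrack_usage(dirs=CANDIDATE_DIRS):
    """Return (count, limit, fraction_used), or None if no counters are found."""
    for directory in dirs:
        for prefix in PREFIXES:
            count_file = Path(directory) / f"{prefix}_count"
            max_file = Path(directory) / f"{prefix}_max"
            if count_file.is_file() and max_file.is_file():
                count = int(count_file.read_text().strip())
                limit = int(max_file.read_text().strip())
                fraction = count / limit if limit else 0.0
                return count, limit, fraction
    return None


if __name__ == "__main__":
    usage = conntrack_usage()
    if usage is None:
        print("no conntrack counters found (module not loaded?)")
    else:
        count, limit, fraction = usage
        print(f"{count} of {limit} entries in use ({fraction:.0%})")
```

A table sitting near its limit would point in the direction described above. The timeout for established TCP connections is normally the *_tcp_timeout_established file in the same directory (432000 seconds, i.e. five days, by default).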

--

73,
Ged.
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Lots of RSA_verify failed after upgrade to 9.7.7

2012-11-11 Thread Evan Hunt
 But not for 9.7, since 9.7 has been EOL since November 2012. Correct?

Yes, that's correct.

If you're stuck on 9.7 for the time being, you can silence
the RSA_verify warnings with the change I mentioned in
http://www.mail-archive.com/bind-users@lists.isc.org/msg14747.html

(It's not the fix we used for the maintenance release, but
it'll serve.)

-- 
Evan Hunt -- e...@isc.org
Internet Systems Consortium, Inc.


Re: Need to improve named performance

2012-11-11 Thread Kevin Darcy

On 11/10/2012 1:39 PM, Ed LaFrance wrote:

Hello all -

First post to this list; I hope I'm in the right place.

Running BIND 9.3.6-P1-RedHat-9.3.6-16.P1.el5 on a quad-core Xeon server 
(3 GHz) with 2 GB RAM. named is being used only for rDNS queries against 
our address space.


The issue is that named is not keeping up with rdns requests. The 
nameserver is only doing rdns, and it's the only public process on the 
server (no webhosting, monitoring, etc).


When I check the router above this server I'll see 200 - 500 
legitimate connections to this server at any given time. This is 
what's happening: named is not keeping up with the requests, so the 
network receive queue fills up - I can see this with netstat:


netstat -tulpn | grep :53
Proto Recv-Q Send-Q Local Address        Foreign Address  PID/Program name
...
udp   110048      0 xxx.xxx.xxx.xxx:53   0.0.0.0:*        3918/named
udp   110048      0 xxx.xxx.xxx.xxx:53   0.0.0.0:*        3918/named

(two different IPs are on this machine to handle rDNS requests)

Once the queue gets near the max value set by sysctl, udp packets 
start to drop - this can also be seen in netstat:


 netstat -su
...
Udp:
5157567 packets received
9761 packets to unknown port received.
1164232 packet receive errors
5157554 packets sent

The errors apparently correspond to drops; they only increase when the 
queue is full.
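
One way to confirm that the error counter only moves while the queue is full is to sample `netstat -su` twice and compare. The sketch below parses the "Udp:" section into a dictionary and computes the per-counter increase; the function names are illustrative, and the parsing assumes the "<number> <description>" layout shown above.

```python
import re


def parse_udp_counters(netstat_su_output):
    """Extract the counters from the 'Udp:' section of `netstat -su` output."""
    counters = {}
    in_udp = False
    for line in netstat_su_output.splitlines():
        stripped = line.strip()
        if stripped.endswith(":"):
            # Section headers look like "Udp:" or "UdpLite:".
            in_udp = stripped == "Udp:"
            continue
        if not in_udp:
            continue
        match = re.match(r"(\d+)\s+(.*?)\.?$", stripped)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters


def counter_deltas(before, after):
    """Per-counter increase between two samples."""
    return {name: after[name] - before.get(name, 0) for name in after}
```

Feeding it two samples taken a few seconds apart and looking at the "packet receive errors" delta next to the Recv-Q value shows whether drops and a full queue really coincide.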


Of course by this point dns queries are timing out. I've tried 
increasing the queue size with sysctl using this command:


sysctl -w net.core.rmem_max=1048576 net.core.rmem_default=10485

then restarting named; that did eliminate the drops, but the queue 
grows gigantic and I get pretty much 100% dns lookup timeouts at that 
point.


The server load is about 2.0: busy, but not overwhelmed. I can run 
a shell or even a GUI session on it with ease, so it's by no means 
maxed out. Here's the first slice of top output:


top - 09:13:38 up 18:40,  1 user,  load average: 2.09, 2.05, 2.00
Tasks: 175 total,   1 running, 174 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.2%us,  0.2%sy,  0.0%ni, 74.8%id, 24.7%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   2074984k total,  1743584k used,   331400k free,   166588k buffers
Swap:  4128760k total,       28k used,  4128732k free,  1270032k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4509 named     24   0 71004 4580 2036 S  1.3  0.2   0:46.74 named
 6877 root      15   0  2428 1064  788 R  0.7  0.1   0:00.04 top
  467 root      10  -5     0    0    0 D  0.3  0.0   2:59.13 kjournald
 2460 root      18   0  1816  584  484 D  0.3  0.0   3:30.35 syslogd
    1 root      15   0  2160  644  556 S  0.0  0.0   0:01.08 init

The bottom line is: I need to improve named performance. Tcpdump only 
shows about 20 requests per second on average, I would estimate. This 
should be handled easily, but instead it's gagging on it and the 
requests are stacking up. If you have any ideas, I welcome your input. 
Here's named.conf; the global config is pretty basic, and the data 
for each zone is stored separately elsewhere:


options {
        directory "/var";
        auth-nxdomain no;
        pid-file "/var/run/named/named.pid";
        allow-recursion {
                localnets;
        };
        allow-transfer {
                none;
        };
};

key "rndc-key" {
        algorithm hmac-md5;
        secret "xx";
};

controls {
        inet 127.0.0.1 port 953
                allow { 127.0.0.1; } keys { "rndc-key"; };
};

zone "." {
        type hint;
        file "named.root";
};

zone "0.0.127.IN-ADDR.ARPA" {
        type master;
        file "localhost.rev";
};


I wouldn't expect a nameserver process on Linux, hosting only a few 
reverse zones and doing nothing else, to be 71 megabytes in size; I just 
checked one of ours, serving *all* of our internal zone data, forward 
and reverse authoritative, plus some cached data for a significant 
number of zones delegated to business partners, and it's less than 100 
MB in size.


Verify from your query logs, or by dumping cache, that it's *only* doing 
what it is supposed to do, and no more. If you've got a bunch of data in 
your cache, or a bunch of queries, that's unrelated to serving your 
reverse DNS, then that's probably the root cause of your problem. 
Consider turning off recursion, or severely limiting it, in order to 
enforce that the nameserver is only serving its intended purpose. 2 GB of 
memory is a little lean for a nameserver serving a *generic* 
Internet-name-lookup role...
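
As a rough sketch of the query-log check: the snippet below tallies logged queries by type and lists any names that are not PTR lookups under in-addr.arpa. The log line shape ("... query: <name> <class> <type> ...") is an assumption about BIND's query logging and may differ between versions; the function name and the sample lines in the usage note are hypothetical.

```python
import re
from collections import Counter

# Assumed query-log line shape (BIND "queries" category), e.g.
#   client 192.0.2.1#38385: query: 5.22.140.204.in-addr.arpa IN PTR
# Other versions add more fields; only name, class and type are used here.
QUERY = re.compile(r"query: (\S+) (\S+) (\S+)")


def summarize_queries(lines, served_suffix=".in-addr.arpa"):
    """Count queries by type and collect names outside the served suffix."""
    by_type = Counter()
    unexpected = Counter()
    for line in lines:
        match = QUERY.search(line)
        if not match:
            continue
        name, _qclass, qtype = match.groups()
        by_type[qtype] += 1
        is_reverse = name.lower().rstrip(".").endswith(served_suffix)
        if qtype != "PTR" or not is_reverse:
            unexpected[name] += 1
    return by_type, unexpected
```

Calling summarize_queries(open(path)) on the query log (path being wherever the log is written) and finding a non-empty "unexpected" counter would suggest the server is answering more than reverse lookups.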


I guess another possibility is that you've gone crazy with your reverse 
zones (e.g. using $GENERATE willy-nilly), and thus are using up way more 
memory than you really need, to serve your reverse-resolution needs.
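
To illustrate how quickly $GENERATE multiplies records, here is a deliberately simplified expander. It handles only the "$GENERATE start-stop[/step] lhs type rhs" form with a bare "$" placeholder; the "${offset,width,base}" modifiers, escaped dollar signs and optional TTL/class fields are not handled, and the host names in the usage note are made up.

```python
def expand_generate(directive):
    """Expand a simple '$GENERATE start-stop[/step] lhs type rhs' line.

    Simplified on purpose: only a bare '$' is substituted. Modifiers,
    escaped dollar signs and optional TTL/class fields are not supported.
    """
    keyword, span, lhs, rtype, rhs = directive.split()
    if keyword.upper() != "$GENERATE":
        raise ValueError("not a $GENERATE directive")
    bounds, _, step = span.partition("/")
    start, stop = (int(n) for n in bounds.split("-"))
    records = []
    for i in range(start, stop + 1, int(step or 1)):
        records.append((lhs.replace("$", str(i)), rtype, rhs.replace("$", str(i))))
    return records
```

For example, expand_generate("$GENERATE 1-254 $ PTR host-$.example.com.") yields 254 PTR records from a single line, so a handful of such lines per zone adds up across many zones.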


- Kevin



Re: Need to improve named performance

2012-11-11 Thread Florian Weimer
* Ed LaFrance:

 Running BIND 9.3.6-P1-RedHat-9.3.6-16.P1.el5 on a quadcore xeon server
 (3Ghz) with 2GB RAM. Named is being used only for rDNS queries against
 our address space.

You should really upgrade to the latest version on that branch (likely
bind-9.3.6-20.P1.el5_8.5).

 The bottom line is: I need to improve named performance. Tcpdump only
 shows about 20 requests per second on average, I would estimate. This
 should be handled easily, but instead it's gagging on it and the
 requests are stacking up.

Something is stalling the named process.  Try to run strace -T -f -p
4509 (4509 is the PID for the named process) and see where named
spends its time.  The top output you quoted suggests that the process
is not spinning in user space.
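
To see where the time goes without reading the trace by eye, the durations that -T appends can be summed per system call. This is a sketch, not a polished tool: it expects strace's native format, where the duration appears in angle brackets at the end of each completed line, and the function name is illustrative.

```python
import re
from collections import defaultdict

# strace -T appends the time spent in each call as "<seconds>" at end of line.
DURATION = re.compile(r"<(\d+\.\d+)>\s*$")
# With -f, a call either starts on this line ("name(") or completes later on
# a "<... name resumed>" line; only completed lines carry a duration.
# The optional prefix is "[pid N]" (terminal) or a bare PID (with -o FILE).
CALL = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(?:<\.\.\. (\w+) resumed>|(\w+)\()"
)


def time_per_syscall(lines):
    """Sum the -T durations per syscall; returns {name: (calls, seconds)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        duration = DURATION.search(line)
        call = CALL.match(line)
        if not duration or not call:
            continue  # unfinished calls, signals, exit messages, ...
        name = call.group(1) or call.group(2)
        totals[name][0] += 1
        totals[name][1] += float(duration.group(1))
    return {name: (count, seconds) for name, (count, seconds) in totals.items()}
```

Sorting the result by total seconds puts the calls that named actually blocks in at the top.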


Re: Need to improve named performance

2012-11-11 Thread Ed LaFrance

Hello -

Thanks for chiming in. Named is PID 8349 in my case. Here's a snippet of 
the output from strace:


[pid  8351] time( <unfinished ...>
[pid  8352] <... sendmsg resumed> ) = 56 <0.000104>
[pid  8352] recvmsg(515, {msg_name(16)={sa_family=AF_INET, 
sin_port=htons(38385), sin_addr=inet_addr("205.188.158.143")}, 
msg_iov(1)=[{"Q\0\0\0\1\0\0\0\0\0\1\003157\003161\00272\00264\7in-ad"..., 
4096}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, 
cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=0}, 0) = 55 <0.000031>
[pid  8351] <... time resumed> NULL) = 1352668045 <0.000353>
[pid  8352] futex(0x9b6aecc, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  8351] stat64("/etc/localtime", {st_mode=S_IFREG|0644, 
st_size=2819, ...}) = 0 <0.000109>
[pid  8351] stat64("/etc/localtime", {st_mode=S_IFREG|0644, 
st_size=2819, ...}) = 0 <0.000086>
[pid  8351] stat64("/etc/localtime", {st_mode=S_IFREG|0644, 
st_size=2819, ...}) = 0 <0.000084>
[pid  8351] send(3, "<30>Nov 11 13:07:25 named[8349]:"..., 107, 
MSG_NOSIGNAL) = 107 <0.015232>
[pid  8351] futex(0x9b6aecc, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  8353] <... futex resumed> ) = 0 <0.052813>
[pid  8351] <... futex resumed> ) = 1 <0.000125>
[pid  8353] time(NULL) = 1352668045 <0.000020>
[pid  8353] stat64("/etc/localtime", {st_mode=S_IFREG|0644, 
st_size=2819, ...}) = 0 <0.000025>
[pid  8353] stat64("/etc/localtime", {st_mode=S_IFREG|0644, 
st_size=2819, ...}) = 0 <0.000022>
[pid  8351] sendmsg(513, {msg_name(16)={sa_family=AF_INET, 
sin_port=htons(38162), sin_addr=inet_addr("205.188.158.207")}, 
msg_iov(1)=[{"@%\204\0\0\1\0\1\0\2\0\1\003249\00221\003140\003204\7in-a"..., 
138}], msg_controllen=0, msg_flags=0}, 0 <unfinished ...>
[pid  8353] stat64("/etc/localtime",  <unfinished ...>
[pid  8351] <... sendmsg resumed> ) = 138 <0.000048>
[pid  8353] <... stat64 resumed> {st_mode=S_IFREG|0644, st_size=2819, 
...}) = 0 <0.000041>
[pid  8351] recvmsg(513,  <unfinished ...>
[pid  8353] send(3, "<30>Nov 11 13:07:25 named[8349]:"..., 103, 
MSG_NOSIGNAL <unfinished ...>
[pid  8351] <... recvmsg resumed> {msg_name(16)={sa_family=AF_INET, 
sin_port=htons(53507), sin_addr=inet_addr("205.188.158.206")}, 
msg_iov(1)=[{"\244\273\0\0\0\1\0\0\0\0\0\1\003246\003161\00272\00264\7in-ad"..., 
4096}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, 
cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=0}, 0) = 55 <0.000086>
[pid  8351] futex(0x9b6aecc, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  8353] <... send resumed> ) = 103 <0.015034>
[pid  8353] futex(0x9b6aecc, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000025>
[pid  8350] <... futex resumed> ) = 0 <0.051772>
[pid  8350] time( <unfinished ...>
[pid  8353] sendmsg(513, {msg_name(16)={sa_family=AF_INET, 
sin_port=htons(60702), sin_addr=inet_addr("64.12.139.17")}, 
msg_iov(1)=[{"\343F\204\0\0\1\0\1\0\2\0\1\003251\003160\00272\00264\7in-ad"..., 
151}], msg_controllen=0, msg_flags=0}, 0 <unfinished ...>
[pid  8350] <... time resumed> NULL) = 1352668045 <0.000210>
[pid  8353] <... sendmsg resumed> ) = 151 <0.000084>
[pid  8350] stat64("/etc/localtime",  <unfinished ...>
[pid  8353] recvmsg(513,  <unfinished ...>
[pid  8350] <... stat64 resumed> {st_mode=S_IFREG|0644, st_size=2819, 
...}) = 0 <0.000085>
[pid  8353] <... recvmsg resumed> {msg_name(16)={sa_family=AF_INET, 
sin_port=htons(3794), sin_addr=inet_addr("64.12.139.19")}, 
msg_iov(1)=[{"|\354\0\0\0\1\0\0\0\0\0\1\00230\003160\00272\00264\7in-add"..., 
4096}], msg_controllen=20, {cmsg_len=20, cmsg_level=SOL_SOCKET, 
cmsg_type=0x1d /* SCM_??? */, ...}, msg_flags=0}, 0) = 54 <0.000150>
[pid  8350] stat64("/etc/localtime",  <unfinished ...>
[pid  8353] futex(0x9b6aecc, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid  8350] <... stat64 resumed> {st_mode=S_IFREG|0644, st_size=2819, 
...}) = 0 <0.000076>
[pid  8350] stat64("/etc/localtime", {st_mode=S_IFREG|0644, 
st_size=2819, ...}) = 0 <0.000029>
[pid  8350] send(3, "<30>Nov 11 13:07:25 named[8349]:"..., 102, 
MSG_NOSIGNAL <unfinished ...>




--
(800) 362-7579 ext 1

+---+
+ ColocationDedicated Servers   IPv4  IPv6 Transit +
+---+
Connex Internet Services, Inc. direct: (916) 265-1568
11230 Gold Express Dr #310-313

Re: bind-users Digest, Vol 1361, Issue 2

2012-11-11 Thread Ed LaFrance
I did not get your post for some reason. I am running iptables with a 
simple firewall setup. I have no idea about ip_conntrack. How do I check 
whether it is the problem and, if it is, which setting should I change, 
and how?


Thanks!
Ed


--
(800) 362-7579 ext 1

+---+
+ ColocationDedicated Servers   IPv4  IPv6 Transit +
+---+
Connex Internet Services, Inc. direct: (916) 265-1568
11230 Gold Express Dr #310-313fax: (916) 880-5663
Gold River, CA 95670http://connexinternet.com
+---+


Re: bind-users Digest, Vol 1361, Issue 2

2012-11-11 Thread Ed LaFrance

Hi Kevin -

Well, for some reason your message and someone else's never reached me; 
I saw them in the digest instead.


I've got about 30 class C zones on this server and it's only handling 
rDNS for them; I figure there are a couple thousand actual PTR records.


I did log queries for a while and they were all legit PTR lookups. 
Here's everything in named.conf except the zones themselves:


options {
        directory "/var";
        auth-nxdomain no;
        pid-file "/var/run/named/named.pid";
        allow-recursion {
                localnets;
        };
        allow-transfer {
                none;
        };
};

key "rndc-key" {
        algorithm hmac-md5;
        secret "CeMgS23y0oWE20nyv0x40Q==";
};

controls {
        inet 127.0.0.1 port 953
                allow { 127.0.0.1; } keys { "rndc-key"; };
};

zone "." {
        type hint;
        file "named.root";
};

zone "0.0.127.IN-ADDR.ARPA" {
        type master;
        file "localhost.rev";
};

Here's a couple of zones, they are all pretty much the same:

acl common-allow-transfer {
};
zone "22.140.204.IN-ADDR.ARPA" {
        type master;
        file "2/22.140.204.IN-ADDR.ARPA";
        allow-transfer {
                common-allow-transfer;
        };
        notify yes;
};
zone "3.245.173.IN-ADDR.ARPA" {
        type master;
        file "3/3.245.173.IN-ADDR.ARPA";
        allow-transfer {
                69.89.64.5;
                65.97.49.34;
                common-allow-transfer;
        };
        notify yes;
};
zone "92.119.199.IN-ADDR.ARPA" {
        type master;
        file "9/92.119.199.IN-ADDR.ARPA";
        allow-transfer {
                75.98.129.21/32;
                75.98.129.24/32;
                common-allow-transfer;
        };
        notify yes;
};
...etc


Thanks,

Ed



Re: Need to improve named performance

2012-11-11 Thread Florian Weimer
* Ed LaFrance:

 Thanks for chiming in. Named is PID 8349 in my case. Here's a snippet
 of the output from strace:

 [pid  8351] send(3, "<30>Nov 11 13:07:25 named[8349]:"..., 107,
 MSG_NOSIGNAL) = 107 <0.015232>

 [pid  8353] send(3, "<30>Nov 11 13:07:25 named[8349]:"..., 103,

 [pid  8353] <... send resumed> ) = 103 <0.015034>

This looks like syslog logging is the culprit: each syslog message
takes 15 ms to complete.
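
A back-of-the-envelope sketch of what 15 ms per message implies. It assumes log writes are fully serialized (the futex waits in the trace hint at a shared lock, but that is an inference, not something the trace proves); the function name is illustrative.

```python
def max_logged_rate(seconds_per_message):
    """Upper bound on messages per second if log writes are fully serialized."""
    return 1.0 / seconds_per_message


rate = max_logged_rate(0.015)  # roughly 15 ms per send(), as in the trace
print(f"{rate:.0f} messages/s at most")
```

Under that assumption the ceiling is about 67 logged messages per second. Whether the server actually hits it depends on how many messages named logs per query, which the trace alone does not show.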

There could be several causes: syslogd is logging synchronously to
disk (doing an fsync after each message), something else in the system
is producing an extremely large number of messages (syslogd is
single-threaded), or there is a request loop where writing out the
syslog message for each reverse DNS request itself requires a reverse
DNS lookup.

You should also check whether named is expected to log this many
messages in the first place.  You can pass -s 200 to strace to see
more of each log message, which should help to identify what's going on.

I don't think this has got anything to do with the particular BIND
version you use.