Intermittent time-outs

Ivan Lezhnjov Jr. Fri, 10 Feb 2012 06:21:33 -0800

Hi everyone!

Here's a pretty weird problem. We have an NMS server and a bunch of
servers and equipment to keep an eye on with the help of net-snmp.
Intermittently, snmpwalk fails to get data from one of the servers
and gives up with this:


[root@nms ~]# snmpwalk  -v2c -c $PASS $SERVER
Timeout: No Response from $SERVER

What happens to snmpd on the server that is being monitored is the
following (in strace parlance):

[root@server ~]# ps axuwf |grep snmp
root     20630  0.0  0.7  80308 59808 ?        Sl   Jan28  14:29 /usr/sbin/snmpd -Lsd -Lf /dev/null -p /var/run/snmpd.pid -a
root     23964  0.0  0.0   3912   668 pts/3    S+   16:38   0:00  \_ grep snmp
[root@server ~]# strace -p 20630
Process 20630 attached - interrupt to quit
select(18, [17], NULL, NULL, {0, 769000}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0})    = 1 (in [17], left {0, 458000})
read(17, "36\n", 1023)                  = 3
select(18, [17], NULL, NULL, {1, 0})    = 1 (in [17], left {1, 0})
--- SIGCHLD (Child exited) @ 0 (0) ---
read(17, "", 1020)                      = 0
waitpid(24053, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 24053
close(17)                               = 0
pipe([15, 16])                          = 0
pipe([17, 18])                          = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7f92928) = 24126
close(15)                               = 0
close(18)                               = 0
close(16)                               = 0
select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}
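
To put a number on the waits in the trace above, a small script can add
up the select() timeouts that expired before each read() that returned
data. The following is a minimal Python sketch, not anything shipped with
net-snmp: the function name waits_before_reads and the idea of feeding it
saved strace output as text lines are assumptions made for illustration.
It gives a lower bound only, since a select() that returns early (like the
one just before the read above) is not counted, and it does not by itself
show that these waits cause the SNMP timeouts.

```python
import re

# Matches expired select() calls as printed by strace, e.g.
#   select(18, [17], NULL, NULL, {1, 0})    = 0 (Timeout)
SELECT_TIMEOUT = re.compile(
    r"select\(\d+, \[[\d ]+\], NULL, NULL, "
    r"\{(?P<sec>\d+), (?P<usec>\d+)\}\)\s+= 0 \(Timeout\)"
)
# Matches a read() that returned at least one byte, e.g.
#   read(17, "36\n", 1023)                  = 3
READ_DATA = re.compile(r"read\((?P<fd>\d+), .*\)\s+= (?P<n>[1-9]\d*)$")


def waits_before_reads(lines):
    """Return [(fd, seconds_waited), ...], one entry per read() with data.

    seconds_waited is the sum of the select() timeouts that expired since
    the previous data-carrying read (or since the start of the trace).
    """
    results = []
    waited = 0.0
    for line in lines:
        line = line.strip()
        match = SELECT_TIMEOUT.search(line)
        if match:
            waited += int(match.group("sec")) + int(match.group("usec")) / 1_000_000
            continue
        match = READ_DATA.search(line)
        if match:
            results.append((int(match.group("fd")), round(waited, 3)))
            waited = 0.0
    return results
```

Fed the lines shown above, it reports that roughly 10.8 seconds of
select() timeouts elapsed before the read of "36\n" on fd 17.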

My first thought was to check the Internet link (yes, the server is
not on the LAN or in the DMZ), which looked fine. The server is
actually a SIP server handling quite a lot of calls, and the calls
showed no issues related to poor link quality at all.

I then tried to snmpwalk localhost right on the server where snmpd is installed:

[root@server ~]# snmpwalk  -v2c -c $PASS localhost
Timeout: No Response from localhost

Each such timeout lasts only about 20-30 seconds.
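
To check whether the outages really last 20-30 seconds and how often they
come back, the agent can be probed at a fixed interval and the failure
windows recorded. Below is a minimal Python sketch under some assumptions:
snmpget from net-snmp-utils is on the PATH, sysUpTime.0
(1.3.6.1.2.1.1.3.0) is readable with the community string, and the names
snmp_probe and outage_windows are hypothetical helpers, not part of any
existing tool.

```python
import subprocess
import time


def snmp_probe(host, community, oid="1.3.6.1.2.1.1.3.0", timeout_s=2):
    """Return True if one snmpget (no retries) for the OID succeeds."""
    cmd = ["snmpget", "-v2c", "-c", community,
           "-t", str(timeout_s), "-r", "0", host, oid]
    done = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)
    return done.returncode == 0


def outage_windows(probe, samples, interval_s=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Call probe() `samples` times; return (start, end) of each failure run.

    start is the clock value just before the first failed probe of a run,
    end is the clock value just before the first probe that succeeds again
    (or the clock value after the last sample if the run never ended).
    """
    windows = []
    start = None
    for _ in range(samples):
        now = clock()
        ok = probe()
        if not ok and start is None:
            start = now            # a new run of failures begins
        elif ok and start is not None:
            windows.append((start, now))
            start = None           # the agent answered again
        sleep(interval_s)
    if start is not None:
        windows.append((start, clock()))
    return windows
```

For example, outage_windows(lambda: snmp_probe("localhost", community),
samples=600) would send 600 probes about a second apart (longer while
probes are timing out) and return the start and end of each run of failed
probes as time.monotonic() values.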

The problem is that Cacti running on the NMS server stopped drawing
graphs a long time ago, which is why I set out to figure out what was
wrong. So far I haven't been able to pinpoint the problem.
Interestingly, the graphs have started reappearing, but they are
incomplete. It looks to me as if the Cacti poller fails to communicate
with the server's snmpd for whatever reason. See the screenshot at
http://i.imgur.com/PI3yE.jpg to get a visual on these graphs and their
alleged incompleteness. Note that this graph used to work just fine for
some time. The server was set up and configured by someone long before
me, so I don't know whether anything has been changed. I've been told
that there were no changes to the configuration of the NMS and that the
problem arose simply out of the blue. I can't tell how accurate that
information is, but let's just assume it is.

I've no clue how to troubleshoot this further or whether the timeouts
I see in strace are the problem. I thought I'd write this message in
the hope that someone will be able to help.

Here's the system information where snmpd runs:

[root@server ~]# uname -a
Linux server 2.6.18-164.el5PAE #1 SMP Thu Sep 3 04:10:44 EDT 2009 i686 i686 i386 GNU/Linux
[root@server ~]# cat /etc/redhat-release
CentOS release 5.3 (Final)
[root@server ~]# yum list installed |grep snmp
net-snmp.i386                           1:5.3.2.2-5.el5_3.2            installed
net-snmp-devel.i386                     1:5.3.2.2-5.el5_3.2            installed
net-snmp-libs.i386                      1:5.3.2.2-5.el5_3.2            installed
net-snmp-perl.i386                      1:5.3.2.2-5.el5_3.2            installed
net-snmp-utils.i386                     1:5.3.2.2-5.el5_3.2            installed

Btw, I checked some other hosts that have pretty graphs, and these
time-outs don't happen there. The configs of those hosts and of this
one are not awfully different; in fact, they are pretty similar.
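
Since the configs are said to be similar, comparing only the active
directives of the two snmpd.conf files can make the remaining differences
easier to spot. A minimal Python sketch, assuming both files have been
copied to one machine and read into strings; effective_lines and
config_diff are hypothetical helper names, and the comparison ignores
blank lines and # comments only.

```python
import difflib


def effective_lines(text):
    """Return the non-blank, non-comment lines of a config, stripped."""
    lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            lines.append(line)
    return lines


def config_diff(text_a, text_b, name_a="host-a", name_b="host-b"):
    """Return unified-diff lines between the active directives of two configs."""
    return list(
        difflib.unified_diff(
            effective_lines(text_a),
            effective_lines(text_b),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
            n=0,  # no context lines: show only what differs
        )
    )
```

An empty result means the two files differ only in comments and blank
lines; any line starting with + or - is a directive present on one host
but not the other.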

I'd appreciate any help here, 'cuz I'm literally clueless as to how to get rid
of these timeouts.

_______________________________________________
Net-snmp-users mailing list
Net-snmp-users@lists.sourceforge.net
Please see the following page to unsubscribe or change other options:
https://lists.sourceforge.net/lists/listinfo/net-snmp-users
