Hi everyone! Here's a pretty weird problem. We have an NMS server and a bunch of servers and equipment that we monitor with net-snmp. Intermittently, snmpwalk fails to get data from one of the servers and gives up with this:

[root@nms ~]# snmpwalk -v2c -c $PASS $SERVER
Timeout: No Response from $SERVER

Here is what happens to snmpd on the monitored server (in strace parlance):

[root@server ~]# ps axuwf |grep snmp
root 20630 0.0 0.7 80308 59808 ? Sl Jan28 14:29 /usr/sbin/snmpd -Lsd -Lf /dev/null -p /var/run/snmpd.pid -a
root 23964 0.0 0.0 3912 668 pts/3 S+ 16:38 0:00 \_ grep snmp
[root@server ~]# strace -p 20630
Process 20630 attached - interrupt to quit
select(18, [17], NULL, NULL, {0, 769000}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}) = 1 (in [17], left {0, 458000})
read(17, "36\n", 1023) = 3
select(18, [17], NULL, NULL, {1, 0}) = 1 (in [17], left {1, 0})
--- SIGCHLD (Child exited) @ 0 (0) ---
read(17, "", 1020) = 0
waitpid(24053, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 24053
close(17) = 0
pipe([15, 16]) = 0
pipe([17, 18]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7f92928) = 24126
close(15) = 0
close(18) = 0
close(16) = 0
select(18, [17], NULL, NULL, {1, 0}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}) = 0 (Timeout)
select(18, [17], NULL, NULL, {1, 0}

My first thought was to check the Internet link (yes, the server is not on the LAN or in the DMZ), which looked fine. The server is actually a SIP server handling quite a lot of calls, and there have been no call issues related to poor link quality at all. I then tried to snmpwalk localhost right on the server where snmpd is installed:

[root@server ~]# snmpwalk -v2c -c $PASS localhost
Timeout: No Response from localhost

Each of these timeouts lasts only about 20-30 s. The problem is that Cacti, running on the NMS server, stopped drawing graphs a long time ago, which is why I set out to figure out what was going on. So far, however, I haven't been able to pinpoint the problem. Interestingly, the graphs have started reappearing, but they're incomplete. It looks to me as if the Cacti poller fails to communicate with the server's snmpd for whatever reason. See the screenshot at http://i.imgur.com/PI3yE.jpg for a visual of these graphs and their gaps. Note that this graph used to work just fine for some time.

This server was set up and configured by someone long before me, so I don't know whether anything has been changed. I've been told that there were no changes to the NMS configuration and that the problem simply arose out of the blue. I can't tell how accurate that information is, but let's assume it is. I have no clue how to troubleshoot this further, or whether the timeouts I see in strace are the problem. I thought I'd write this message in the hope that someone will be able to help.
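
One rough way to check whether the select() timeouts in the strace output line up with the 20-30 s outages would be to total up each run of consecutive timeouts in a saved trace. Below is a minimal Python sketch of that idea. It assumes the trace was written to a file (for example with strace's -o option); the function name timeout_runs and the regular expression are my own and not part of strace or net-snmp. Note that it adds up the timeout values passed to select(), not measured wall-clock time, so the result is only an approximation.

```python
import re

# Matches an strace line for a select() call that returned 0 (timed out).
# The two groups are the seconds and microseconds of the requested timeout.
TIMEOUT_RE = re.compile(
    r"select\(\d+, \[[\d ]*\], NULL, NULL, \{(\d+), (\d+)\}\)\s*= 0 \(Timeout\)"
)


def timeout_runs(lines):
    """Return the total requested wait, in seconds, of each run of
    consecutive select() timeouts found in the given strace lines."""
    runs = []
    current_us = 0  # microseconds accumulated in the current run
    for line in lines:
        if not line.strip():
            continue  # blank lines do not end a run
        match = TIMEOUT_RE.search(line)
        if match:
            current_us += int(match.group(1)) * 1_000_000 + int(match.group(2))
        elif current_us:
            # Any other syscall or signal line closes the current run.
            runs.append(current_us / 1_000_000)
            current_us = 0
    if current_us:
        runs.append(current_us / 1_000_000)
    return runs
```

Calling timeout_runs(open("snmpd.strace")), where snmpd.strace is a hypothetical file name, returns one number per stretch during which snmpd did nothing but wait on the descriptor. If those numbers are close to the length of the snmpwalk outages, the two are probably related; running strace with -tt would give real timestamps to confirm that.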

Here's the system information for the server where snmpd runs:

[root@server ~]# uname -a
Linux server 2.6.18-164.el5PAE #1 SMP Thu Sep 3 04:10:44 EDT 2009 i686 i686 i386 GNU/Linux
[root@server ~]# cat /etc/redhat-release
CentOS release 5.3 (Final)
[root@server ~]# yum list installed |grep snmp
net-snmp.i386 1:5.3.2.2-5.el5_3.2 installed
net-snmp-devel.i386 1:5.3.2.2-5.el5_3.2 installed
net-snmp-libs.i386 1:5.3.2.2-5.el5_3.2 installed
net-snmp-perl.i386 1:5.3.2.2-5.el5_3.2 installed
net-snmp-utils.i386 1:5.3.2.2-5.el5_3.2 installed

By the way, I checked some other hosts that have good-looking graphs, and no such timeouts occur there. The configs for these two hosts are not very different; in fact, they are pretty similar.

I'd appreciate any help here, because I'm clueless as to how to get rid of these timeouts.

_______________________________________________
Net-snmp-users mailing list
Net-snmp-users@lists.sourceforge.net
Please see the following page to unsubscribe or change other options:
https://lists.sourceforge.net/lists/listinfo/net-snmp-users