Re: [Nagios-users] Nagios Core 3.2.3 host check retry interval
On Fri, 2010-11-19 at 11:20 -0500, Chris Beattie wrote:
> This time I'm trying a nearly-stock nagios.cfg file. The one I've been using predates Nagios 3.0. Though it's been updated some, it doesn't contain all the more-recent settings.

I was out of town for a bit. This is still happening, but not all the time. Most of the host checks happen 70 seconds apart, but the too-closely spaced ones are usually 20 seconds apart. I don't know how long this has been the case. It turns out it doesn't usually result in a notification, so nobody's complaining.

[11-30-2010 17:13:03] SERVICE ALERT: bgcprodiceweb4d;Service: ScaleOut;CRITICAL;SOFT;1;SOSS: Not found
[11-30-2010 17:14:33] SERVICE ALERT: bgcprodiceweb4d;Service: AntiVirus;WARNING;SOFT;1;No data was received from host!
[11-30-2010 17:14:43] HOST ALERT: bgcprodiceweb4d;DOWN;SOFT;1;CRITICAL - 10.3.54.208: rta nan, lost 100%
[11-30-2010 17:15:03] HOST ALERT: bgcprodiceweb4d;UP;SOFT;2;OK - 10.3.54.208: rta 33.504ms, lost 0%

Nothing in this message is intended to make or accept an offer or to form a contract, except that an attachment that is an image of a contract bearing the signature of an officer of our company may be or become a contract. This message (including any attachments) is intended only for the use of the individual or entity to whom it is addressed. It may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law or may constitute as attorney work product. If you are not the intended recipient, we hereby notify you that any use, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this message in error, please notify us immediately by telephone and delete this message immediately. Thank you.

___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
Re: [Nagios-users] Nagios Core 3.2.3 host check retry interval
On Tue, 2010-11-16 at 22:52 +0100, Andreas Ericsson wrote:
> That one was in 3.2.2 too though. Could you try un-commenting the lines mentioned there and see if that helps?

It looks like something weird is still happening after making that change. I checked some more hosts and the retry_interval is low, but only for HOST UP alerts.

[11-18-2010 01:23:31] SERVICE ALERT: hcsprodnwweb5;Service: Epilog;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[11-18-2010 01:23:41] HOST ALERT: hcsprodnwweb5;DOWN;SOFT;1;CRITICAL - 10.3.2.177: rta nan, lost 100%
[11-18-2010 01:24:01] HOST ALERT: hcsprodnwweb5;UP;SOFT;2;OK - 10.3.2.177: rta 1.943ms, lost 0%

[11-18-2010 01:32:51] HOST ALERT: wwwhost;DOWN;SOFT;2;CRITICAL - 10.3.1.11: rta nan, lost 100%
[11-18-2010 01:34:02] HOST ALERT: wwwhost;DOWN;HARD;3;CRITICAL - 10.3.1.11: rta nan, lost 100%
[11-18-2010 01:34:21] HOST ALERT: wwwhost;UP;HARD;1;OK - 10.3.1.11: rta 115.733ms, lost 20%

But sometimes it works the way I expect it to.

[11-18-2010 01:38:41] HOST ALERT: wwwhost;DOWN;SOFT;2;CRITICAL - 10.3.1.11: rta nan, lost 100%
[11-18-2010 01:39:51] HOST ALERT: wwwhost;DOWN;HARD;3;CRITICAL - 10.3.1.11: rta 488.367ms, lost 80%
[11-18-2010 01:49:21] HOST ALERT: wwwhost;UP;HARD;1;OK - 10.3.1.11: rta 31.928ms, lost 0%

I'm going to try reverting back to Nagios 3.2.1 to see what happens. It's possible I had the problem then but never noticed.
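The observation in the message above (short gaps only before HOST UP alerts) can be checked mechanically by grouping the gap between consecutive host alerts by state transition. The following is a minimal sketch, not part of Nagios: the function name `gaps_by_transition` is hypothetical, and it assumes log lines with human-readable timestamps in the form quoted in this thread. The sample lines are copied from the message above.

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical helper for illustration only. Assumes the timestamp and
# field layout of the HOST ALERT lines quoted in this thread.
ALERT = re.compile(
    r"\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\] HOST ALERT: "
    r"([^;]+);(UP|DOWN|UNREACHABLE);"
)


def gaps_by_transition(lines):
    """Map (previous_state, new_state) to the gaps in seconds seen for it."""
    events = defaultdict(list)
    for line in lines:
        match = ALERT.search(line)
        if match:
            when = datetime.strptime(match.group(1), "%m-%d-%Y %H:%M:%S")
            events[match.group(2)].append((when, match.group(3)))

    result = defaultdict(list)
    for host_events in events.values():
        host_events.sort()  # chronological order per host
        for (t0, s0), (t1, s1) in zip(host_events, host_events[1:]):
            result[(s0, s1)].append(int((t1 - t0).total_seconds()))
    return dict(result)


SAMPLE = [
    "[11-18-2010 01:23:41] HOST ALERT: hcsprodnwweb5;DOWN;SOFT;1;CRITICAL - 10.3.2.177: rta nan, lost 100%",
    "[11-18-2010 01:24:01] HOST ALERT: hcsprodnwweb5;UP;SOFT;2;OK - 10.3.2.177: rta 1.943ms, lost 0%",
    "[11-18-2010 01:32:51] HOST ALERT: wwwhost;DOWN;SOFT;2;CRITICAL - 10.3.1.11: rta nan, lost 100%",
    "[11-18-2010 01:34:02] HOST ALERT: wwwhost;DOWN;HARD;3;CRITICAL - 10.3.1.11: rta nan, lost 100%",
    "[11-18-2010 01:34:21] HOST ALERT: wwwhost;UP;HARD;1;OK - 10.3.1.11: rta 115.733ms, lost 20%",
    "[11-18-2010 01:38:41] HOST ALERT: wwwhost;DOWN;SOFT;2;CRITICAL - 10.3.1.11: rta nan, lost 100%",
    "[11-18-2010 01:39:51] HOST ALERT: wwwhost;DOWN;HARD;3;CRITICAL - 10.3.1.11: rta 488.367ms, lost 80%",
    "[11-18-2010 01:49:21] HOST ALERT: wwwhost;UP;HARD;1;OK - 10.3.1.11: rta 31.928ms, lost 0%",
]

for transition, gaps in gaps_by_transition(SAMPLE).items():
    print(transition, gaps)
```

Gaps that span two separate incidents (for example UP followed much later by DOWN) say nothing about the retry interval, so only the DOWN-to-DOWN and DOWN-to-UP groups are relevant here.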
Re: [Nagios-users] Nagios Core 3.2.3 host check retry interval
On Tue, 2010-11-16 at 22:52 +0100, Andreas Ericsson wrote:
> http://git.op5.org/git/?p=nagios.git;a=commitdiff;h=1149d275011d7c4d8631b44dbba30ebdb4d7e83f
> That one was in 3.2.2 too though. Could you try un-commenting the lines mentioned there and see if that helps? I won't revert that patch, but it

Thanks for the help. So I can make sure I've correctly done what you asked, this is what I did. I removed lines 1415 and 1419 below from checks.c, then did a make clean, make all, make install, and restarted Nagios.

1414:/* Below removed 08/04/2010 EG - http://tracker.nagios.org/view.php?id=128 */
1415-/*
1416-temp_service->state_type=HARD_STATE;
1417-temp_service->last_hard_state=temp_service->current_state;
1418-temp_service->current_attempt=1;
1419-*/

If that's right, I'll keep an eye on the frequency of our host alerts and see what happens.
Re: [Nagios-users] Nagios Core 3.2.3 host check retry interval
On 11/17/2010 03:55 PM, Chris Beattie wrote:
> On Tue, 2010-11-16 at 22:52 +0100, Andreas Ericsson wrote:
>> http://git.op5.org/git/?p=nagios.git;a=commitdiff;h=1149d275011d7c4d8631b44dbba30ebdb4d7e83f
>> That one was in 3.2.2 too though. Could you try un-commenting the lines mentioned there and see if that helps? I won't revert that patch, but it
>
> Thanks for the help. So I can make sure I've correctly done what you asked, this is what I did. I removed lines 1415 and 1419 below from checks.c, then did a make clean, make all, make install, and restarted Nagios.

That sounds about right, yes.

> 1414:/* Below removed 08/04/2010 EG - http://tracker.nagios.org/view.php?id=128 */
> 1415-/*
> 1416-temp_service->state_type=HARD_STATE;
> 1417-temp_service->last_hard_state=temp_service->current_state;
> 1418-temp_service->current_attempt=1;
> 1419-*/
>
> If that's right, I'll keep an eye on the frequency of our host alerts and see what happens.

Neat. Thanks.

--
Andreas Ericsson andreas.erics...@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace.
[Nagios-users] Nagios Core 3.2.3 host check retry interval
I noticed something curious. It looks like Nagios 3.2.3 is making on-demand host checks faster than the retry_interval should allow. The interval_length is set to 60 and the retry_interval is set to 1. Nagios and the plugins were compiled from source on CentOS 5.5 x64.

I'm not sure if this is related to Yu Watanabe's problem (http://www.mail-archive.com/nagios-users@lists.sourceforge.net/msg34042.html) because I didn't start having it until after I upgraded to 3.2.3.

Here are some alerts from October when I was running Nagios 3.2.1. There were service alerts too, but the host checks do not occur less than one minute from each other:

--
[10-10-2010 06:41:29] HOST ALERT: wwwhost;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 50.10 ms
[10-10-2010 06:28:40] HOST ALERT: wwwhost;DOWN;HARD;3;PING CRITICAL - Packet loss = 100%
[10-10-2010 06:27:29] HOST ALERT: wwwhost;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
[10-10-2010 06:26:19] HOST ALERT: wwwhost;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
--

Here's some from earlier this month, after I'd switched from check_ping to check_icmp. Again, there were service alerts, but the host checks are still about a minute apart:

--
[11-07-2010 21:55:53] HOST ALERT: wwwhost;UP;SOFT;2;OK - 10.3.1.11: rta 4.480ms, lost 0%
[11-07-2010 21:54:43] HOST ALERT: wwwhost;DOWN;SOFT;1;CRITICAL - 10.3.1.11: rta nan, lost 100%
--
[11-09-2010 23:40:15] HOST ALERT: wwwhost;UP;SOFT;2;OK - 10.3.1.11: rta 1.018ms, lost 0%
[11-09-2010 23:39:15] HOST ALERT: wwwhost;DOWN;SOFT;1;CRITICAL - 10.3.1.11: rta 650.987ms, lost 80%
--

On November 12th, I upgraded to Nagios 3.2.3 and the 1.4.15 plugins, and got this later that evening. The host checks were only about 20 seconds apart:

--
[11-12-2010 23:46:43] SERVICE ALERT: wwwhost;Counter: IIS Web Connections;OK;SOFT;2;Web Sessions: 2
[11-12-2010 23:45:14] HOST ALERT: wwwhost;UP;SOFT;2;OK - 10.3.1.11: rta 0.985ms, lost 0%
[11-12-2010 23:44:53] HOST ALERT: wwwhost;DOWN;SOFT;1;CRITICAL - 10.3.1.11: rta 355.633ms, lost 80%
[11-12-2010 23:44:44] SERVICE ALERT: wwwhost;Counter: IIS Web Connections;WARNING;SOFT;1;No data was received from host!
--

Two days later, it looked like it was behaving properly:

--
[11-14-2010 23:44:57] HOST ALERT: wwwhost;UP;SOFT;2;OK - 10.3.1.11: rta 1.338ms, lost 0%
[11-14-2010 23:44:27] SERVICE ALERT: wwwhost;Service: Snare;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
[11-14-2010 23:44:27] SERVICE ALERT: wwwhost;Service: RServer3;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
[11-14-2010 23:43:34] HOST ALERT: wwwhost;DOWN;SOFT;1;CRITICAL - 10.3.1.11: rta 860.577ms, lost 80%
[11-14-2010 23:43:22] SERVICE ALERT: wwwhost;Service: Epilog;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
--
[11-14-2010 08:56:55] HOST ALERT: wwwhost;UP;SOFT;2;OK - 10.3.1.11: rta 2.633ms, lost 0%
[11-14-2010 08:55:45] HOST ALERT: wwwhost;DOWN;SOFT;1;CRITICAL - 10.3.1.11: rta 518.822ms, lost 80%
[11-14-2010 08:55:36] SERVICE ALERT: wwwhost;Counter: IIS Web Connections;WARNING;SOFT;1;No data was received from host!
--

Last night, however, the host got rechecked at short intervals:

--
[11-15-2010 23:56:09] HOST ALERT: wwwhost;UP;SOFT;3;WARNING - 10.3.1.11: rta 89.448ms, lost 40%
[11-15-2010 23:55:39] HOST ALERT: wwwhost;DOWN;SOFT;2;CRITICAL - 10.3.1.11: rta 984.594ms, lost 80%
[11-15-2010 23:55:21] HOST ALERT: wwwhost;DOWN;SOFT;1;CRITICAL - 10.3.1.11: rta 738.100ms, lost 80%
[11-15-2010 23:55:09] SERVICE ALERT: wwwhost;CPU;WARNING;SOFT;1;No data was received from host!
[11-15-2010 23:54:00] HOST FLAPPING ALERT: wwwhost;STARTED; Host appears to have started flapping (23.0% change 20.0% threshold)
[11-15-2010 23:53:59] HOST ALERT: wwwhost;UP;HARD;1;WARNING - 10.3.1.11: rta 183.851ms, lost 60%
[11-15-2010 23:53:29] HOST ALERT: wwwhost;DOWN;HARD;3;CRITICAL - 10.3.1.11: rta nan, lost 100%
[11-15-2010 23:53:29] SERVICE
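With interval_length set to 60 and retry_interval set to 1, as stated in the message above, host rechecks would be expected roughly 60 seconds apart. The comparison done by eye on these log excerpts can be sketched as a small script. This is only an illustration under stated assumptions: `host_alert_gaps` is a hypothetical helper, not a Nagios tool, and it assumes the human-readable timestamp format quoted in this thread. The sample lines are the November 15th alerts from the message.

```python
import re
from datetime import datetime

# Hypothetical helper for illustration only. Matches HOST ALERT lines in
# the form quoted in this thread; HOST FLAPPING ALERT and SERVICE ALERT
# lines do not match and are ignored.
HOST_ALERT = re.compile(
    r"\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\] HOST ALERT: ([^;]+);"
)


def host_alert_gaps(lines, host, interval_length=60, retry_interval=1):
    """Return (earlier, later, gap_seconds, too_fast) for consecutive alerts."""
    expected = interval_length * retry_interval  # 60 s with the settings above
    stamps = sorted(
        datetime.strptime(m.group(1), "%m-%d-%Y %H:%M:%S")
        for m in map(HOST_ALERT.search, lines)
        if m and m.group(2) == host
    )
    gaps = []
    for earlier, later in zip(stamps, stamps[1:]):
        seconds = int((later - earlier).total_seconds())
        gaps.append((earlier, later, seconds, seconds < expected))
    return gaps


SAMPLE = [
    "[11-15-2010 23:56:09] HOST ALERT: wwwhost;UP;SOFT;3;WARNING - 10.3.1.11: rta 89.448ms, lost 40%",
    "[11-15-2010 23:55:39] HOST ALERT: wwwhost;DOWN;SOFT;2;CRITICAL - 10.3.1.11: rta 984.594ms, lost 80%",
    "[11-15-2010 23:55:21] HOST ALERT: wwwhost;DOWN;SOFT;1;CRITICAL - 10.3.1.11: rta 738.100ms, lost 80%",
    "[11-15-2010 23:55:09] SERVICE ALERT: wwwhost;CPU;WARNING;SOFT;1;No data was received from host!",
    "[11-15-2010 23:54:00] HOST FLAPPING ALERT: wwwhost;STARTED; Host appears to have started flapping (23.0% change 20.0% threshold)",
]

for earlier, later, seconds, too_fast in host_alert_gaps(SAMPLE, "wwwhost"):
    flag = "shorter than expected" if too_fast else "ok"
    print(later.strftime("%H:%M:%S"), seconds, flag)
```

The gap between two logged alerts also includes check execution time, and the first host check after a service problem is on-demand, so the sketch is only meaningful for consecutive alerts within one incident.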
Re: [Nagios-users] Nagios Core 3.2.3 host check retry interval
On 11/16/2010 09:59 PM, Chris Beattie wrote:
> I noticed something curious. It looks like Nagios 3.2.3 is making on-demand host checks faster than the retry_interval should allow. The interval_length is set to 60 and the retry_interval is set to 1. Nagios and the plugins were compiled from source on CentOS 5.5 x64.

Very curious indeed. The only thing I can see that might trigger something like this is the following patch:

http://git.op5.org/git/?p=nagios.git;a=commitdiff;h=1149d275011d7c4d8631b44dbba30ebdb4d7e83f

That one was in 3.2.2 too though. Could you try un-commenting the lines mentioned there and see if that helps? I won't revert that patch, but it would give me a pretty good idea of where to start the bug-hunt.

Thanks.

--
Andreas Ericsson andreas.erics...@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace.