Re: [Nagios-users] Nagios 3.0.6 process hangs, then recovers
It's not in a VM, and I haven't been able to catch it while it is actually happening yet. BTW, the system is running CentOS 4.7.

On Tue, Jul 28, 2009 at 8:52 AM, Brian A. Seklecki sekle...@noc.cfi.pgh.pa.us wrote:
> On Mon, 2009-07-27 at 17:33 -0500, Andrew Noonan wrote:
> > have something to do with things... perhaps a hang in that module
>
> Did you ktrace(8)/strace(8) it out, yet?
>
> You're not running in a VM are you?
>
> ~BAS

___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
[Nagios-users] Nagios 3.0.6 process hangs, then recovers
Hi all,

This is the second time this has happened to me... Nagios is working fine when suddenly it stops monitoring. Hours later, the process un-hangs and logs a message like:

[1248653042] Warning: A system time change of 0d 3h 39m 9s (forwards in time) has been detected. Compensating...

followed by Nagios complaining about orphaned checks and rescheduling them.

The last time I posted to the list, someone suggested that this was a time zone problem... that an actual system time change had occurred, but let me assure you, no such thing happened. Other processes' log files continued to log uninterrupted during this time, and there is no individual TZ setting for the nagios user. No NTP messages, etc. Plus, NTP problems aside, a TZ change would most likely produce a jump that is a multiple of an hour or half hour.

That being said, has anyone ever had these problems? I've had two of these in a month. The system was not loaded during this period, with plenty of memory and CPU to spare. I'm also running ndo2db, which I worry may have something to do with things... perhaps a hang in that module causes the process to spin? The system has about 1000 checks per 5 minutes, so it is not overwhelmingly busy.

This is the last major hurdle before I begin using Nagios in production, but it is a pretty big problem. Any advice would help.

Thanks,
Andrew
Re: [Nagios-users] Nagios lockup for about 8.5 hours
Sorry Kevin, I was out yesterday or I would have responded earlier.

I don't think that's the case. I forgot to mention it in the earlier email, but I checked the log files of a periodic cron job that also runs on the same server every 5 minutes, and its logs show uninterrupted timestamps. In addition, I also monitor NTP through Nagios (and graph with PNP), and up until the outage, the local skew was less than a second.

Thanks,
Andrew

On Tue, Jul 7, 2009 at 9:56 PM, Kevin Keane subscript...@kkeane.com wrote:
> It seems to me that for some reason your system clock has changed by about five hours. Did you change your system by any chance from local time (Eastern time, probably, based on the five-hour difference) to UTC?
>
> Or maybe your clock had drifted for a long time. When the clock skew becomes too great, NTP refuses to update the time (because there is no way to be sure that the time signal isn't the one that's incorrect). If you restart NTP, it will set your clock regardless of the clock skew.
>
> The following immediate check messages probably occurred because Nagios thought that these services hadn't been checked for five hours.
Re: [Nagios-users] Nagios lockup for about 8.5 hours
Not that I can determine. We run everything in Central time, and the nagios user is currently running in that TZ.

Actually, now that I think about it, it's a fairly moot point, as the log entries I listed don't carry formatted timestamps; they are epoch stamps. Since those show a ~5 hour jump, this could not be a TZ change: either Nagios paused, or the system clock itself was changed, since only that would affect the epoch values. Given the consistency of the 5-minute cron entries I mentioned earlier, I don't think a system clock change happened either.

On Thu, Jul 9, 2009 at 8:56 AM, Kevin Keane subscript...@kkeane.com wrote:
> Has the time zone that Nagios runs under changed, maybe? That would not affect the log files or NTP, since both usually run on UTC.
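The epoch argument above can be checked with a few lines of Python. This is only a sketch: it takes the warning line quoted in the thread, reads the epoch stamp as UTC, and subtracts the reported jump to estimate when Nagios stopped working. The regular expression and the function name `parse_time_change` are illustrative, and the "backwards" branch assumes Nagios words a backwards jump the same way as the forwards one.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern for the warning quoted in the thread. The "backwards" alternative
# is an assumption about how a backwards jump would be worded.
TIME_CHANGE = re.compile(
    r"\[(\d+)\] Warning: A system time change of "
    r"(\d+)d (\d+)h (\d+)m (\d+)s \((forwards|backwards) in time\)"
)


def parse_time_change(line):
    """Return (logged_at, jump) for a time-change warning, or None."""
    match = TIME_CHANGE.search(line)
    if match is None:
        return None
    epoch, days, hours, minutes, seconds = (int(g) for g in match.groups()[:5])
    jump = timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
    if match.group(6) == "backwards":
        jump = -jump
    # Epoch stamps are independent of any TZ setting, so read them as UTC.
    logged_at = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return logged_at, jump


line = ("[1246958195] Warning: A system time change of 0d 4h 56m 48s "
        "(forwards in time) has been detected. Compensating...")
logged_at, jump = parse_time_change(line)
stall_began = logged_at - jump

print("warning logged (UTC):", logged_at.isoformat())
print("stall began about (UTC):", stall_began.isoformat())
```

The estimated start time can then be compared against the cron job's log for the same window; if cron kept logging on schedule across it, a clock change is an unlikely explanation.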
[Nagios-users] Nagios lockup for about 8.5 hours
I've been testing out Nagios in general to replace our current system and I noticed a strange blank in my PNP graphs this morning. When I looked closer, I found that Nagios had basically hung for several hours. Then, the log shows a warning of:

[1246958195] Warning: A system time change of 0d 4h 56m 48s (forwards in time) has been detected. Compensating...

and then for several hours, messages like:

[1246958830] Warning: The check of host 'superhost1' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the host...

I'm running Nagios 3.0.6 with ndo2db. The system has under 1000 services, most of which are nrpe checks to remote hosts. The Nagios system was not terribly loaded at the time (about 50% idle) and mysql did not show any errors at the time. Typically, the number of buffers used is only 2-3 out of the 4096.

Any ideas as to what this could have been, or how I can detect this condition or log to gain more info? I wouldn't think that this is normal, but my Google searches aren't turning up a lot.

Thanks!
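On the question of how to detect this condition: one low-tech option is to scan the Nagios log for long silences between consecutive entries, since every entry starts with an epoch stamp in square brackets. The sketch below is not a Nagios feature; `find_gaps`, the 15-minute threshold, and the first sample line are all made up for illustration (the other two sample lines are the ones quoted above).

```python
import re

# Log entries begin with "[<epoch seconds>]".
STAMP = re.compile(r"^\[(\d+)\]")


def find_gaps(lines, threshold=900):
    """Yield (previous_epoch, next_epoch, gap_seconds) for each silence
    between consecutive stamped lines that is longer than threshold."""
    previous = None
    for line in lines:
        match = STAMP.match(line)
        if match is None:
            continue  # skip lines without an epoch stamp
        current = int(match.group(1))
        if previous is not None and current - previous > threshold:
            yield previous, current, current - previous
        previous = current


sample = [
    # Hypothetical entry standing in for the last line before the stall.
    "[1246940380] hypothetical last entry before the stall",
    "[1246958195] Warning: A system time change of 0d 4h 56m 48s "
    "(forwards in time) has been detected. Compensating...",
    "[1246958830] Warning: The check of host 'superhost1' looks like it "
    "was orphaned (results never came back).",
]

for before, after, gap in find_gaps(sample):
    print(f"no log entries for {gap} s between {before} and {after}")
```

In practice the same function could be fed an open log file instead of the sample list. Run from cron, it would only report a stall after the fact, so it complements rather than replaces attaching strace to the hung process.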
Re: [Nagios-users] check_period help?
No one has any ideas on this, or am I just not posting the right info?

On Mon, Jun 29, 2009 at 12:19 PM, Andrew Noonan anoo...@gmail.com wrote:
> Hi all,
>
> I've got a service that I'm trying to monitor with different thresholds at different times of day. To do this, I created two timeperiods, covering 1:00am to 7:00am, and 7:00am to 1:00am (I think), two service templates that each use these periods, and two services that do the same check with differing thresholds, each inheriting the different service template. But when I look at the scheduling for these two services, it's almost opposite what I think it should be. The 'late' service is next scheduled at 00:00 and the 'normal' service is scheduled at 1:00am. I'm running 3.0.6. The other templates used do not change the check_period, except for the generic-service template.
Re: [Nagios-users] check_period help?
Ah, yes. That would be the danger of using a GUI, I suppose. You're absolutely correct, I clicked on the wrong one when setting these up. Confusing everyone else was just a fun side-effect.

That being said, I went ahead and changed those so they match up, but I'm still seeing strange behavior. The index service is next scheduled at 1:00am, and the lateindex service is next scheduled at 8:45am (it's 8:40am now here). I would have expected those services to be rescheduled to basically the exact opposite: lateindex should be next scheduled at 1:00am, and index should be the one currently running. And yes, I did check to make sure that the index service is using the correct service template and vice versa :)

Do I need to do something to reset the scheduler other than a reload?

Andrew

> This looks decidedly odd. Time-7a_to_1a uses the check_period 1a-7a_every_day. Either you were very confused when you named the timeperiods or you were very confused when you created these templates. Or you just want to confuse everyone else ;-)
>
> I'm guessing they should be reversed, no?
>
> --
> Andreas Ericsson andreas.erics...@op5.se
> OP5 AB www.op5.se
[Nagios-users] check_period help?
Hi all,

I've got a service that I'm trying to monitor with different thresholds at different times of day. To do this, I created two timeperiods, covering 1:00am to 7:00am, and 7:00am to 1:00am (I think), two service templates that each use these periods, and two services that do the same check with differing thresholds, each inheriting the different service template. But when I look at the scheduling for these two services, it's almost opposite what I think it should be. The 'late' service is next scheduled at 00:00 and the 'normal' service is scheduled at 1:00am. I'm running 3.0.6. The other templates used do not change the check_period, except for the generic-service template.

Here are the definitions:

# Time periods

define timeperiod {
    timeperiod_name     7a-1a_every_day
    alias               from 7:00am to 1:00am
    tuesday             00:00-00:59,7:00-24:00
    wednesday           00:00-00:59,7:00-24:00
    sunday              00:00-00:59,7:00-24:00
    thursday            00:00-00:59,7:00-24:00
    saturday            00:00-00:59,7:00-24:00
    monday              00:00-00:59,7:00-24:00
    friday              00:00-00:59,7:00-24:00
}

define timeperiod {
    timeperiod_name     1a-7a_every_day
    alias               1:00am to 7:00am
    tuesday             01:00-06:59
    thursday            01:00-06:59
    sunday              01:00-06:59
    saturday            01:00-06:59
    monday              01:00-06:59
    friday              01:00-06:59
    wednesday           01:00-06:59
}

# Service Templates

define service {
    name                Time-1a_to_7a
    check_period        7a-1a_every_day
    register            0
}

define service {
    name                Time-7a_to_1a
    check_period        1a-7a_every_day
    register            0
}

# Services

define service {
    host_name           US
    service_description index
    use                 Freq-15-min-check,graphing_service_pnp,Time-7a_to_1a,generic-service
    check_command       Check command here
    register            1
}

define service {
    host_name           US
    service_description lateindex
    use                 Freq-15-min-check,graphing_service_pnp,Time-1a_to_7a,generic-service
    check_command       Check command here
    register            1
}

Thanks,
Andrew
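A side note on the two timeperiods: whether they tile the whole day depends on how the range end is treated. The Python sketch below assumes each `HH:MM-HH:MM` range is a half-open interval at minute resolution (start included, end excluded). That is an assumption for illustration, not something confirmed in this thread, so the real boundary handling should be checked against the Nagios documentation. The helper names are made up.

```python
def to_minutes(hhmm):
    """Convert 'H:MM' or 'HH:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def parse_ranges(spec):
    """Turn '00:00-00:59,7:00-24:00' into [(0, 59), (420, 1440)]."""
    ranges = []
    for part in spec.split(","):
        start, end = part.split("-")
        ranges.append((to_minutes(start), to_minutes(end)))
    return ranges


def uncovered(specs):
    """Return the minutes of the day (0..1439) that fall in none of the
    ranges, treating every range as half-open: start <= minute < end."""
    covered = set()
    for spec in specs:
        for start, end in parse_ranges(spec):
            covered.update(range(start, end))
    return sorted(set(range(24 * 60)) - covered)


day_spec = "00:00-00:59,7:00-24:00"   # from 7a-1a_every_day
night_spec = "01:00-06:59"            # from 1a-7a_every_day

gaps = uncovered([day_spec, night_spec])
print([f"{m // 60:02d}:{m % 60:02d}" for m in gaps])
```

Under that assumption, the minutes starting at 00:59 and 06:59 belong to neither period. Writing the ranges as 00:00-01:00,07:00-24:00 and 01:00-07:00 would close those gaps without creating any overlap.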