Re: [Nagios-users] Nagios 3.0.6 process hangs, then recovers

2009-07-28 Thread Andrew Noonan
It's not in a VM, and I haven't been able to catch it while it is
actually happening yet.  BTW, the system is running CentOS 4.7.

On Tue, Jul 28, 2009 at 8:52 AM, Brian A. Seklecki <sekle...@noc.cfi.pgh.pa.us> wrote:
 On Mon, 2009-07-27 at 17:33 -0500, Andrew Noonan wrote:
 have something to do with things... perhaps a hang in that module

 Did you ktrace(8)/strace(8) it out, yet?  You're not running in a VM are
 you?  ~BAS
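
Since the hang has not been caught live yet, one option is a small watchdog that notices when Nagios goes quiet and attaches strace at that moment. The following is only a sketch in Python. It assumes the modification time of the Nagios log is a usable heartbeat and that the lock file holds the daemon's PID. The paths, the 600-second threshold, and the helper names are hypothetical and do not come from the thread.

```python
import os
import subprocess
import time

# Assumed locations (hypothetical); adjust for the actual install.
LOG_PATH = "/usr/local/nagios/var/nagios.log"
LOCK_PATH = "/usr/local/nagios/var/nagios.lock"


def is_stale(last_write, now, max_age=600):
    """True when the heartbeat file has not been written for more than max_age seconds."""
    return (now - last_write) > max_age


def strace_command(pid, out_path):
    """Build an strace invocation: follow forks, timestamp each call, attach to pid."""
    return ["strace", "-f", "-tt", "-p", str(pid), "-o", out_path]


def check_once():
    """Meant to be run from cron: if the log has gone quiet, start tracing the daemon."""
    if is_stale(os.path.getmtime(LOG_PATH), time.time()):
        with open(LOCK_PATH) as fh:
            pid = int(fh.read().strip())
        out = "/tmp/nagios-strace-%d.txt" % int(time.time())
        subprocess.Popen(strace_command(pid, out))
```

A quiet log alone can be a false positive on a calm night, and repeated cron runs would attach more than one tracer, so a real version would probably need a guard file or a second signal.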




--
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day 
trial. Simplify your report design, integration and deployment - and focus on 
what you do best, core application coding. Discover what's new with 
Crystal Reports now.  http://p.sf.net/sfu/bobj-july
___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting 
any issue. 
::: Messages without supporting info will risk being sent to /dev/null


[Nagios-users] Nagios 3.0.6 process hangs, then recovers

2009-07-27 Thread Andrew Noonan
Hi all,

This is the second time this has happened to me... Nagios is working
fine when suddenly it stops monitoring.  Hours later, the process
un-hangs and a message like:

[1248653042] Warning: A system time change of 0d 3h 39m 9s (forwards
in time) has been detected.  Compensating...

happens, followed by nagios complaining about orphaned checks and a
rescheduling.  The last time I posted to the list, someone
suggested that this was a time zone problem... that an actual system
time change had occurred, but let me assure you, no such thing
happened.  Other process log files continued to log uninterrupted
during this time, and there is no individual TZ setting for the nagios
user.  No NTP messages, etc.  Plus, other than NTP problems, a TZ
change would likely be a multiple of an hour or half hour.  That being
said, has anyone ever had these problems?  I've had two of these in a
month.  The system was not loaded during this period, with plenty of
memory and CPU to spare.  I'm also running ndo2db, which I worry may
have something to do with things... perhaps a hang in that module
causes the process to spin?  The system has about 1000 checks per 5
minutes, so not overwhelmingly busy.

This is the last major hurdle before I begin using Nagios in
production, but this is a pretty big problem.  Any advice would help.

Thanks,
Andrew
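
The point about time zone changes being a multiple of an hour or half hour can be checked with a little arithmetic. This sketch is in Python, and the function names are made up for illustration. It converts the reported jump of 0d 3h 39m 9s to seconds and tests whether it falls on a 30-minute boundary.

```python
def jump_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert the d/h/m/s figures from the Nagios warning into seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def looks_like_tz_shift(total_seconds, step=1800):
    """A time zone change would normally be a whole multiple of 30 minutes."""
    return total_seconds % step == 0


second_incident = jump_seconds(hours=3, minutes=39, seconds=9)
print(second_incident, looks_like_tz_shift(second_incident))
```

A jump that is not on a half-hour boundary does not rule out every clock problem, but it makes a time zone switch an unlikely explanation.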



Re: [Nagios-users] Nagios lockup for about 8.5 hours

2009-07-09 Thread Andrew Noonan
Sorry Kevin, I was out yesterday or I would have responded earlier.  I
don't think that's the case.  I forgot to mention it in the earlier
email, but I checked the log files of a periodic cron job that also
runs on the same server every 5 minutes, and its logs show an
uninterrupted timestamp.  In addition, I also monitor NTP through
nagios (and graph with PNP), and up until the outage, the local skew
was less than a second.

Thanks,
Andrew

On Tue, Jul 7, 2009 at 9:56 PM, Kevin Keane <subscript...@kkeane.com> wrote:
 It seems to me that for some reason your system clock has changed by
 about five hours. Did you change your system by any chance from local
 time (Eastern time, probably, based on the five-hour difference) to UTC?
 Or maybe your clock had drifted for a long time. When the clock skew
 becomes too great, NTP refuses to update the time (because there is no
 way to be sure that the time signal isn't the one that's incorrect). If
 you restart NTP, it will set your clock regardless of the clock skew.

 The following immediate check messages probably occurred because
 Nagios thought that these services hadn't been checked for five hours.

 Andrew Noonan wrote:
 I've been testing out Nagios in general to replace our current system
 and I noticed a strange blank in my PNP graphs this morning.  When I
 looked closer, I found that nagios had basically hung for several
 hours.  Then, the log shows a warning of:

 [1246958195] Warning: A system time change of 0d 4h 56m 48s (forwards
 in time) has been detected.  Compensating...

 and then for several hours, messages like:

 [1246958830] Warning: The check of host 'superhost1' looks like it was
 orphaned (results never came back).  I'm scheduling an immediate check
 of the host...

 I'm running nagios 3.0.6 with ndo2db.  The system has under 1000
 services, most of which are nrpe checks to remote hosts.

 The nagios system was not terribly loaded at the time (about 50% idle)
 and mysql did not show any errors at the time.  Typically, the number
 of buffers used is only 2-3 out of the 4096.

 Any ideas as to what this could have been, or how I can detect this
 condition or log to gain more info?  I wouldn't think that this is
 normal, but my Google searches aren't turning up a lot.

 Thanks!


 --
 Kevin Keane
 Owner
 The NetTech
 Find the Uncommon: Expert Solutions for a Network You Never Have to Think 
 About

 Office: 866-642-7116
 http://www.4nettech.com


Re: [Nagios-users] Nagios lockup for about 8.5 hours

2009-07-09 Thread Andrew Noonan
Not that I can determine.  We run everything in Central time, and the
nagios user is currently running in that TZ.  Actually, now that I
think about it, it's a fairly moot point, as the logs that I listed
don't have a timestamp, they are epoch stamps.  Since those show a ~5
hour jump, this could not be a TZ change, but either a pause with
Nagios, or the system clock would have to be changed, as that would
affect the epoch values.  Given the consistency of the 5-minute cron
entries I mentioned earlier, I don't think a system clock change
happened either.
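
To illustrate why epoch stamps are independent of the time zone, here is a small Python sketch. The fixed UTC-5 offset standing in for Central daylight time is an assumption made to keep the example self-contained. The variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Epoch stamp and jump size taken from the warning in the Nagios log.
RESUME_EPOCH = 1246958195
JUMP = timedelta(hours=4, minutes=56, seconds=48)

# When the silence would have started if the process simply paused.
stall_start_epoch = RESUME_EPOCH - int(JUMP.total_seconds())

utc = timezone.utc
central = timezone(timedelta(hours=-5), "CDT")  # assumption: fixed offset

as_utc = datetime.fromtimestamp(RESUME_EPOCH, tz=utc)
as_central = datetime.fromtimestamp(RESUME_EPOCH, tz=central)

# The wall-clock rendering differs, but both describe the same instant,
# so a TZ setting alone cannot move the epoch value written to the log.
print(as_utc.isoformat(), as_central.isoformat())
print(as_utc == as_central, int(as_central.timestamp()) == RESUME_EPOCH)
print(stall_start_epoch)
```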

On Thu, Jul 9, 2009 at 8:56 AM, Kevin Keane <subscript...@kkeane.com> wrote:
 Has the time zone that Nagios runs under changed, maybe? That would not
 affect the log files or NTP, since both usually run on UTC.


[Nagios-users] Nagios lockup for about 8.5 hours

2009-07-07 Thread Andrew Noonan
I've been testing out Nagios in general to replace our current system
and I noticed a strange blank in my PNP graphs this morning.  When I
looked closer, I found that nagios had basically hung for several
hours.  Then, the log shows a warning of:

[1246958195] Warning: A system time change of 0d 4h 56m 48s (forwards
in time) has been detected.  Compensating...

and then for several hours, messages like:

[1246958830] Warning: The check of host 'superhost1' looks like it was
orphaned (results never came back).  I'm scheduling an immediate check
of the host...

I'm running nagios 3.0.6 with ndo2db.  The system has under 1000
services, most of which are nrpe checks to remote hosts.

The nagios system was not terribly loaded at the time (about 50% idle)
and mysql did not show any errors at the time.  Typically, the number
of buffers used is only 2-3 out of the 4096.

Any ideas as to what this could have been, or how I can detect this
condition or log to gain more info?  I wouldn't think that this is
normal, but my Google searches aren't turning up a lot.

Thanks!
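
On the question of how to detect this condition: one rough approach is to scan the Nagios log for long silences between epoch-stamped lines. This is a Python sketch only. The 600-second threshold and the first sample line are hypothetical, and the second sample line is the warning quoted above.

```python
import re

# Nagios log lines start with "[<epoch>] "; wrapped continuation lines do not.
LINE_RE = re.compile(r"^\[(\d+)\]\s*(.*)$")


def find_gaps(lines, max_gap=600):
    """Return (previous_epoch, next_epoch, gap_seconds) for each silence above max_gap."""
    gaps = []
    prev = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines without an epoch stamp
        stamp = int(match.group(1))
        if prev is not None and stamp - prev > max_gap:
            gaps.append((prev, stamp, stamp - prev))
        prev = stamp
    return gaps


sample = [
    "[1246940387] (hypothetical last entry before the hang)",
    "[1246958195] Warning: A system time change of 0d 4h 56m 48s (forwards",
    "in time) has been detected.  Compensating...",
]
print(find_gaps(sample))
```

Run against the real log file, for example with `find_gaps(open(path))`, this would only show that a silence occurred, not why. A sensible threshold depends on how chatty the log normally is.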



Re: [Nagios-users] check_period help?

2009-06-30 Thread Andrew Noonan
No one has any ideas on this, or am I just not posting the right info?

On Mon, Jun 29, 2009 at 12:19 PM, Andrew Noonan <anoo...@gmail.com> wrote:
 Hi all,

 I've got a service that I'm trying to monitor with different
 thresholds at different times of day.  To do this, I created two
 timeperiods, covering 1:00am to 7:00am, and 7:00am to 1:00am (I
 think), two service templates that each use these periods, and two
 services that do the same check with differing thresholds, each
 inheriting the different service template.  But when I look at the
 scheduling for these two services, it's almost opposite what I think
 it should be.  The 'late' service is next scheduled at 00:00 and the
 'normal' service is scheduled at 1:00am.  I'm running 3.0.6.  The
 other templates used do not change the check_period, except for the
 generic-service template.  Here are the definitions:

  Time periods #
 define timeperiod {
        timeperiod_name                         7a-1a_every_day
        alias                                   from 7:00am to 1:00am
        tuesday                                 00:00-00:59,7:00-24:00
        wednesday                               00:00-00:59,7:00-24:00
        sunday                                  00:00-00:59,7:00-24:00
        thursday                                00:00-00:59,7:00-24:00
        saturday                                00:00-00:59,7:00-24:00
        monday                                  00:00-00:59,7:00-24:00
        friday                                  00:00-00:59,7:00-24:00
        }

 define timeperiod {
        timeperiod_name                         1a-7a_every_day
        alias                                   1:00am to 7:00am
        tuesday                                 01:00-06:59
        thursday                                01:00-06:59
        sunday                                  01:00-06:59
        saturday                                01:00-06:59
        monday                                  01:00-06:59
        friday                                  01:00-06:59
        wednesday                               01:00-06:59
        }


 ### Service Templates ##
 define service {
       name                                     Time-1a_to_7a
       check_period                             7a-1a_every_day
       register                                 0

 }

 define service {
       name                                     Time-7a_to_1a
       check_period                             1a-7a_every_day
       register                                 0

 }

 ### Services #

 define service {
        host_name                       US
        service_description             index
         use                             Freq-15-min-check,graphing_service_pnp,Time-7a_to_1a,generic-service
        check_command                   Check command here
        register                        1
        }

 define service {
        host_name                       US
        service_description             lateindex
         use                             Freq-15-min-check,graphing_service_pnp,Time-1a_to_7a,generic-service
        check_command                   Check command here
        register                        1
        }

 Thanks,
 Andrew




Re: [Nagios-users] check_period help?

2009-06-30 Thread Andrew Noonan
Ah, yes.  That would be the danger of using a GUI, I suppose.  You're
absolutely correct, I clicked on the wrong one when setting these up.
Confusing everyone else was just a fun side-effect.  That being said,
I went ahead and changed those so they match up, but I'm still seeing
strange behavior.  The index service is next scheduled at 1:00am, and
the lateindex service is next scheduled at 8:45am (it's 8:40am now
here).  I would have expected those services to be properly
rescheduled to basically the exact opposite.  The lateindex should be
1:00am and the index service should be the currently running service.
And yes, I did check to make sure that the index service is running
the correct service template and vice versa :)  Do I need to do
something to reset the scheduler other than a reload?

Andrew



 This looks decidedly odd. Time-7a_to_1a uses the check_period
 1a-7a_every_day. Either you were very confused when you named
 the timeperiods or you were very confused when you created these
 templates. Or you just want to confuse everyone else ;-)

 I'm guessing they should be reversed, no?
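
The crossed names can be traced with a small model of the posted definitions. This is a Python sketch, not Nagios code. It assumes every weekday has the same ranges (as in the post) and treats each range as running from its start up to, but not including, its end. The function names are hypothetical.

```python
# Time ranges and template mappings copied from the original post.
PERIODS = {
    "7a-1a_every_day": "00:00-00:59,7:00-24:00",
    "1a-7a_every_day": "01:00-06:59",
}
TEMPLATES = {  # template name -> check_period, crossed exactly as posted
    "Time-1a_to_7a": "7a-1a_every_day",
    "Time-7a_to_1a": "1a-7a_every_day",
}
SERVICES = {"index": "Time-7a_to_1a", "lateindex": "Time-1a_to_7a"}


def to_minutes(hhmm):
    """'7:00' -> 420 minutes after midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def next_valid_minute(spec, now):
    """First minute of the day at or after 'now' inside a range, else tomorrow's first start."""
    ranges = sorted(tuple(to_minutes(p) for p in r.split("-")) for r in spec.split(","))
    for start, end in ranges:
        if now < end:
            return max(now, start)
    return ranges[0][0]


def next_check(service, now):
    """Resolve service -> template -> timeperiod and format the next valid time."""
    spec = PERIODS[TEMPLATES[SERVICES[service]]]
    return "%02d:%02d" % divmod(next_valid_minute(spec, now), 60)


now = to_minutes("8:40")
print("index:", next_check("index", now))
print("lateindex:", next_check("lateindex", now))
```

Under these assumptions the crossed mapping sends 'index' to 1:00am and lets 'lateindex' run right away in the morning. If the schedule still looks like that after swapping the templates, it may be worth confirming that the running configuration actually picked up the change, though that is only a guess.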

 --
 Andreas Ericsson                   andreas.erics...@op5.se
 OP5 AB                             www.op5.se
 Tel: +46 8-230225                  Fax: +46 8-230231

 Considering the successes of the wars on alcohol, poverty, drugs and
 terror, I think we should give some serious thought to declaring war
 on peace.




[Nagios-users] check_period help?

2009-06-29 Thread Andrew Noonan
Hi all,

I've got a service that I'm trying to monitor with different
thresholds at different times of day.  To do this, I created two
timeperiods, covering 1:00am to 7:00am, and 7:00am to 1:00am (I
think), two service templates that each use these periods, and two
services that do the same check with differing thresholds, each
inheriting the different service template.  But when I look at the
scheduling for these two services, it's almost opposite what I think
it should be.  The 'late' service is next scheduled at 00:00 and the
'normal' service is scheduled at 1:00am.  I'm running 3.0.6.  The
other templates used do not change the check_period, except for the
generic-service template.  Here are the definitions:

 Time periods #
define timeperiod {
        timeperiod_name                         7a-1a_every_day
        alias                                   from 7:00am to 1:00am
        tuesday                                 00:00-00:59,7:00-24:00
        wednesday                               00:00-00:59,7:00-24:00
        sunday                                  00:00-00:59,7:00-24:00
        thursday                                00:00-00:59,7:00-24:00
        saturday                                00:00-00:59,7:00-24:00
        monday                                  00:00-00:59,7:00-24:00
        friday                                  00:00-00:59,7:00-24:00
        }

define timeperiod {
        timeperiod_name                         1a-7a_every_day
        alias                                   1:00am to 7:00am
        tuesday                                 01:00-06:59
        thursday                                01:00-06:59
        sunday                                  01:00-06:59
        saturday                                01:00-06:59
        monday                                  01:00-06:59
        friday                                  01:00-06:59
        wednesday                               01:00-06:59
        }


### Service Templates ##
define service {
        name                                    Time-1a_to_7a
        check_period                            7a-1a_every_day
        register                                0
        }

define service {
        name                                    Time-7a_to_1a
        check_period                            1a-7a_every_day
        register                                0
        }

### Services #

define service {
        host_name                               US
        service_description                     index
        use                                     Freq-15-min-check,graphing_service_pnp,Time-7a_to_1a,generic-service
        check_command                           Check command here
        register                                1
        }

define service {
        host_name                               US
        service_description                     lateindex
        use                                     Freq-15-min-check,graphing_service_pnp,Time-1a_to_7a,generic-service
        check_command                           Check command here
        register                                1
        }

Thanks,
Andrew
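
Since generic-service also sets check_period, the order of the names on the `use` line matters. As far as I know, Nagios 3 gives precedence to the first listed template that defines a directive. The Python sketch below models that rule for one level of inheritance. The contents of Freq-15-min-check and graphing_service_pnp, and the "24x7" value for generic-service, are hypothetical because the post does not show them.

```python
# Template contents: only the Time-7a_to_1a entry comes from the post.
TEMPLATES = {
    "Freq-15-min-check": {"normal_check_interval": "15"},   # hypothetical
    "graphing_service_pnp": {},                             # hypothetical
    "Time-7a_to_1a": {"check_period": "1a-7a_every_day"},   # as posted
    "generic-service": {"check_period": "24x7"},            # hypothetical value
}


def first_definition(attr, use_line, templates):
    """Walk the 'use' list left to right and return (value, template) for the first hit."""
    for name in use_line.split(","):
        value = templates.get(name.strip(), {}).get(attr)
        if value is not None:
            return value, name.strip()
    return None, None


use_line = "Freq-15-min-check,graphing_service_pnp,Time-7a_to_1a,generic-service"
print(first_definition("check_period", use_line, TEMPLATES))
```

With this ordering the Time-* template is found before generic-service, so its check_period would be the one applied. Listing generic-service first would hide it.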
