Hello,

Since upgrading to Nagios 3.0.3 (installed from ports on FreeBSD 6.2-RELEASE) I have been seeing the following problem. The installation monitors approx. 300 services on 150 hosts. The problem is that random services sometimes get stuck in a warning/critical state. I've checked the scheduling queue in the CGI and it looks right, but as far as I can see it lists only hosts, not services.

I should probably also mention that the Nagios host sometimes has quite a high load (up to 2.0 on a uniprocessor machine) as a result of the monitoring scripts. In addition, the monitoring host has a large, constant clock skew that I can't get rid of: the clock runs fast by ca. 5 s every 2 minutes, and this is corrected by ntpdate every 2 minutes.
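To spot such stuck services without clicking through the CGI, something like the following could be used. It is only a minimal sketch: it assumes the status file uses the Nagios 3 layout of `servicestatus { key=value }` blocks with `next_check` stored as a Unix timestamp. The function names, the 600-second threshold, and passing the status file path on the command line are my own illustration, not something Nagios ships.

```python
import sys
import time


def parse_status_blocks(text, block_type="servicestatus"):
    """Return one dict per `block_type { key=value ... }` block in text."""
    blocks = []
    current = None  # dict being filled while inside a matching block
    for raw in text.splitlines():
        line = raw.strip()
        if current is None:
            # Only open a block of the requested type; other blocks are skipped.
            if line == block_type + " {":
                current = {}
        elif line == "}":
            blocks.append(current)
            current = None
        elif "=" in line:
            # Split on the first '=' only, since values may contain '='.
            key, _, value = line.partition("=")
            current[key] = value
    return blocks


def find_stale_services(text, now, max_overdue=600):
    """List (host, service, seconds_overdue) for overdue active checks.

    max_overdue is an assumed threshold: a check counts as stale when its
    next_check lies more than that many seconds in the past.
    """
    stale = []
    for svc in parse_status_blocks(text):
        if svc.get("active_checks_enabled", "1") != "1":
            continue
        next_check = int(svc.get("next_check", "0"))
        if next_check and now - next_check > max_overdue:
            stale.append(
                (svc.get("host_name"), svc.get("service_description"),
                 now - next_check)
            )
    return stale


if __name__ == "__main__":
    # The status file location depends on the installation, so it is
    # taken from the command line instead of being hard-coded.
    with open(sys.argv[1]) as handle:
        contents = handle.read()
    for host, service, overdue in find_stale_services(contents, int(time.time())):
        print("%s / %s: next check overdue by %d s" % (host, service, overdue))
```

Run from cron against the status file, this would at least show how many services are affected and when they stopped being scheduled, which may help to correlate the problem with load peaks or clock corrections.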
- example host status:

    Host Status: UP (for 1d 9h 35m 23s)
    Status Information: PING OK - Packet loss = 0%, RTA = 15.33 ms
    Performance Data: rta=15.333000ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0
    Current Attempt: 1/5 (HARD state)
    Last Check Time: 02-10-2008 10:32:44
    Check Type: ACTIVE
    Check Latency / Duration: 0.732 / 0.152 seconds
    Next Scheduled Active Check: 02-10-2008 10:37:54
    Last State Change: 01-10-2008 01:02:43
    Last Notification: N/A (notification 0)
    Is This Host Flapping? N/A
    In Scheduled Downtime? NO
    Last Update: 02-10-2008 10:37:58 ( 0d 0h 0m 8s ago)

- service status on the host above:

    Current Status: CRITICAL (for 2d 2h 55m 9s)
    Status Information: PING CRITICAL - Packet loss = 0%, RTA = 313.92 ms
    Performance Data:
    Current Attempt: 3/5 (SOFT state)
    Last Check Time: 30-09-2008 07:44:36
    Check Type: ACTIVE
    Check Latency / Duration: 0.116 / 4.657 seconds
    Next Scheduled Check: 30-09-2008 07:45:36
    Last State Change: 30-09-2008 07:44:36
    Last Notification: N/A (notification 0)
    Is This Service Flapping? N/A
    In Scheduled Downtime? NO
    Last Update: 02-10-2008 10:39:43 ( 0d 0h 0m 2s ago)

Notice the Last Check Time: the service status is two days old. The problem can be resolved by restarting Nagios, or by using "Re-schedule the next check of this service" in the CGI. The parent host is up, and the service has no parent.

Is there any configuration directive that may cause a service check to be dropped by the scheduler?

- configuration related to host:

    define host {
        register                        0
        name                            generic-host
        check_command                   check-host-alive
        notification_period             24x7
        notification_options            d,u,r
        max_check_attempts              5
        notification_interval           240
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
    }

    define host {
        register                        0
        use                             generic-host
        name                            generic-device-ext
        contact_groups                  noc,tech
    }

    define host {
        use                             generic-device-ext
        host_name                       ...
        alias                           ...
        address                         ...
        parents                         ...
    }

- service:

    define service {
        name                            generic-service
        active_checks_enabled           1
        passive_checks_enabled          0
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              5
        normal_check_interval           5
        retry_check_interval            1
        notification_interval           120
        notification_period             24x7
        notification_options            u,w,c,r
        register                        0
    }

    define service {
        use                             generic-service
        name                            nrpe
        max_check_attempts              5
        normal_check_interval           5
        retry_check_interval            1
        register                        0
    }

    define service {
        use                             nrpe
        name                            nrpe-ping
        service_description             PING-nrpe
        check_command                   check_nrpe_ping!...
        contact_groups                  noc-prio
        host_name                       ...
    }

Thanks in advance for any clues.

Best regards,
Bartłomiej Korupczyński