Re: [Nagios-users] Host check doesn't wait in-between checks.

Jan Kratochvil Tue, 06 Jun 2006 12:01:47 -0700

Hi,

this comes up here again and again. Some working solutions are attached.



Regards,
Lace


On Tue, 06 Jun 2006 20:37:04 +0200, Josh Konkol wrote:
> Kyle Tucker <kylet <at> panix.com> writes:
> 
> > 
> > > 
> > > What is the point of setting a Host max_check_attempts if Nagios
> > > isn't going to wait in between checking?
> > 
> > I too wish I had more time between host checks and you've prompted me
> > to try something in my check_alive script. Would this snippet do the 
> > trick?
> > 
> > TIMEGAP=10  # number of seconds before host checks
> > 
> > if [ $HOSTSTATE$ = "DOWN" -a $HOSTSTATETYPE$ = "SOFT" ]
> > then
> >     sleep $TIMEGAP
> > fi
> > 
> > exit 2
> > 
> > I would have used ...
> > 
> > > if [ $HOSTSTATE$ = "DOWN" -a $HOSTATTEMPT$ -lt $MAXHOSTATTEMPT$ ]
> > then
> >     sleep $TIMEGAP
> > fi
> > 
> > ... but oddly there seems to be no $MAXHOSTATTEMPT$ macro.
> > 
> 
> 
> OK so the simple solution would be to change my check-host-command 
> to include a sleep 10 at the end.  i.e.:
> 
> $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 3 && sleep 10
> 
> You think that would work?
> 
> Josh
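
One caveat with the "&& sleep 10" form: the shell runs the sleep only when
check_ping succeeds, so it would delay healthy checks rather than failed ones.
A minimal sketch of a wrapper that pauses only after a failure and still
returns the plugin's own exit status might look like the following. The names
run_with_gap and TIMEGAP are illustrative and not part of Nagios, and any sleep
inside a host check still blocks Nagios for that long.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Nagios): run a check command, pause only
# when it fails, and hand the command's own exit status back to the caller.
TIMEGAP=${TIMEGAP:-10}   # seconds to wait after a failed check (assumed default)

run_with_gap() {
    "$@"                  # run the real check with its arguments
    rc=$?
    if [ "$rc" -ne 0 ]; then
        sleep "$TIMEGAP"  # delay only after a failure
    fi
    return "$rc"          # preserve the plugin's exit status
}

# Example call (path and address are placeholders):
# run_with_gap /path/to/check_ping -H 192.0.2.1 -w 3000.0,80% -c 5000.0,100% -p 3
```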

--- Begin Message ---

Hi,

here is a script that uses the service "Connectivity" to detect host DOWN/UP
states.

I had many "false positive" host-down alerts. This has been reported here many
times, but usually no real cause was found, for example (just guessing):
        http://sourceforge.net/mailarchive/message.php?msg_id=8739666
        http://sourceforge.net/mailarchive/message.php?msg_id=1758384

Using the attached debugging script "check-host-alive-debug" I found that
occasionally the host really is unreachable, but only for periods of under
1 minute. This is normal given Internet routing, and being alerted for it
annoys me.

Unfortunately Nagios suspends its other operations and checks that single host
host->max_check_attempts times in a row, without any delay between attempts,
to determine whether it is alive.

I tried delaying all checks (and later just the failed ones), but either the
sensitivity was still too high (short failures were still reported) or, when
several hosts really were down for a long time, the total time Nagios spent
blocked shut out service checking completely.

The attached script delegates host-alive checking to the standard Nagios
service checking. The service is checked, with retries, over some time:
define service {
        service_description     Connectivity
        max_check_attempts      20
        normal_check_interval   5
        retry_check_interval    5
        ...
}
Only once the service check has determined that the SERVICE is really down
is the HOST immediately declared DOWN:
define host {
        check_command           check-host-alive
        max_check_attempts      1
        ...
}
All the services have explicit dependencies defined such as:
define servicedependency {
        host_name                       SAME-HOSTNAME
        dependent_host_name             SAME-HOSTNAME
        service_description             Connectivity
        dependent_service_description   Total Processes
        execution_failure_criteria      u,c,p
        notification_failure_criteria   u,c,p
}

You can also use the service "SSH" (tried second) instead of "Connectivity".
The script has some trivial hardcoded pathnames; check them yourself.
The real solution is to fix the Nagios scheduling, but this way was easier for me.

Development paid for courtesy of JK Labs s.r.o.


Regards,
Jan Kratochvil

#! /usr/bin/perl
use strict;
use warnings;

my %dat;
my %cache;

sub fetch
{
        my $open;

        local *DAT;
        do { open DAT,$_ or die "Open \"$_\": $!"; } for $ENV{"HOME"}."/nagios/var/log/nagios/status.dat";
        local $_;
        while (<DAT>) {
                next if /^\s*#/;
                next if /^\s*$/;
                if (/^(\w+)\s+{\s*$/) {
                        die "Already open: $open" if $open;
                        $open=$1;
                        next;
                        }
                if (/^\s*}\s*$/) {
                        die "Nothing open" if !$open;
                        $open=undef();
                        next;
                        }
                if (/^\s*(\S+)\s*=\s*(.*?)\s*$/) {
                        my($left,$right)=($1,$2);
                        die "Nothing open" if !$open;
                        if ($open eq "host" || $open eq "service") {
                                $open.="::$right" if $left eq "host_name";
                                }
                        if ($open=~/^service::[^:]+$/) {
                                $open.="::$right" if $left eq "service_description";
                                }
                        next if $open=~/^service::[^:]+$/;
                        die "Redefined: ${open}::$left" if exists $dat{$open}{$left};
                        $dat{$open}{$left}=$right;
                        next;
                        }
                die "Unknown line: $_";
                }
        close DAT or die "Close: $!";
        die "Stale open" if $open;

        local *CACHE;
        do { open CACHE,$_ or die "Open \"$_\": $!"; } for $ENV{"HOME"}."/nagios/var/log/nagios/objects.cache";
        local $_;
        while (<CACHE>) {
                next if /^\s*#/;
                next if /^\s*$/;
                if (/^define\s+(\w+)\s+{\s*$/) {
                        die "Already open: $open" if $open;
                        $open=$1;
                        next;
                        }
                if (/^\s*}\s*$/) {
                        die "Nothing open" if !$open;
                        $open=undef();
                        next;
                        }
                if (/^\s*(\w+)\t(\S.*?)\s*$/) {
                        my($left,$right)=($1,$2);
                        die "Nothing open" if !$open;
                        next if $open!~/^host\b/ && $open!~/^service\b/;
                        if ($open eq "host" || $open eq "service") {
                                $open.="::$right" if $left eq "host_name";
                                }
                        if ($open eq "service") {
                                $open.="::$right" if $left eq "service_description";
                                }
                        next if $open=~/^service::[^:]+$/;
                        die "Redefined: ${open}::$left" if exists $cache{$open}{$left};
                        $cache{$open}{$left}=$right;
                        next;
                        }
                die "Unknown line: $_";
                }
        close CACHE or die "Close: $!";
        die "Stale open" if $open;
}

fetch();

#use Data::Dumper;
#print Dumper(\%dat);
#print Dumper(\%cache);

my %ip_to_hostname;
my %proxy_ip_to_hostname;
while (my($key,$val)=each(%cache)) {
        next if $key!~/^host::([^:]+)$/;
        my $hostname=$1;
        next if !$val->{"address"};
        $ip_to_hostname{$val->{"address"}}=$hostname;
        if (my $parent_hostname=$val->{"parents"}) {
                my $parent_record=$cache{"host::$parent_hostname"}
                                or die "No record for: $parent_hostname for: $hostname";
                my $parent_ip=$parent_record->{"address"}
                                or die "No address for: $parent_hostname for: $hostname";
                $proxy_ip_to_hostname{$parent_ip}{$val->{"address"}}=$hostname;
                }
        }
#print Dumper(\%ip_to_hostname);
#print Dumper(\%proxy_ip_to_hostname);

die "Expecting -H <IP> [--proxy <IP>] [-p <port>] [...]" if @ARGV<2 || shift ne "-H";
my $ip=shift;
my $proxy;
do { shift; $proxy=shift;                          } if $ARGV[0] && $ARGV[0] eq "--proxy";
do { shift; ($proxy ? $proxy : $ip).=" -p ".shift; } if $ARGV[0] && $ARGV[0] eq "-p";

my $hostname;
if (!$proxy) {
        die "Unknown ip: $ip" if !($hostname=$ip_to_hostname{$ip});
        }
else {
        die "Unknown ip: $ip over proxy: $proxy" if !($hostname=$proxy_ip_to_hostname{$proxy}{$ip});
        }
my $state;
my $state_service;
for (qw(Connectivity SSH)) {
        # "current_state"
        next if !defined(($state=$dat{"service::${hostname}::$_"}{"last_hard_state"}));
        $state_service=$_;
        last;
        }
die "No state for: $hostname" if !defined $state;
die "Weird state: $state" if $state!~/^[012]$/;

print "State $state_service $state copy for IP $ip hostname $hostname\n";

exit $state;
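
For illustration, the core lookup that the Perl script above performs (reading
last_hard_state for one service from status.dat) can be sketched in shell with
awk. It assumes the block layout the Perl parser expects: a "service {" line,
key=value lines and a closing "}". The function name last_hard_state and its
arguments are hypothetical, and the parent/proxy handling is left out.

```shell
#!/bin/sh
# Sketch only: print last_hard_state for one host/service pair.
# Usage: last_hard_state <status-file> <host_name> <service_description>
# Returns 0 and prints the state if found, returns 1 otherwise.
last_hard_state() {
    awk -v host="$2" -v svc="$3" '
        # Start of a service block: reset the fields collected so far.
        /^[ \t]*service[ \t]*[{]/ { in_svc = 1; h = ""; s = ""; st = ""; next }
        # End of any block: report the state if this was the wanted service.
        /^[ \t]*[}]/ {
            if (in_svc && h == host && s == svc && st != "") {
                print st; found = 1; exit
            }
            in_svc = 0; next
        }
        # Inside a service block: split key=value at the first "=".
        in_svc {
            line = $0
            sub(/^[ \t]+/, "", line)
            eq = index(line, "=")
            if (eq == 0) next
            key = substr(line, 1, eq - 1)
            val = substr(line, eq + 1)
            if (key == "host_name") h = val
            else if (key == "service_description") s = val
            else if (key == "last_hard_state") st = val
        }
        END { exit found ? 0 : 1 }
    ' "$1"
}
```

A host check built on this would exit with the printed value, as the Perl
script does with its "exit $state".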

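#! /bin/bash
# Debug wrapper: run the original check, log its timing, arguments, exit
# status and output, then pass the output and exit status through unchanged.
date="`date --rfc-3339=seconds`"        # start time
t=/tmp/check-host-alive-tmp.$$          # per-process temp file prefix
rm -f $t.*
/home/jklabs/nagios/libexec/check-host-alive-orig "$@" >$t.1 2>$t.2
rc=$?
# Log line: start time, end time, rc, arguments, stdout as 1{...}, stderr as 2{...}
echo >>/tmp/check-host-alive.log "$date: `date --rfc-3339=seconds`: rc=$rc $* 1{`tr '\n' '|' <$t.1`} 2{`tr '\n' '|' <$t.2`}"
cat $t.1
cat >&2 $t.2
rm -f $t.*
exit $rc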

--- End Message ---

--- Begin Message ---

Hi Jan,
another way is to use "on-demand macros" as check_cluster does.
Have a look at http://nagios.sourceforge.net/docs/2_0/clusters.html

An example:

Every host has a service called "Connectivity" which reflects the host state.
This service is executed periodically, every 2 minutes.

The Host check_command is defined in this way:

define command{
command_name check-host-adaptive
command_line $USER1$/check_dummy $ARG1$ "$ARG2$"
}

The command in your hosts.cfg looks like

define host{
.....
check_command check-host-adaptive!$SERVICESTATEID:server1:Connectivity$!$SERVICEOUTPUT:server1:Connectivity$
host_name server1
alias server1
address 172.0.0.1
.....
}

In this way the host check_command produces the same state as the
"Connectivity" Service on the same Host.
This is done by the "on-demand" Macros.

This is still a quick hack, but it works ;-)
This is IMHO the fastest way without parsing status.dat.

Greetings from Germany
Jörg Linge

_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting
any issue.
::: Messages without supporting info will risk being sent to /dev/null

--- End Message ---

--- Begin Message ---

On Monday 22 May 2006 12:03, Joerg Linge wrote:
> On Monday 22 May 2006 11:39, Jan Kratochvil wrote:
> > > This is still a quick hack, but it works ;-)
> > > This is IMHO the fastest way without parsing status.dat.
> >
> > Do you have it tested? I do not think it will work. The goal is to
> > reflect the field "last_hard_state", not the field "current_state".
> > "SERVICESTATEID" unfortunately corresponds to "current_state" while
> > "last_hard_state" has no corresponding macro.
>
> This was only a quick example!
> $SERVICESTATETYPE$ gets the state type ( HARD/SOFT )

Hmm, $SERVICESTATETYPE$ always reports a HARD state.

If you use $SERVICEATTEMPT:server1:Connectivity$ you will get the current
check attempt number for your Connectivity service.

So you can return a CRITICAL state to your host check_command once the maximum
number of service check attempts is reached.

In my test environment this works fine at the moment.

I have written a small plugin which returns CRITICAL if
$SERVICESTATEID != 0 && $SERVICEATTEMPT >= 3

Jörg
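
The rule above could be sketched as a small shell plugin. This is not the
plugin mentioned in the message; the function name check_host_by_service, the
argument order and the default threshold of 3 are assumptions made for
illustration.

```shell
#!/bin/sh
# Sketch only: report CRITICAL when the service is not OK and has already
# been retried at least <threshold> times; otherwise report OK.
# Usage: check_host_by_service <service_state_id> <service_attempt> [threshold]
check_host_by_service() {
    state_id=$1
    attempt=$2
    threshold=${3:-3}     # assumed default, matching the ">= 3" rule above
    if [ "$state_id" -ne 0 ] && [ "$attempt" -ge "$threshold" ]; then
        echo "CRITICAL - service state $state_id after $attempt attempts"
        return 2          # Nagios plugin exit code for CRITICAL
    fi
    echo "OK - service state $state_id, attempt $attempt"
    return 0              # Nagios plugin exit code for OK
}
```

In a command definition the two arguments would be filled from the on-demand
macros $SERVICESTATEID:server1:Connectivity$ and
$SERVICEATTEMPT:server1:Connectivity$.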

--- End Message ---

