this patch fixes "alertafter" and "numalerts" for "traptimeouts". it is
in reply to the two mails at the bottom.
the changes don't seem to have broken anything else, but just in case,
here they are in english:
1. in "&handle_trap_timeout", incrementing $sref->{"_consec_failures"}
gets "alertafter NUM" to work.
2. "&call_alert" doesn't send the alert if it is passed an "undef" $output
or $retval, so the "do_alert" call now passes "NO OUTPUT" and 1 instead.
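the two points above can be sketched as a toy model. this is not mon's actual code: the hash keys mirror the patch, but "should_alert" and "send_alert" are hypothetical stand-ins, and the threshold check is only my assumption about how "alertafter NUM" consults the counter.

```perl
use strict;
use warnings;

# Hypothetical stand-in for mon's per-service state hash.
my $sref = { _failure_count => 0, _consec_failures => 0 };

# Assumed gate: "alertafter NUM" passes once NUM consecutive failures are seen.
sub should_alert {
    my ($s, $alertafter) = @_;
    return $s->{_consec_failures} >= $alertafter ? 1 : 0;
}

# Assumed guard: an alert with an undef output or retval is silently dropped.
sub send_alert {
    my ($output, $retval) = @_;
    return 0 unless defined $output && defined $retval;
    return 1;
}

# Simulate two trap timeouts with the patched bookkeeping.
my @fired;
for my $timeout (1 .. 2) {
    $sref->{_failure_count}++;
    $sref->{_consec_failures}++;    # the line added by the patch
    push @fired, $timeout
        if should_alert($sref, 2) && send_alert("NO OUTPUT", 1);
}
print "alerts fired at timeouts: @fired\n";
```

without the increment the counter stays at 0 and the gate never opens; with "undef" arguments the guard drops the alert even when the gate is open.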
now the following works, where before no alert would be sent if the
heartbeat stopped:
watch remote-group
    service heartbeat
        traptimeout 10s
        period wd {Sun-Sat}
            alert test.alert tscanlan
            upalert test.alert -u tscanlan
            alertafter 2
            numalerts 3
-Tom Scanlan
OpenReach, Inc.
Network Operations
office: 732-254-0210 x-6022
cell: 732-682-3365
----
RFP:
-----------------------------------------------------------------------------
Date: Tue, 13 Nov 2001 14:54:22 +0100
From: "Peter Wirdemo (EMW)" <[EMAIL PROTECTED]>
To: "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
Subject: trap timeout alerts
Hello!
I'm trying to use mon, to do a heartbeat style monitoring.
Why don't I get any alerts when the trap times out?
In mon.cgi I get:
Host Group | Service
------------------------------------
syslog     | hearbeat : trap timeout
           | (FAILED,NOALERTS)
NOALERTS??????
Mon Version:
$Id: mon 1.27 Sat, 08 Sep 2001 09:42:05 -0400 trockij $
$ProjectVersion: mon-0-99-2.6 $
Config:
watch syslog
    service heartbeat
        description heartbeat test
        traptimeout 30s
        trapduration 1s
        period wd {Sun-Sat}
            alertevery 1h
            no_comp_alerts
            alert mail.alert me@localhost
            upalert mail.alert -u me@localhost
Thanks
/Peter
-----------------------------------------------------------------------------
Date: Wed, 30 Jan 2002 12:53:46 -0500
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: alertevery does not work with traps
I'm having problems getting the alertevery variable to work with traps.
I've seen reports on this mailing list that consecutive failures do not
appear to get incremented within the trap-handling subroutine (I have not
yet looked at the code myself). However, I have not seen any mention of
alertevery not working in this scenario. The alertafter XXm variable seems
to work fine, but people are getting paged every time a failure occurs
and I desperately need to throttle this back.
Relevant portion of my config....
watch trap-webchat
    service webchat-useragent
        period FIRSTLEVEL: wd {Sun-Sat}
            alert audible.alert
            alertafter 6m
        period SECONDLEVEL: wd {Sun-Sat}
            alert bcmail.alert analyst
            alertafter 15m
            alertevery 10m
        period THIRDLEVEL: wd {Sun-Sat}
            alert bcmail.alert expert
            alertafter 30m
            alertevery 10m
        period CRISIS: wd {Sun-Sat}
            alert bcmail.alert crisis_team
            alertafter 30m
            numalerts 1
        period FOURTHLEVEL: wd {Sun-Sat}
            alert bcmail.alert management
            alertafter 50m
            alertevery 10m
Has anyone successfully gotten traps/alertevery working?
--- mon Mon Feb 25 17:03:21 2002
+++ mon.tom Mon Feb 25 17:15:34 2002
@@ -3975,6 +3975,7 @@
     my $sref = \%{$watch{$group}->{$service}};
     $sref->{"_failure_count"}++;
+    $sref->{"_consec_failures"}++;
     $sref->{"_last_failure"} = $tmnow;
     $sref->{"_first_failure"} = $tmnow if ($sref->{"_op_status"} != $STAT_FAIL);
     set_op_status ($group, $service, $STAT_FAIL);
@@ -3984,7 +3985,7 @@
     push @last_failures, "$group $service $tm $sref->{_last_summary}";
     syslog ('crit', "failure for $last_failures[-1]");
-    do_alert ($group, $service, undef, undef, $FL_TRAPTIMEOUT);
+    do_alert ($group, $service, "NO OUTPUT", 1, $FL_TRAPTIMEOUT);
 }