I am currently using alerts keyed on exit codes to send email to
different parties depending on the type of error. For instance, I fairly
often get SNMP timeouts on heavily loaded systems. I'd like to know about
these, but they aren't errors that the sysadmins need to deal with. My
config file looks something like this:

    monitor netsnmp-exec.monitor BLAH BLAH BLAH
    period _ANYTIME_
        numalerts 1
        alert exit=1 mail.alert _UNIX_ADMIN_
        alert exit=2 mail.alert _MON_ADMIN_

The problem is, I also want to use upalerts. Currently, upalerts get called
whenever the exit code changes from non-zero to zero, so there is no way to
direct them to the relevant parties. Either everyone gets upalerts, which
is very confusing since not everyone received the original alert, or
nobody does.
What would make the most sense is being able to write something like this:

    monitor netsnmp-exec.monitor BLAH BLAH BLAH
    period _ANYTIME_
        numalerts 1
        alert exit=1 mail.alert _UNIX_ADMIN_
        upalert exit=1 mail.alert _UNIX_ADMIN_
        alert exit=2 mail.alert _MON_ADMIN_
        upalert exit=2 mail.alert _MON_ADMIN_

where "upalert exit=n" would mean "call this upalert if the exit code
changed from n to zero".
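
To make the proposed semantics concrete, here is a rough Python sketch of
the rule. This is not mon code (mon itself is written in Perl); the
UPALERTS table and the upalert_recipients function are hypothetical names
made up for illustration, and the recipients just mirror the config above.

```python
# Hypothetical model of the proposed "upalert exit=n" rule; not part of mon.
# Maps a failure exit code to the parties whose upalert is keyed on it.
UPALERTS = {
    1: ["_UNIX_ADMIN_"],
    2: ["_MON_ADMIN_"],
}


def upalert_recipients(previous_exit, current_exit):
    """Return who gets an upalert for one transition between exit codes."""
    # An upalert fires only on a change from failure (non-zero) to success (0).
    if previous_exit == 0 or current_exit != 0:
        return []
    # Only the upalerts keyed on the exit code we just left are called.
    return list(UPALERTS.get(previous_exit, []))
```

With this rule, a change from 1 to 0 notifies only _UNIX_ADMIN_, a change
from 2 to 0 notifies only _MON_ADMIN_, and a change from 1 to 2 notifies
nobody, since the monitor is still failing.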
The only problem with this approach arises if the monitor goes from one
failure state to another before returning to success. For example, suppose
it goes from exit code 1 (monitor error, _UNIX_ADMIN_ gets an alert) to
exit code 2 (protocol error, _MON_ADMIN_ gets an alert) to exit code 0
(success). Since the last transition was from 2 to 0, only _MON_ADMIN_
gets an upalert. I guess the right way to handle this would be for mon to
keep a list of the failure states seen since the last success, and to send
upalerts for all of them when the exit code returns to zero.
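
The bookkeeping I have in mind could look roughly like the Python sketch
below. Again, this is only an illustration and not mon code: the
UpalertTracker class, its record method, and the UPALERTS table are all
hypothetical names, and the sketch assumes each exit code maps to a fixed
list of recipients as in the config above.

```python
# Hypothetical sketch of tracking unresolved failure states; not part of mon.
UPALERTS = {
    1: ["_UNIX_ADMIN_"],
    2: ["_MON_ADMIN_"],
}


class UpalertTracker:
    def __init__(self, upalerts):
        self.upalerts = upalerts
        # Non-zero exit codes seen since the last successful run.
        self.pending = set()

    def record(self, exit_code):
        """Record one monitor run and return the upalert recipients, if any."""
        if exit_code != 0:
            # Still failing: remember this failure state, send no upalert.
            self.pending.add(exit_code)
            return []
        # Success: notify everyone tied to a failure state seen since the
        # last success, without listing any recipient twice.
        recipients = []
        for code in sorted(self.pending):
            for who in self.upalerts.get(code, []):
                if who not in recipients:
                    recipients.append(who)
        self.pending.clear()
        return recipients
```

Feeding this the sequence 1, 2, 0 returns nothing for the two failures and
then both _UNIX_ADMIN_ and _MON_ADMIN_ on the success. A further 0 returns
nothing, because the pending set was cleared. The sketch does not consider
whether the original alert was actually delivered (for example, if it was
suppressed by numalerts); a real implementation would probably want to
record a failure state only once its alert has been sent.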
I realize it would probably be a lot of work to hack this into mon, but it
would be a really useful feature.