--On Wednesday, March 19, 2003 9:08 AM +0100 Mark Lawrence <[EMAIL PROTECTED]> wrote:
> By the way David, it looks like your patch changes the content of the standard input given to the alert, but does it actually modify the parameters for the MON_LAST.. variables?
Ahh, that's what I get for posting a patch without looking completely at the results.
You're right, that patch changes the standard input passed to the alert (which is what most alerts I've seen actually output as their detail message). MON_LAST_OUTPUT remains the *previous* output, which seems logical given the variable name. This lets the alert include both, if it so desires.
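To illustrate the "include both" case, here is a minimal sketch of an alert body builder. It assumes the behaviour described above: the current monitor output arrives on standard input and the previous output is in the MON_LAST_OUTPUT environment variable. The function names and the message layout are my own invention for illustration, not anything mon ships.

```python
import os
import sys


def build_alert_body(current_output, previous_output):
    """Combine the current monitor output with the previous one (if any)."""
    lines = ["Current status:", current_output.strip() or "(no output)"]
    if previous_output:
        # Assumption: MON_LAST_OUTPUT holds the *previous* monitor output.
        lines += ["", "Previous output (MON_LAST_OUTPUT):", previous_output.strip()]
    return "\n".join(lines)


def main(stdin=sys.stdin, environ=os.environ, out=sys.stdout):
    """Hypothetical alert entry point: stdin is current, environment is previous."""
    current = stdin.read()
    previous = environ.get("MON_LAST_OUTPUT", "")
    out.write(build_alert_body(current, previous) + "\n")
```

A real alert script would end by calling main() and would mail or page the result rather than print it.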
> Ever since my first upalert I've felt that something is not quite right about the information presented. The other people in my department have the same feeling. The message is just confusing - "Is it up? But it says that hosts are still unreachable!"
I agree, and my users had the same response. An OK message that in the body says the server is still down tends to confuse people. In fact I'd say they threatened to lynch our mon team if we didn't fix it.
I think the best answer is to have per-host status tracking. This topic came up on the mailing list a while back, and I said I'd write the code. Sadly that rewrite hasn't started yet, and isn't going to start for another couple of months. But the goal will be results similar to the following:
Host A goes down:
  Old: ALERT for Host A
  New: ALERT for Host A

Host B goes down:
  Old: ALERT for Host A and B
  New: New alert type STATUSCHANGEALERT for the hostgroup that says Host B now down, Host A still down

Host B comes back up:
  Old: ALERT for Host A
  New: STATUSCHANGEALERT -> Host B OK, Host A still down

Host A gets disabled by a user:
  Old: on next monitor invocation, UPALERT
  New: immediate STATUSCHANGEALERT (or perhaps some other new type): -> Host A now disabled, by user X. Remaining Hosts all OK.
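Since the rewrite hasn't started, there is no real code to show, but the idea behind the "New" column can be sketched roughly as follows. This assumes each host's status is kept as a simple string between monitor runs; the function name, the dict layout, and the exact message wording are hypothetical and only cover the status-change case, not the initial ALERT.

```python
def status_change_message(previous, current):
    """Build a STATUSCHANGEALERT line from two per-host status snapshots.

    previous/current map host name -> status string ("OK", "down", ...).
    Returns None when no host changed status.
    """
    changed = []
    unchanged = []
    for host in sorted(current):
        new = current[host]
        old = previous.get(host)
        if old != new:
            changed.append(f"Host {host} now {new}")
        elif new != "OK":
            # Hosts that are still failing get repeated for context.
            unchanged.append(f"Host {host} still {new}")
    if not changed:
        return None
    return "STATUSCHANGEALERT: " + ", ".join(changed + unchanged)


before = {"A": "down", "B": "OK"}
after = {"A": "down", "B": "down"}
print(status_change_message(before, after))
```

With the snapshots above this reports Host B as newly down and Host A as still down, which is the message my users actually want to see.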
This also will allow for several other useful changes, like this:
Host A goes down: same as above
Failure gets ack'ed by user:
  ACK-ALERT, including the text of the ACK

Host B fails:
  Old: No Alert (bad! I've changed this in my mon environment)
  New: STATUSCHANGEALERT: Host B down, Host A still down, but acked.

Host B comes back up:
  Old: Still no alert (we're still acked)
  New: STATUSCHANGEALERT: Host B OK, Host A still down, but acked.
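The important point in the acked case is that an ack on Host A should annotate A's line, not suppress the alert about Host B. A rough sketch of that, assuming the ack is stored per host alongside the status (the HostState class and all names here are hypothetical, not existing mon structures):

```python
from dataclasses import dataclass


@dataclass
class HostState:
    status: str          # e.g. "OK" or "down"
    acked: bool = False  # True once a user has acknowledged the failure


def describe(host, state, changed):
    """Render one host's line, noting an ack on a host that is still failing."""
    verb = "now" if changed else "still"
    text = f"Host {host} {verb} {state.status}"
    if state.acked and state.status != "OK":
        text += ", but acked"
    return text


def status_change_with_acks(previous, current):
    """Return a STATUSCHANGEALERT line, or None if no host changed status."""
    changed, rest = [], []
    for host in sorted(current):
        state = current[host]
        old = previous.get(host)
        if old is None or old.status != state.status:
            changed.append(describe(host, state, True))
        elif state.status != "OK":
            rest.append(describe(host, state, False))
    if not changed:
        return None
    return "STATUSCHANGEALERT: " + ", ".join(changed + rest)
```

Here an acked, still-down Host A never blocks the message; only a change in some host's status triggers one.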
Another useful feature will be the ability to avoid the following:
Host A goes down, and we ALERT for the inability to ping it. A comes back, and we send an UPALERT for A. But the host is still booting and the web server isn't up yet, so we then alert for http.
The fix for this will be per-host dependency memory. The dependency of group:http on group:ping will be per-host, and have a configurable look-back feature. So you could configure group:http to not alert if the individual host has failed the ping test within the last 5 minutes, for example.
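The look-back check itself is simple once the per-host failure times exist. A minimal sketch, assuming we record the time of each host's most recent ping failure as a Unix timestamp (the function name, the dict, and the 300-second default are illustrative assumptions, not a planned config syntax):

```python
def should_alert_http(host, now, last_ping_failure, lookback_seconds=300):
    """Decide whether an http failure on `host` should raise an alert.

    last_ping_failure maps host name -> timestamp (seconds) of that host's
    most recent failed ping test. The http alert is suppressed while that
    failure is still inside the look-back window.
    """
    last = last_ping_failure.get(host)
    if last is not None and now - last <= lookback_seconds:
        return False  # host failed ping recently; it is probably still booting
    return True


failures = {"A": 1000}
print(should_alert_http("A", 1120, failures))  # 2 minutes after the ping failure
print(should_alert_http("A", 1400, failures))  # well past the 5 minute window
print(should_alert_http("B", 1120, failures))  # B never failed ping
```

Only the first call is suppressed, because only Host A failed ping within the window; Host B's http failure alerts normally even though A is in the same group.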
I already have something very much like this in place in my current Mon version, but it's really a hack. (It compares this host to the summary output of the last failure of the dependency.) It works 90% of the time, but isn't the truly correct solution. Per-host status tracking is.
Assuming management doesn't direct our efforts elsewhere, I expect the per-host status work will start sometime this summer. And given the strong desire in my user community for better per-host tracking, including for logging purposes, I don't expect management to do that.
-David Nolan
Network Software Developer
Computing Services
Carnegie Mellon University
_______________________________________________ mon mailing list [EMAIL PROTECTED] http://linux.kernel.org/mailman/listinfo/mon