One frustration I have is that checks report a DOWN status if they
experience any error whatsoever, such as a timeout. A case in point is
the "CountFiles" external COM check, which checks for the existence of
certain files, and in my case should alarm if any files (count > 0) are
found.

What it actually does is alarm if the files are "not not found", i.e. if
for any reason it can't count the files it thinks "Aha! DOWN" and sends
an alarm anyway. NTProcess checks seem to behave the same way: if they
fail to disprove the negative, they interpret this as a positive and send
an alarm, rather than handling the error (timeout, logon failure, etc.).

Sometimes I don't have a couple of weeks to run a new check in the test
environment before it is needed in production. Any ideas on how to
avoid false positives with CountFiles, or even generally?

To pick up on yesterday's thread of ideas around alarm management etc. in
future versions, it might be useful to make it possible to communicate
via the same alarm mechanism. False positives have a large "crying wolf"
impact on the credibility of alarms, which dulls the team's reaction and
reduces the effectiveness of Servers Alive. Sometimes I rely more on
external scripts returning errorlevels, which I can tune more finely,
even if the check type is built in.
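
As a rough illustration of the external-script approach, here is a
minimal Python sketch that keeps "files found" separate from "could not
check". The exit code values and the names count_files and check are my
own assumptions for the example, not anything defined by Servers Alive,
and how a given errorlevel maps to UP or DOWN would still have to be
configured in the check itself.

```python
import os
import sys

# Hypothetical errorlevels for this sketch; adjust to whatever the
# monitoring tool is configured to expect.
EXIT_OK = 0       # directory was readable and no matching files exist
EXIT_ALARM = 1    # directory was readable and matching files were found
EXIT_UNKNOWN = 2  # the check itself failed (missing path, access denied, ...)


def count_files(directory, suffix=""):
    """Count regular files in `directory` whose names end with `suffix`.

    Raises OSError if the directory cannot be read.
    """
    with os.scandir(directory) as entries:
        return sum(
            1 for entry in entries
            if entry.is_file() and entry.name.endswith(suffix)
        )


def check(directory, suffix=""):
    """Return an errorlevel, keeping check failures apart from real alarms."""
    try:
        found = count_files(directory, suffix)
    except OSError:
        # Could not count: report "unknown" instead of pretending files exist.
        return EXIT_UNKNOWN
    return EXIT_ALARM if found > 0 else EXIT_OK


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit(EXIT_UNKNOWN)
    file_suffix = sys.argv[2] if len(sys.argv) > 2 else ""
    sys.exit(check(sys.argv[1], file_suffix))
```

The point of the third errorlevel is that a timeout or logon failure can
be routed to a warning, or retried, rather than raising the same alarm
as a genuine hit.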

//Steve