One frustration I have is that checks report a DOWN status if they experience any error whatsoever, for example a timeout. A case in point is the "CountFiles" external COM check, which checks for the existence of certain files and, in my case, should alarm if any files are found (count > 0).
What it does is alarm if the files are "not not found", i.e. if for any reason it can't count the files it thinks "Aha! DOWN" and sends an alarm anyway. NTProcess checks seem to behave the same way: if they fail to disprove the negative they interpret this as a positive and send an alarm, rather than handling the error (timeout, logon failure, etc).

Sometimes I don't have a couple of weeks to run a new check in the test environment before it is needed in production - any ideas on how to avoid false positives with CountFiles, or even generally?

To pick up on yesterday's thread of ideas around alarm management etc in future versions, it might be useful if errors like these could be communicated via the same alarm mechanism. False positives have a large "crying wolf" impact on the credibility of alarms, which reduces the reaction of the team and the effectiveness of Servers Alive. Sometimes I rely more on external scripts returning errorlevels, which I can tune more finely, even if the check type is built in.

//Steve
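
P.S. To illustrate what I mean by an external script with finer-grained errorlevels, here is a minimal sketch in Python. It is not how CountFiles works internally, and the exit code values (0, 1, 2), the function names and the retry settings are all my own assumptions for illustration. How each errorlevel maps to UP or DOWN would still have to be set up in the check definition. The point is only that "files found" and "could not count the files" come back as different results, and that a transient error gets retried before anything is reported.

```python
import fnmatch
import os
import sys
import time

# Hypothetical errorlevel convention (an assumption, not a Servers Alive standard).
EXIT_OK = 0            # directory was readable and no matching files were found
EXIT_FILES_FOUND = 1   # at least one matching file was found: the real alarm case
EXIT_CHECK_FAILED = 2  # the count could not be performed (path missing, access denied, ...)


def count_files(directory, pattern="*"):
    """Count regular files in `directory` whose names match `pattern`.

    Raises OSError if the directory cannot be read, so the caller can tell
    "zero files" apart from "could not look".
    """
    with os.scandir(directory) as entries:
        return sum(
            1 for entry in entries
            if entry.is_file() and fnmatch.fnmatch(entry.name, pattern)
        )


def check(directory, pattern="*", attempts=3, delay=2.0):
    """Return an errorlevel, retrying on errors before giving up."""
    for attempt in range(attempts):
        try:
            found = count_files(directory, pattern)
        except OSError:
            # Transient problems (network share hiccup, logon failure) get
            # another chance instead of being reported as "files found".
            if attempt < attempts - 1:
                time.sleep(delay)
            continue
        return EXIT_FILES_FOUND if found > 0 else EXIT_OK
    return EXIT_CHECK_FAILED


# Only act as a check when a directory is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    file_pattern = sys.argv[2] if len(sys.argv) > 2 else "*"
    sys.exit(check(sys.argv[1], file_pattern))
```

Run as, say, `python countfiles_check.py D:\spool\errors *.err` (script name and path made up). Errorlevel 1 can then drive the alarm, while errorlevel 2 can be handled separately, for instance ignored or logged.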
