I see a lot of transient errors on the services and hosts I'm monitoring, so finding ways to keep notifications from going out for situations that will resolve themselves is a real issue.
I've played with how many failures in a row are needed to cause a notification, and have that set differently for things I'm monitoring across long links (Beijing, say) compared to things I'm monitoring locally or in New York. Of course, one problem with that is that it takes longer before a real problem causes a notification. Right now it takes over 15 minutes for the total failure of our link to Beijing to cause a notification.

For things that are numeric values, I can play with the critical and warning ranges to potentially reduce false positives. That, at least, doesn't slow down recognition of total failures.

Some things just don't seem to fit the Nagios model -- for example, it's quite normal for the SQL server to pull 100% of the CPU for periods now and then, but if it goes on too long, *that's* unusual. Hmm; I suppose I could override the number of failures needed to cause a notification in the service definition for those, couldn't I?

There may be some things I should just stop monitoring (where there aren't clear-cut "okay" and "bad" behaviors that I can quantify).

I guess I'm wondering whether there are useful basic approaches to handling this problem that I'm missing, or whether I just need to work through the details more carefully.

I'm startled at how often I get isolated failures for no apparent reason. Is that normal for most people monitoring services? I think my connections are timing out now and then simply due to load, even though the load is not actually high at all.

--
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
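
For reference, the per-service override wondered about above might look roughly like the fragment below. This is only a sketch, assuming Nagios 3.x directive names (check_interval, retry_interval, max_check_attempts) and the default interval_length of 60 seconds; the template, host name, service description, check command, and all numeric values are hypothetical placeholders, not taken from any real configuration.

```
# Hypothetical example: every name and value here is a placeholder.
define service {
    use                     generic-service       ; assumed template name
    host_name               sql-server-01         ; hypothetical host
    service_description     CPU Load
    check_command           check_nrpe!check_cpu  ; hypothetical command
    check_interval          5    ; minutes between checks while the service is OK
    retry_interval          2    ; minutes between rechecks after a failed check
    max_check_attempts      6    ; consecutive failed checks before a HARD state
}
```

With values like these, a sustained problem would reach a HARD state (and so become eligible for notification) about (max_check_attempts - 1) * retry_interval = 10 minutes after the first failed check, while shorter CPU spikes would recover in a SOFT state without notifying. The same arithmetic shows the trade-off described above: raising max_check_attempts to ride out transients also delays notification of real failures.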
_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users