Matuskiewicz, Philip wrote:
> Hi Markus,
>
> I agree that it is the job of the plugin to determine the state of the
> service, but only when it can return a definitive result. The use case here
> is very unique to the MTA.
>
> My plugin doesn't return 255 (its set to return -1), Icinga automatically
> sets it to 255 when it terminates the plugin after a timeout period (for a
> reason that I haven't determined yet due to lack of log data).
>
> The problem is, when the plugin can't determine the current state, it tries
> querying Icinga's status file (through MKLiveStatus for the Last State), and
> if the monitoring server is under high load, this query times out also. If
> this fails, Icinga kills the plugin, the state of the service is set to
> whatever the timeout state is in the configuration file (I set it to
> Unknown). The core ONLY lets you set the state to one of the 4 options (OK,
> Warning, Critical, or Unknown), there is no way to not change the previous
> state.

even if the core sets an alarm signal (the infamous service check timeout
warning), you can set such a signal in your plugin as well, applying your own
shorter timeout and returning the previous state when it fires. e.g. perl-ish

    # Setting timeout
    $SIG{ALRM} = sub {
        print "$NAME timed out after $opt{timeout} seconds\n";
        # exit with the previous state here instead if you want to keep it
        exit $UNKNOWN;
    };
    alarm $opt{timeout};

so it's really up to you and your plugin to fix the overall behaviour before
the core pulls the alarm trigger. still, it does not hurt to increase the
service check timeout to 120 seconds in various environments, letting the
underlying check plugins work a bit more "efficiently" in a bigger time
window.

>
> In the MTA's use case, If a bus was previously marked as up or down, its
> state will become Unknown (as configured now), and send a round of
> notification emails due to the state change (and for 6,000 buses, this
> equates to 12,000 emails each for up and down since RDS tends to have
> problems for hours at a time). Furthermore, any previous acknowledgments are
> erased and we lost all of our tracking.
>
> As for your suggestion about the $LASTSERVICESTATE$ macro, I'll attempt that
> route, but I'm still concerned that CURL might cause this 255 error to occur
> because the script timed out before it could return a status to Icinga.

btw - -1 is not a valid exit code for a plugin (on POSIX systems the exit
status is truncated to 8 bits, which turns -1 into 255), and it is therefore
treated as you described ("out of bounds").
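
to make the two points above a bit more concrete (own timeout inside the
plugin, and only ever returning 0..3), here is a rough Python sketch. it is an
illustration under assumptions, not a drop-in plugin: the names run_check,
parse_last_state and PluginTimeout are made up for this example, and it
assumes the command definition passes the $LASTSERVICESTATE$ macro to the
plugin as an argument, so the plugin does not have to query the core while it
is under load.

```python
import signal

# The four states the core accepts as plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_BY_NAME = {"OK": OK, "WARNING": WARNING,
                 "CRITICAL": CRITICAL, "UNKNOWN": UNKNOWN}


class PluginTimeout(Exception):
    """Raised by our own SIGALRM handler (hypothetical helper)."""


def _on_alarm(signum, frame):
    raise PluginTimeout()


def parse_last_state(raw):
    """Map a $LASTSERVICESTATE$ value such as "CRITICAL" to its exit code.

    Anything unrecognised (e.g. an unexpanded macro) becomes UNKNOWN.
    """
    return STATE_BY_NAME.get(str(raw).strip().upper(), UNKNOWN)


def run_check(check, last_state, timeout):
    """Run check() under our own timer; keep last_state if it expires.

    check is any callable returning (state, message); timeout is in
    seconds and should be shorter than the core's service check timeout.
    """
    signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout)
    try:
        state, message = check()
    except PluginTimeout:
        return last_state, "timed out after %gs, keeping last state" % timeout
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the pending timer
    if state not in STATE_BY_NAME.values():
        # never hand something like -1 to the core
        return UNKNOWN, "invalid state %r from check" % (state,)
    return state, message
```

the plugin would then print the returned message and call sys.exit() with the
returned state. since the state handed back on a timeout is the one the core
already has, this should not count as a state change, provided the plugin's
own timeout fires before the core's does.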
http://docs.icinga.org/latest/en/pluginapi.html

kind regards,
Michael

-- 
DI (FH) Michael Friedrich

Vienna University Computer Center
Universitaetsstrasse 7 A-1010 Vienna, Austria

email: michael.friedr...@univie.ac.at
phone: +43 1 4277 14359
mobile: +43 664 60277 14359
fax: +43 1 4277 14338
web: http://www.univie.ac.at/zid
     http://www.aco.net

Lead Icinga Core Developer
http://www.icinga.org

_______________________________________________
icinga-users mailing list
icinga-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/icinga-users