[EMAIL PROTECTED] writes:
> 
> Currently, when I use the "depend" keyword to suppress alerts after a
> failure, I occasionally get an alert from the dependent monitor because the
> service it depends on goes down before its monitor is next run.

This problem has been discussed several times on the list in the past.
The typical "fix" is to check the lower-level services at least
twice as often as the upper-level ones - for example, I check ping
every 30 or 45 seconds, and TCP layer services no more than every
3 minutes.  Of course, this just drastically shortens the window
for the "extra" alarms, when we'd rather eliminate it entirely.

How to fix this properly?  Here's one suggestion (clipped from
a mon discussion in March 2001):

Ed Ravin wrote:
> *) "make sure all dependencies are current before alerting" - before
> sending an alert for a problem, make sure that all tests that the currently
> failed test depends on have been tested AFTER the item in question.  For
> example, if mailserver:smtp depends on mailserver:ping and mailrouter:ping,
> and mailserver:smtp has failed at 10:00 PM, suppress the alert for
> mailserver:smtp until the other two tests have been run.

Jim Trocki wrote:
> Good behavior. This minimizes alerts, imposes no extra burden on
> the network, and minimal burden on the CPU (a few extra conditionals
> evaluated, depending on how elaborate your dependencies are).  The
> trade-off is that it delays the alert for a failure in order to be sure
> all the dependencies are satisfied. If you do not carefully configure
> your poll intervals, you'll surely get delayed alerts.  This isn't bad,
> however. It's probably a good practice to sort your dependencies into
> a tree, where the services which are root dependencies get checked more
> frequently than the lower levels.
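
To make the delay trade-off concrete, here is a rough sketch (plain
Python for illustration, not mon code) of how long an alert would be
held under this scheme.  The function names, the (last_run, interval)
bookkeeping, and the assumption that tests run exactly on schedule and
finish instantly are all mine, not anything mon provides.

```python
import math

def next_run_after(last_run, interval, t):
    """First scheduled run strictly after time t, for a test that last
    ran at last_run and repeats every interval seconds (hypothetical
    fixed schedule)."""
    if last_run > t:
        return last_run
    periods = math.floor((t - last_run) / interval) + 1
    return last_run + periods * interval

def alert_delay(failure_time, deps):
    """Seconds the alert is held while waiting for every dependency
    to be tested after failure_time.

    deps maps a dependency name to (last_run, interval), in seconds.
    """
    ready = max(
        (next_run_after(last_run, interval, failure_time)
         for last_run, interval in deps.values()),
        default=failure_time,  # no dependencies: alert immediately
    )
    return ready - failure_time

# smtp fails at t=600; both ping tests are polled frequently.
fast = {"mailserver:ping": (590, 30), "mailrouter:ping": (575, 45)}
# Same failure, but mailserver:ping is only polled every 3 minutes.
slow = {"mailserver:ping": (590, 180), "mailrouter:ping": (575, 45)}

print(alert_delay(600, fast))
print(alert_delay(600, slow))
```

With the frequent ping schedule the alert is held for 20 seconds; with
the 3-minute ping it is held for 170, which is why the root of the
tree wants the shortest interval.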

Back in the present, [EMAIL PROTECTED] wrote:
> It would be nice if there was some way to force mon to run all of the
> monitors in a dependency tree, starting from the top of the tree, when an
> error is detected. This would completely eliminate any spurious alerts and
> make it clear what the underlying problem is.

That was also discussed back in March:

Jim Trocki wrote:
> *) "force all dependencies to be tested before alerting"

> Not good behavior. It accomplishes the task of minimizing alerts due
> to dependency problems, but at an extreme cost when failures occur.
> I can imagine a not-too-complicated setup where frequently failing leaf
> nodes trigger large system load because all the service checks of the
> dependency graph get scheduled. The trade-off is that it minimizes the
> delay between the time a failure is detected and the alert is sent,
> at the expense of lots of extra CPU cycles.

I had also proposed a compromise, of sorts:

> *) "depend_maxage parameter" - allow user to specify in mon.cf how "old"
> a dependency can be before it should be re-tested (or waited for) - for
> example, "depend_maxage 30" means that if a dependency for a currently
> failed test has not been tested in the last 30 seconds, make sure it is
> re-tested before alerting.

I think this parameter, if implemented, would let the system
administrator choose exactly how to trade off prompt alerts
against accurate ones.  Here's a possible specification:

    depend_maxage SECONDS
        When a service fails, do not send an alert unless all of
        the services it depends on have also been tested within
        the past SECONDS seconds.  If SECONDS is zero, the alert
        for this service is not sent until the scheduler detects
        that all of the services it depends on have been tested
        at least once since the first_failure time of this service.
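
The freshness check this implies is small.  Here is a minimal sketch
in Python (for illustration only; mon itself is Perl and none of these
names exist in it).  The function name, the dictionary of last-tested
timestamps, and the reading of "since" as "at or after first_failure"
are my assumptions.  It only decides whether the dependency results
are fresh enough to act on; the usual "depend" evaluation would still
decide whether the alert is suppressed.

```python
def alert_allowed(now, first_failure, dep_last_tested, depend_maxage):
    """Return True if every dependency result is fresh enough to alert on.

    now, first_failure: timestamps in seconds.
    dep_last_tested: dependency name -> time of its most recent
        completed test, or None if it has never been tested.
    depend_maxage: the proposed parameter, in seconds; 0 means
        "tested at least once since first_failure".
    """
    for tested in dep_last_tested.values():
        if tested is None:
            return False            # never tested: keep waiting
        if depend_maxage == 0:
            if tested < first_failure:
                return False        # not re-tested since the failure
        elif now - tested > depend_maxage:
            return False            # result is older than allowed
    return True

# mailserver:smtp first failed at t=600; it is now t=620.
deps = {"mailserver:ping": 590, "mailrouter:ping": 610}

print(alert_allowed(620, 600, deps, 30))  # both tested within 30 s
print(alert_allowed(620, 600, deps, 0))   # mailserver:ping predates failure
```

The first call allows the alert, the second holds it until
mailserver:ping has run again.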

Now all we need is someone who wants to code it!
