Steven Wagner <[EMAIL PROTECTED]> writes:

> And, of course, the direction you're probably already going in -
> writing an app in Perl (or Python or Java or C or C++ or Pascal or
> Prolog or Pilot or COBOL or ... ) to connect to gmetad, parse the
> output, and then fire off a stream of passive updates to
> Nagios/Netsaint via nsca.

Yes, that's what I did last week. It ain't no fun. Nagios' handling
of passive service checks isn't flexible enough. And passive host
checking Just Isn't Done.

> The ganglia philosophy so far has been to make things work with a
> minimum of tweaking.  Having set up three different open source
> monitoring systems over the last few years, it seems to me that it's
> nearly impossible to set up notifications without a *LOT* of tweaking.
> 
> So there's two ways of doing this, I think:
> 
> *  We need config files.  Lots of them.   [WHOOSH!]  (1 per node?)
> *  Monitoring thresholds are hard-coded as part of each metric definition.

Well, each metric could certainly come with default thresholds, and if
you use some inheritance mechanism you could rather easily specify
thresholds for all your cluster nodes:

  metacluster:
    warn if load > 5              # default load threshold
    warn if last_heard_from > 60  # default heartbeat threshold

    cluster foo:
      warn if load > 10  # Twin cpu nodes in this cluster, so
                         # double load threshold

      host odd_one:
        warn if load > 5  # Except for this node


That way, you only need to specify any exceptions from the defaults.
Whooshy enough?
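
To make the inheritance idea concrete, here is a minimal sketch in
Python. Everything in it is hypothetical (the dict layout and the names
effective_thresholds and warnings are my own, not anything ganglia
provides); it only shows that resolving a threshold amounts to merging
the scopes from least to most specific.

```python
# Hypothetical sketch: the data layout and function names are assumptions
# made for illustration, not part of ganglia or Nagios.
DEFAULTS = {"load": 5, "last_heard_from": 60}        # metacluster level
CLUSTER_OVERRIDES = {"foo": {"load": 10}}            # per-cluster exceptions
HOST_OVERRIDES = {("foo", "odd_one"): {"load": 5}}   # per-host exceptions


def effective_thresholds(cluster, host):
    """Merge thresholds from least specific to most specific scope."""
    merged = dict(DEFAULTS)
    merged.update(CLUSTER_OVERRIDES.get(cluster, {}))
    merged.update(HOST_OVERRIDES.get((cluster, host), {}))
    return merged


def warnings(cluster, host, metrics):
    """Return the sorted names of metrics above their effective threshold."""
    limits = effective_thresholds(cluster, host)
    return sorted(name for name, value in metrics.items()
                  if name in limits and value > limits[name])
```

With these tables, a load of 7 is fine on an ordinary node in cluster
foo but triggers a warning on odd_one, and a cluster with no entry at
all simply falls back to the metacluster defaults.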

What would seem to take some consideration is how to keep track of the
metacluster state.

You need state tracking, since you want edge detection so you trigger
the klaxons only when a node goes down, and *not* every five minutes
during its downtime. And for most metrics you want some hysteresis
mechanism so you don't get continuous notifications if a metric
fluctuates around the threshold.
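
As a rough illustration of the state this requires, here is a small
Python sketch that notifies only on transitions and uses two thresholds
for hysteresis. The class name ThresholdWatcher and the raise_at and
clear_at parameters are hypothetical, and keeping one such object per
host and metric is just one possible way to hold the state.

```python
# Hypothetical sketch: ThresholdWatcher, raise_at and clear_at are
# illustrative names, not part of any existing monitoring tool.
class ThresholdWatcher:
    """Track one metric and report only state changes."""

    def __init__(self, raise_at, clear_at):
        if clear_at > raise_at:
            raise ValueError("clear_at must not exceed raise_at")
        self.raise_at = raise_at   # alarm starts above this value
        self.clear_at = clear_at   # alarm ends below this value
        self.alarmed = False       # the state we have to remember

    def update(self, value):
        """Return 'raised' or 'cleared' on a transition, else None."""
        if not self.alarmed and value > self.raise_at:
            self.alarmed = True
            return "raised"
        if self.alarmed and value < self.clear_at:
            self.alarmed = False
            return "cleared"
        return None                # no change, so no notification


watcher = ThresholdWatcher(raise_at=10, clear_at=8)
events = [watcher.update(v) for v in [9, 11, 12, 9.5, 10.5, 7, 11]]
```

Here the samples 9.5 and 10.5 hover around the upper threshold but
produce no notifications, because the alarm only clears once the value
drops below clear_at.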

-- 
Leif Nixon                                    Systems expert
------------------------------------------------------------
National Supercomputer Centre           Linkoping University
------------------------------------------------------------
