Leif Nixon wrote:
Steven Wagner <[EMAIL PROTECTED]> writes:
Yes, that's what I did last week. It ain't no fun. Nagios' handling
of passive service checks isn't flexible enough. And passive host
checking Just Isn't Done.
Once again, considering you have the source at your disposal, I'm sure you
could work something out. Spackling in passive host checking is easier
than some of the alternatives. :)
The ganglia philosophy so far has been to make things work with a
minimum of tweaking. Having set up three different open source
monitoring systems over the last few years, it seems to me that it's
nearly impossible to set up notifications without a *LOT* of tweaking.
So there's two ways of doing this, I think:
* We need config files. Lots of them. [WHOOSH!] (1 per node?)
* Monitoring thresholds are hard-coded as part of each metric definition.
Well, each metric could certainly come with default thresholds, and if
you use some inheritance mechanism you could rather easily specify
thresholds for all your cluster nodes:
In a per-node model you have to distribute the new config file to n nodes
every time you change something. Which is kind of a bummer, since (as I
mentioned before) it seems that there's always an initial tweaking period
with notifying mechanisms where you're changing the config every five minutes.
On the other hand, this will encourage people in ad-hoc cluster
environments to put together a file distribution mechanism. :)
That way, you only need to specify any exceptions from the defaults.
Whooshy enough?
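
For what it's worth, the "defaults plus exceptions" idea above could be sketched roughly like this in Python. This is only an illustration, not anything ganglia ships: the metric names, the numbers, the three-level lookup (node, then cluster, then metric default), and the thresholds() helper are all assumptions made up for the example.

```python
from collections import ChainMap

# Hypothetical defaults that would ship with each metric definition.
METRIC_DEFAULTS = {
    "load_one": {"warn": 4.0, "crit": 8.0},
    "disk_free_pct": {"warn": 10.0, "crit": 5.0},
}

# Hypothetical cluster-wide exceptions: only what differs from the defaults.
CLUSTER_OVERRIDES = {
    "load_one": {"crit": 16.0},
}

# Hypothetical per-node exceptions: only what differs for that one node.
NODE_OVERRIDES = {
    "node07": {"load_one": {"warn": 12.0}},
}


def thresholds(node, metric):
    """Resolve thresholds for one metric on one node.

    Lookup order is node exceptions, then cluster exceptions, then the
    defaults attached to the metric; the first level that defines a key wins.
    """
    return dict(ChainMap(
        NODE_OVERRIDES.get(node, {}).get(metric, {}),
        CLUSTER_OVERRIDES.get(metric, {}),
        METRIC_DEFAULTS[metric],
    ))


if __name__ == "__main__":
    print(thresholds("node01", "load_one"))
    print(thresholds("node07", "load_one"))
```

A node with no entry of its own just gets the cluster values layered over the metric defaults, so the only things that ever need writing down are the exceptions.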
The mental image I was actually going for was the loading program from The
Matrix, substituting endless streams of configuration directives for racks
o' firearms...
What would seem to take some consideration is how to keep track of the
metacluster state.
That part's easy, it goes in the gmetad config file. gmetad inherits
per-node and per-metric attributes from the data source, but uses
a "generic" section of the gmetad config file for metacluster notification
properties and the data source section of the gmetad config to determine
cluster notification properties.
You need state tracking, since you want flank detection so you trigger
the klaxons only when a node goes down, and *not* every five minutes
during its downtime. And for most metrics you want some hysteresis
mechanism so you don't get continuous notifications if a metric
fluctuates around the threshold.
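
A minimal sketch of that state tracking, in Python, assuming one small state object per node/metric pair. The ThresholdWatcher class, its raise_at/clear_at parameters, and the "raise"/"clear" event names are invented for the example and are not part of gmetad or any existing tool. Flank (edge) detection falls out of only reporting state changes, and the hysteresis is simply the gap between the two levels.

```python
class ThresholdWatcher:
    """Track alarm state for one metric on one node (hypothetical helper).

    An event is reported only on a state change (edge detection), and the
    alarm clears only once the value drops to clear_at, which sits below
    raise_at (hysteresis), so a value hovering near raise_at stays quiet.
    """

    def __init__(self, raise_at, clear_at):
        if clear_at > raise_at:
            raise ValueError("clear_at must not exceed raise_at")
        self.raise_at = raise_at
        self.clear_at = clear_at
        self.alarmed = False

    def update(self, value):
        """Feed one sample; return "raise", "clear", or None (no change)."""
        if not self.alarmed and value >= self.raise_at:
            self.alarmed = True
            return "raise"
        if self.alarmed and value <= self.clear_at:
            self.alarmed = False
            return "clear"
        return None


if __name__ == "__main__":
    watcher = ThresholdWatcher(raise_at=8.0, clear_at=6.0)
    for sample in [5, 8.5, 7.9, 8.2, 9, 5.5, 7]:
        print(sample, watcher.update(sample))
```

Node up/down can be handled by the same object if "down" is fed in as 1 and "up" as 0 with raise_at=1 and clear_at=0: the notification then goes out once when the node drops, and not again on every poll while it stays down.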
*That* stuff needs to be in gmetad (or a program that fills the same niche,
querying one or more metadaemons or monitoring cores, chewing on the XML
data, and doing something with it). Flap thresholds, contact info, etc.,
etc., etc. ...
Sounds like you're volunteering to write it. :P