Leif Nixon wrote:
So, once you've gotten Ganglia to pull in metrics from gazillions of
nodes in umpteen clusters, and got pretty graphs of everything, what
do you use for monitoring the values? I mean, when a machine goes
down, you don't want just a webpage to be updated, you want something
to trigger the klaxons.

I've tried to adapt Nagios (formerly known as Netsaint) for that
purpose, but Nagios doesn't really fit the bill; it's designed to
collect its own monitoring data and is not very happy with just
being fed data from other sources.



As far as I know, there's nothing off the shelf that does this. But the logical place to start would be writing an event handler into gmetad. "It Wouldn't Be That Difficult..." [tee hee]

An alternative would be to totally hack up MRTG or BB (they've rewritten it to not use shell scripts by now, right?) and slap an XML parser onto its data collection end.

And, of course, the direction you're probably already going in - writing an app in Perl (or Python or Java or C or C++ or Pascal or Prolog or Pilot or COBOL or ... ) to connect to gmetad, parse the output, and then fire off a stream of passive updates to Nagios/Netsaint via nsca.
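To make that third option a bit more concrete, here is a minimal Python sketch. It assumes the XML dump nests METRIC elements (with NAME and VAL attributes) inside HOST elements (with a NAME attribute), and it assumes the tab-separated "host, service, return code, output" line format that send_nsca reads on stdin. The threshold table and the function name passive_results are made up for illustration; nothing here is part of Ganglia or Nagios.

```python
import xml.etree.ElementTree as ET

# Hypothetical thresholds: metric name -> (warning, critical); higher is worse.
THRESHOLDS = {"load_one": (4.0, 8.0)}

# Nagios return codes for passive check results.
OK, WARNING, CRITICAL = 0, 1, 2


def passive_results(xml_text, thresholds=THRESHOLDS):
    """Yield one tab-separated passive-check line per monitored metric."""
    root = ET.fromstring(xml_text)
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            name = metric.get("NAME")
            if name not in thresholds:
                continue  # no threshold configured, so nothing to report
            value = float(metric.get("VAL"))
            warn, crit = thresholds[name]
            if value >= crit:
                state = CRITICAL
            elif value >= warn:
                state = WARNING
            else:
                state = OK
            # Assumed send_nsca input: host<TAB>service<TAB>code<TAB>output
            yield f"{host.get('NAME')}\t{name}\t{state}\t{name}={value}\n"
```

The idea would be to read the XML text from gmetad's socket, run it through passive_results(), and pipe the resulting lines into send_nsca. The hard part, as noted below, is where that threshold table comes from.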

The first option (an event handler in gmetad) would be the best overall solution, of course, but the third (a separate app feeding Nagios) may be pleasantly doable. Hey, you could probably grab the old gmetad Perl source and start there...

But then you gotta figure out a neat way to configure the notifier to deal with a variable number of hosts and clusters and metrics.

The Ganglia philosophy so far has been to make things work with a minimum of tweaking. Having set up three different open source monitoring systems over the last few years, it seems to me that it's nearly impossible to set up notifications without a *LOT* of tweaking.

So there are two ways of doing this, I think:

*  We need config files.  Lots of them.   [WHOOSH!]  (1 per node?)
*  Monitoring thresholds are hard-coded as part of each metric definition.
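For the second bullet, a metric definition that carries its own thresholds might look something like the sketch below. The class and field names (MetricDefinition, warn_above, crit_above) are purely hypothetical and are not taken from any existing Ganglia structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricDefinition:
    """A metric definition that carries its own (hypothetical) thresholds."""
    name: str
    units: str
    warn_above: Optional[float] = None  # None means "never warn"
    crit_above: Optional[float] = None  # None means "never go critical"

    def severity(self, value: float) -> str:
        """Classify a sampled value against the built-in thresholds."""
        if self.crit_above is not None and value >= self.crit_above:
            return "critical"
        if self.warn_above is not None and value >= self.warn_above:
            return "warning"
        return "ok"
```

The upside is zero config files; the downside is that every site gets the same thresholds unless there is some way to override them.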

There's also the third way, which I didn't include above because it's more of an on-the-roadmap Ganglia-3-type-of-thing. But, what the heck...

The concept of a "senior node" has been bandied about - the node that's been up the longest, basically. It's assumed that this node has the most accurate state of the cluster. It sends copies of its metric tree definitions to new monitoring cores as they join the network. There's no reason it couldn't also send notification config info. The notification tree could be treated as another metric tree, so the first node to connect that has a notification config file present transmits it to the senior node as if it were a set of new metrics. The two most important keys will have to be "source_IP" and "source_last_modified." :)

Hmm, but what if you wanted/needed to change sources? Should the senior node accept a newer notify tree from any source as long as it has a newer last_modified date? Should it only accept it from the original notify tree source? What if the notify tree goes down - wait for DMAX seconds and then prune the entire tree, starting the wait-for-new-tree process, I guess ... that makes sense.
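Here is one way the senior node's decision could be sketched in Python. It picks the "accept a newer tree from any source" answer to the questions above, which is only one of the possible policies, and it prunes the tree after DMAX seconds of silence. NotifyTree, choose_tree and received_at are invented names for illustration; only the two keys come from the idea above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NotifyTree:
    source_ip: str                # the "source_IP" key
    source_last_modified: float   # the "source_last_modified" key
    received_at: float            # when the senior node last heard this tree


def choose_tree(current: Optional[NotifyTree],
                offered: Optional[NotifyTree],
                now: float,
                dmax: float) -> Optional[NotifyTree]:
    """Return the notify tree the senior node should hold after this step."""
    # Prune a tree whose source has been silent for more than dmax seconds.
    if current is not None and now - current.received_at > dmax:
        current = None
    if current is None:
        return offered  # wait-for-new-tree: take whatever shows up first
    if offered is None:
        return current
    # Assumed policy: any source wins if its config is strictly newer.
    if offered.source_last_modified > current.source_last_modified:
        return offered
    # The same source re-announcing keeps the current tree alive.
    if offered.source_ip == current.source_ip:
        current.received_at = offered.received_at
    return current
```

Restricting updates to the original source would just mean adding a source_ip check to the "strictly newer" branch.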

It's definitely not the monitoring core's place to *do* the notifying. But there's no reason it can't supply information provided by the metric/node/cluster maintainer about notification thresholds.

Seems to me that would be pretty snazzy. A new host joins the cluster and the notifying program already knows how to deal with it, without you having to tweak a config file and reset it. (that one sentence was the goal of the last six paragraphs of nonsense)

OK, back to slaving over a hot keyboard.

