Re: [Ganglia-general] Monitoring

Steven Wagner Fri, 04 Oct 2002 10:14:14 -0700

Leif Nixon wrote:

So, once you've gotten Ganglia to pull in metrics from gazillions of
nodes in umpteen clusters, and got pretty graphs of everything, what
do you use for monitoring the values? I mean, when a machine goes
down, you don't want just a webpage to be updated, you want something
to trigger the klaxons.


I've tried to adapt Nagios (formerly known as Netsaint) for that
purpose, but Nagios doesn't really fit the bill; it's designed to
collect its own monitoring data and is not very happy with just
being fed data from other sources.

As far as I know, there's nothing off the shelf that does this. But the logical place to start would be writing an event handler into gmetad. "It Wouldn't Be That Difficult..." [tee hee]

An alternative would be to totally hack up MRTG or BB (they've rewritten it to not use shell scripts by now, right?) and slap an XML parser onto its data collection end.

And, of course, the direction you're probably already going in - writing an app in Perl (or Python or Java or C or C++ or Pascal or Prolog or Pilot or COBOL or ... ) to connect to gmetad, parse the output, and then fire off a stream of passive updates to Nagios/Netsaint via nsca.
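To make that third option a bit more concrete, here is a rough Python sketch of the "parse the output, emit passive updates" half. It is only an illustration under stated assumptions: the HOST/METRIC element names and NAME/VAL attributes are what I understand gmetad's XML to look like, the threshold table is entirely made up, and the sample document is hand-written rather than captured from a real gmetad. Each output line uses the tab-separated host, service, return code, plugin output layout that send_nsca reads on stdin.

```python
import xml.etree.ElementTree as ET

# Hypothetical thresholds (not part of Ganglia): metric -> (warning, critical)
THRESHOLDS = {"load_one": (4.0, 8.0)}

# Nagios/Netsaint service return codes
OK, WARNING, CRITICAL = 0, 1, 2

# Hand-written sample, assumed to resemble gmetad's XML output
SAMPLE = """
<GANGLIA_XML>
  <CLUSTER NAME="demo">
    <HOST NAME="node1">
      <METRIC NAME="load_one" VAL="0.5"/>
    </HOST>
    <HOST NAME="node2">
      <METRIC NAME="load_one" VAL="9.25"/>
      <METRIC NAME="cpu_num" VAL="2"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
"""


def passive_results(xml_text, thresholds=THRESHOLDS):
    """Yield one tab-separated passive check line per monitored metric."""
    root = ET.fromstring(xml_text)
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            name = metric.get("NAME")
            if name not in thresholds:
                continue  # no threshold configured, nothing to report
            try:
                value = float(metric.get("VAL"))
            except (TypeError, ValueError):
                continue  # skip non-numeric or missing values
            warn, crit = thresholds[name]
            if value >= crit:
                code = CRITICAL
            elif value >= warn:
                code = WARNING
            else:
                code = OK
            yield f"{host.get('NAME')}\t{name}\t{code}\t{name}={value}"


if __name__ == "__main__":
    for line in passive_results(SAMPLE):
        print(line)
```

The printed lines could then be piped into send_nsca. Fetching the XML from gmetad over the network is left out on purpose, since the host and port depend on your setup.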

#1 would be the best overall solution of course, but #3 may be pleasantly doable. Hey, you could probably grab the old gmetad perl source and start there...

But then you gotta figure out a neat way to configure the notifier to deal with a variable number of hosts and clusters and metrics.

The ganglia philosophy so far has been to make things work with a minimum of tweaking. Having set up three different open source monitoring systems over the last few years, it seems to me that it's nearly impossible to set up notifications without a *LOT* of tweaking.


So there's two ways of doing this, I think:

*  We need config files.  Lots of them.   [WHOOSH!]  (1 per node?)
*  Monitoring thresholds are hard-coded as part of each metric definition.

There's also the third way, which I didn't include above because it's more of an on-the-roadmap Ganglia-3-type-of-thing. But, what the heck...

The concept of a "senior node" has been bandied about - the node that's been up the longest, basically. It's assumed that this node has the most accurate state of the cluster. It sends copies of its metric tree definitions to new monitoring cores as they join the network. There's no reason it couldn't also send notification config info as well. The notification tree could be treated as another metric tree, so the first node to connect that has a config file present for notification transmits it to the senior node as if they were new metrics. The two most important keys will have to be "source_IP" and "source_last_modified." :)

Hmm, but what if you wanted/needed to change sources? Should the senior node accept a newer notify tree from any source as long as it has a newer last_modified date? Should it only accept it from the original notify tree source? What if the notify tree goes down - wait for DMAX seconds and then prune the entire tree, starting the wait-for-new-tree process, I guess ... that makes sense.
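For what it's worth, the accept-or-reject question above fits in a few lines of Python. This is purely a thought experiment, not anything that exists in Ganglia: the NotifyTree class, the should_accept function, and the pin_source switch are all names invented here, and the only ideas taken from the discussion are the two keys, the "newer last_modified wins" rule, and pruning after DMAX seconds of silence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NotifyTree:
    """Hypothetical record of the notification tree held by the senior node."""
    source_ip: str
    source_last_modified: float  # seconds since the epoch
    last_heard: float            # when the source was last heard from


def should_accept(current: Optional[NotifyTree], offered: NotifyTree,
                  now: float, dmax: float, pin_source: bool = False) -> bool:
    """Decide whether the senior node should replace its notify tree."""
    # No tree yet, or the old one has been silent for more than DMAX seconds:
    # the tree is pruned and we are back in the wait-for-new-tree state.
    if current is None or now - current.last_heard > dmax:
        return True
    # Stricter policy: only the original source may update the tree.
    if pin_source and offered.source_ip != current.source_ip:
        return False
    # Otherwise a strictly newer last_modified wins, whatever the source.
    return offered.source_last_modified > current.source_last_modified
```

With pin_source left off, any node holding a newer config file can take over; with it on, a different source only gets in after the original has been quiet for DMAX seconds.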

It's definitely not the monitoring core's place to *do* the notifying. But there's no reason it can't supply information provided by the metric/node/cluster maintainer about notification thresholds.

Seems to me that would be pretty snazzy. A new host joins the cluster and the notifying program already knows how to deal with it, without you having to tweak a config file and reset it. (that one sentence was the goal of the last six paragraphs of nonsense)


OK, back to slaving over a hot keyboard.
