Hi,

I've been working on an HPC system recently (~600 nodes) which uses a
combination of Ganglia and Nagios -- Ganglia for metrics gathering and
Nagios for generating alerts (more or less). One of the constraints on the
monitoring implementation was to ensure that any monitoring agents had as
small a footprint as possible on the client systems, which is why Ganglia's
gmond was chosen over the NRPE / Nagios plugin combination. We chose this
implementation based on the availability of the ganglia-nagios plugin, but
found that, because of the way it is implemented, it quickly choked the life
out of the monitoring server: 5500 probes over a 5 minute interval, each of
which downloads a full copy of the XML file from the gmond collector's
accept channel, is quite an overhead. The gmond collector was quickly
overwhelmed and we started seeing gaps in the graphs presented by the web UI
-- the RRD files were being starved.

So I had a quick look at pulling data directly from the RRD files instead --
I wrote a shell script that used rrdtool to extract the last entry from the
DB file and present the result to Nagios. It made a huge difference to the
performance of the monitoring server and restored Ganglia to its normal
working condition. And Nagios can now generate alerts based on the
information gathered by gmond.
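
For anyone reading this without the attachment, here is a rough Python sketch
of the general idea. It is not the attached script. It assumes that
"rrdtool lastupdate" prints a header line of data source names followed by a
"timestamp: value" line, that the first data source is the one of interest,
and that higher values are worse. The function names, the 300 second
staleness limit and the example path are my own placeholders.

```python
import math
import subprocess
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_lastupdate(text):
    """Return (timestamp, value) from the text printed by `rrdtool lastupdate`.

    Assumes a data line of the form "1268924445: 0.25"; the header line
    holding the data source names contains no colon and is skipped.
    """
    for line in text.splitlines():
        stamp, sep, rest = line.partition(":")
        if sep and stamp.strip().isdigit() and rest.split():
            return int(stamp), float(rest.split()[0])
    raise ValueError("no data line found in rrdtool output")


def evaluate(value, age, warn, crit, max_age=300):
    """Map a sample to a (Nagios status, message) pair.

    max_age is a hypothetical staleness limit in seconds.
    """
    if math.isnan(value) or age > max_age:
        return UNKNOWN, "UNKNOWN - sample is stale or missing"
    if value >= crit:
        return CRITICAL, "CRITICAL - value %.2f" % value
    if value >= warn:
        return WARNING, "WARNING - value %.2f" % value
    return OK, "OK - value %.2f" % value


def main(argv):
    """Check one RRD file: argv is [prog, rrd_path, warn, crit]."""
    if len(argv) != 4:
        print("UNKNOWN - usage: check <rrd file> <warn> <crit>")
        return UNKNOWN
    try:
        warn, crit = float(argv[2]), float(argv[3])
        out = subprocess.check_output(["rrdtool", "lastupdate", argv[1]])
        stamp, value = parse_lastupdate(out.decode())
    except (OSError, subprocess.CalledProcessError, ValueError) as exc:
        print("UNKNOWN - %s" % exc)
        return UNKNOWN
    status, message = evaluate(value, time.time() - stamp, warn, crit)
    print(message)
    return status
```

A real plugin would end with sys.exit(main(sys.argv)) so that Nagios sees the
status as the exit code, and would be pointed at a file such as
/var/lib/ganglia/rrds/<cluster>/<host>/load_one.rrd (that path is the usual
default but may differ on your install). The point is that each probe reads
one small local file instead of pulling the whole XML dump from gmond.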

I've attached a copy of this script along with a Python equivalent in the
hope that it might prove useful to other cluster admins. It works well in
our environment -- the shell script ran for about a month before we
switched over to the Python version (which has been in place for about 2
weeks).

Regards,

Malcolm.

Attachment: check_ganglia_rrd.tgz
Description: GNU Zip compressed data

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general
