On Thu, 21 Nov 2002, matt massie wrote:

> gmond isn't built to have multiple clusters in the xml output but gmetad
> is specifically designed for that.
> just put your smaller clusters on unique multicast channels (e.g. compute
> cluster on "239.2.11.70" and tape robots on "239.2.11.71" etc etc)
> then just use gmetad to pull each of the sources together into a single
> xml image with multiple cluster groups.

Thanks for that Matt, that's what I suspected. I've rearranged things here so that I can work that way for now, and have been able to collect stats for a couple of small groups.

Unfortunately, I'm now hitting the same problem as Sumanth Jannyavula-Venk, who posted here on 22nd October - see

http://sourceforge.net/mailarchive/forum.php?thread_id=1219650&forum_id=7186

I have a cluster currently containing about 230 dual-CPU nodes, and as soon as gmetad is pointed at it, the load on the gmetad box rockets. It appears that it's the rrd updates which are causing it, because the CPU is still mainly idle. The particular partition holding the rrdbs is on ext3 (with data=ordered) in case that matters (kjournald keeps joining gmetad at the top of 'top's output).

Also potentially of interest, the following messages keep getting sent to syslog:

Nov 28 14:08:41 kick /usr/sbin/gmetad[28197]: RRD_update: illegal attempt to update using time 1038492521 when last update time is 1038492521 (minimum one second step)
Nov 28 14:08:41 kick /usr/sbin/gmetad[28197]: summary_RRD_update: illegal attempt to update using time 1038492521 when last update time is 1038492521 (minimum one second step)

Strangely, although the timestamps of the rrdbs are being continuously updated, the web frontend is showing no stats except for a single entry or two near the beginning of the gmetad run.

A valid XML feed can be obtained from the boxes of the big cluster, although it's worth noting that it's nearly 1MB in size! The gmond boxes appear to be carrying on happily.

[Using ganglia 2.5.1 on RH72 boxes, gmetad box is single-CPU 1.4GHz Athlon]

I hope I've provided enough information to help pin down the problem!

Regards,
Phil
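
In case it helps anyone tallying these messages: the syslog lines above show an update being attempted with the same timestamp as the previous one, which rrdtool rejects because it requires each update to be at least one second later than the last. Below is a minimal Python sketch that pulls the two timestamps out of such a line. The names parse_illegal_update and is_same_second are hypothetical helpers of my own, not part of ganglia or rrdtool, and the pattern assumes the message wording is exactly as quoted above.

```python
import re

# Pattern for the complaint quoted in the syslog lines above.
# Assumption: the message wording matches those lines exactly.
ILLEGAL_UPDATE = re.compile(
    r"(?P<func>\w+): illegal attempt to update using time (?P<new>\d+) "
    r"when last update time is (?P<last>\d+)"
)


def parse_illegal_update(line):
    """Return (function, new_time, last_time), or None if the line does not match."""
    m = ILLEGAL_UPDATE.search(line)
    if m is None:
        return None
    return m.group("func"), int(m.group("new")), int(m.group("last"))


def is_same_second(line):
    """True if the rejected update carried the same timestamp as the last one."""
    parsed = parse_illegal_update(line)
    return parsed is not None and parsed[1] == parsed[2]


# One of the two lines from the report, copied verbatim.
SAMPLE = (
    "Nov 28 14:08:41 kick /usr/sbin/gmetad[28197]: RRD_update: illegal attempt "
    "to update using time 1038492521 when last update time is 1038492521 "
    "(minimum one second step)"
)

if __name__ == "__main__":
    print(parse_illegal_update(SAMPLE))
    print(is_same_second(SAMPLE))
```

Run over a whole syslog file, this would only show how often same-second updates are being attempted; it says nothing by itself about why gmetad issues them.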

