On Mon, 2 Dec 2002, Thomas Davis wrote:

> Phil Radden wrote:
> > I have a cluster currently containing about 230 dual-CPU nodes, and
> > as soon as gmetad is pointed at it, the load on the gmetad box
> > rockets. It appears that it's the rrd updates which are causing it,
> > because the CPU is still mainly idle. The particular partition
> > holding the rrdbs is on ext3 (with data=ordered) in case that matters
> > (kjournald keeps joining gmetad at the top of 'top's output).
>
> change this to data=writeback
>
> [snip]
>
> and the load will drop to nothing.
Thanks for this Thomas, but unfortunately, although it does (slightly) improve the situation, the change is not significant... The problem is certainly the disk I/O though; I've managed to get the monitoring up and running nicely on a ramdisk, and the load does indeed drop to zero as you suggest.

So my next query: is there a good strategy for getting reliable snapshots of rrd databases, so that I can run on a ramdisk but keep a persistent copy, only a minute or five out of date, which I can go back to if the monitoring box goes down? I note from the archives that others have gone the ramdisk route, so this is presumably a solved problem...

Whilst I understand if the original problem I suffered is just put down to 'naff hardware' and ignored, it might be interesting to note that although the box was brought to its knees trying to handle the load, the rrd databases only rarely ever got anything into them - in a day-long run, I only had about fifty data points stored! It would be great if, in the situation where the full load can't be handled, at least some updates would be completed... I'm guessing that the rrd call is being timed out, whereas letting this one complete and skipping the next one might be preferable!
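For the snapshot question above, one possible approach (just a minimal sketch, not something I've tested against gmetad - the paths are assumptions, and it doesn't coordinate with gmetad, so a file copied mid-update could in principle be slightly inconsistent) would be a small script run from cron every few minutes that mirrors the rrds from the ramdisk to a persistent directory, copying to a temp file and renaming so a crash never leaves a truncated rrd behind:

    #!/usr/bin/env python
    # Sketch: mirror rrd files from a ramdisk to a persistent directory,
    # so a copy at most a few minutes old survives a reboot.
    import os
    import shutil

    RAMDISK_DIR  = "/mnt/ramdisk/rrds"              # where gmetad writes (assumed)
    SNAPSHOT_DIR = "/var/lib/ganglia/rrd-snapshot"  # persistent copy (assumed)

    def snapshot():
        for dirpath, dirnames, filenames in os.walk(RAMDISK_DIR):
            rel = os.path.relpath(dirpath, RAMDISK_DIR)
            dest_dir = os.path.join(SNAPSHOT_DIR, rel)
            if not os.path.isdir(dest_dir):
                os.makedirs(dest_dir)
            for name in filenames:
                if not name.endswith(".rrd"):
                    continue
                src = os.path.join(dirpath, name)
                dest = os.path.join(dest_dir, name)
                tmp = dest + ".tmp"
                # Copy to a temp file, then rename, so the snapshot
                # directory never holds a half-written rrd.
                shutil.copy2(src, tmp)
                os.rename(tmp, dest)

    if __name__ == "__main__":
        snapshot()

The other half would be a boot-time step that copies the snapshot back onto the ramdisk before gmetad starts, so at worst you lose the few minutes since the last cron run.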

