On Mon, 2 Dec 2002, Thomas Davis wrote:
> Phil Radden wrote:
> > I have a cluster currently containing about 230 dual-CPU nodes, and
> > as soon as gmetad is pointed at it, the load on the gmetad box
> > rockets.  It appears that it's the rrd updates which are causing it,
> > because the CPU is still mainly idle.  The particular partition
> > holding the rrdbs is on ext3 (with data=ordered) in case that matters
> > (kjournald keeps joining gmetad at the top of 'top's output).
> 
> change this to data=writeback
> [snip]
> and the load will drop to nothing.

Thanks for this, Thomas, but unfortunately, although it does improve the 
situation slightly, the change is not significant...

The problem is certainly the disk I/O, though; I've managed to get the 
monitoring up and running nicely with the rrds on a ramdisk, and the load 
does indeed drop to zero as you suggest.
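For concreteness, the ramdisk here is just a tmpfs mount pointed at by
gmetad; the mount point and size below are examples, not the values I'd
insist on:

```shell
# Mount a small tmpfs for the rrds.  128m is a guess at a comfortable
# size for a few hundred hosts' worth of databases; adjust to taste.
mount -t tmpfs -o size=128m tmpfs /mnt/ramdisk

# Then point gmetad's rrd_rootdir at it, e.g. in gmetad.conf:
#   rrd_rootdir "/mnt/ramdisk/rrds"
```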

So my next query: is there a good strategy for taking reliable snapshots 
of rrd databases, so that I can run off a ramdisk but keep a persistent 
copy, only a minute or five out of date, to fall back on if the 
monitoring box goes down?  I note from the archives that others have gone 
the ramdisk route, so this is presumably a solved problem...
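The sort of thing I have in mind is a copy-then-rename run from cron, so
the persistent copy is swapped into place whole and a crash mid-copy
never leaves it half-written.  A sketch (all paths and names below are
made up for illustration, not anything gmetad provides):

```shell
#!/bin/sh
# snapshot_rrds SRC DST
# Copy the rrd tree from SRC (the ramdisk) into a temporary directory
# beside DST, then rename it into place.  The previous snapshot is kept
# as DST.old, so even a crash between the two mvs leaves a usable copy.
snapshot_rrds() {
    src=$1; dst=$2; tmp=$dst.tmp.$$
    mkdir -p "$tmp" || return 1
    cp -a "$src/." "$tmp/" || { rm -rf "$tmp"; return 1; }
    rm -rf "$dst.old"                     # drop the snapshot before last
    [ -d "$dst" ] && mv "$dst" "$dst.old" # keep the previous one around
    mv "$tmp" "$dst"
}
```

Saved as a script and run from cron every minute or five, e.g.
`*/5 * * * * root /usr/local/sbin/snapshot_rrds /mnt/ramdisk/rrds
/var/lib/ganglia/rrds`, with the reverse copy done once at boot to seed
the ramdisk before gmetad starts.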

Whilst I understand if the original problem I suffered is just put down to 
'naff hardware' and ignored, it might be interesting to note that although 
the box was brought to its knees trying to handle the load, the rrd 
databases only rarely got anything written into them - in a day-long run, 
I only had about fifty data points stored!  It would be great if, when the 
full load can't be handled, at least some updates were completed...  I'm 
guessing that each rrd call is being timed out, whereas letting one 
complete and skipping the next might be preferable!

