Jason A. Smith wrote:
Some sort of RAM disk is probably the only thing that will be able to
handle the rrd I/O load and allow gmetad to monitor more than a few
hundred nodes. If your gmetad node is running Linux then I would
suggest using tmpfs which basically implements a POSIX filesystem in the
kernel's VFS.
All you have to do is decide how much space ganglia requires and add a
line like the following to /etc/fstab and mount it:
none /var/lib/ganglia/rrds tmpfs size=1024M,mode=755,uid=nobody,gid=nobody 0 0
The documentation for tmpfs is located here:
/usr/src/linux/Documentation/filesystems/tmpfs.txt
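
Once the filesystem is mounted, it may be worth confirming that the rrd
directory really sits on tmpfs. A minimal sketch in Python, assuming the
usual /proc/mounts layout on Linux (device, mount point, type, options);
the function name fs_type is made up for illustration:

```python
from typing import Optional


def fs_type(mount_point: str, mounts_text: str) -> Optional[str]:
    """Return the filesystem type mounted at mount_point, or None.

    mounts_text is the content of /proc/mounts (one mount per line).
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[1] == mount_point:
            found = fields[2]  # a later mount on the same path wins
    return found
```

To use it, read /proc/mounts into a string and call
fs_type("/var/lib/ganglia/rrds", text); the result should be "tmpfs" if
the fstab entry took effect.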
I am using this with ganglia to monitor over 1300 nodes, split into 10
clusters, with a single gmetad and the load is fairly light. It is only
using about 435MB of RAM to store all of the rrd files, or about
340kB/node. Have you added extra gmetric data to reach 150MB for 275
nodes (559kB each)?
Yes, we also distribute some metrics on our InfiniBand network cards
and such. And it appears to be closer to 135 MB after making a more
precise calculation ;) (~13 kB per metric, 43 metrics per host, 275
hosts in the cluster)
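
The sizing arithmetic in this thread can be checked with a few lines of
Python. This is only a sketch: the function names are made up, 1 MB is
taken as 1024 kB, and the inputs are just the figures quoted above.

```python
def per_node_kb(total_mb: float, nodes: int) -> float:
    """Average rrd storage per node in kB (1 MB = 1024 kB)."""
    return total_mb * 1024 / nodes


def cluster_mb(kb_per_metric: float, metrics_per_host: int, hosts: int) -> float:
    """Total rrd storage in MB for a whole cluster."""
    return kb_per_metric * metrics_per_host * hosts / 1024


if __name__ == "__main__":
    print(per_node_kb(435, 1300))   # 1300 nodes in 435 MB
    print(per_node_kb(150, 275))    # 275 nodes in 150 MB
    print(cluster_mb(13, 43, 275))  # rounded per-metric figure
```

With the rounded 13 kB figure the product comes out near 150 MB; a total
of 135 MB corresponds to a little under 12 kB per metric.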
To save the data in case of a system crash, I just patched the gmetad
init script to back up and restore the rrds with tar when it is stopped
and started, and use a daily cron job to restart gmetad every night. I
stop gmetad before backing up; otherwise tar complains that one or more
files changed while being read. I have attached the init patch that I
use in case you are interested.
~Jason
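
The stop/backup and restore/start steps described above could look
roughly like the following. This is not the attached init patch, only a
sketch of the same idea using Python's tarfile module; the function names
and the archive path are assumptions, and backup_rrds is meant to be
called only after gmetad has been stopped.

```python
import os
import tarfile


def backup_rrds(rrd_dir: str, archive: str) -> None:
    """Pack rrd_dir into a gzip-compressed tar archive on disk."""
    tmp = archive + ".tmp"
    with tarfile.open(tmp, "w:gz") as tar:
        # arcname="." stores member paths relative to rrd_dir
        tar.add(rrd_dir, arcname=".")
    # Rename into place only once the archive is complete, so a crash
    # during the backup leaves the previous archive untouched.
    os.replace(tmp, archive)


def restore_rrds(archive: str, rrd_dir: str) -> None:
    """Unpack archive into rrd_dir, e.g. a freshly mounted tmpfs."""
    if not os.path.exists(archive):
        return  # first start: nothing to restore yet
    os.makedirs(rrd_dir, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Ownership and modes of the restored files are not checked here.
        tar.extractall(rrd_dir)
```

An init script would call restore_rrds before starting gmetad and
backup_rrds after stopping it, with the archive kept on a real disk.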
So if there is a kernel panic or similar (let's hope not) you only have
the data from your last backup, which was on startup or at midnight,
right? Aren't you losing big timeframes in case a crash occurs? I have
been thinking about a ramdisk too, but am somewhat reluctant because of
the data loss in case of a crash. How can I maintain a recent (backup)
copy on disk, so that as little data as possible is lost in a crash?
I am also interested because for a project of mine I have to archive the
rrd data for the historical tracking of compute jobs and their
performance. However, copying 135 MB worth of rrd files every hour, in
less than 15 seconds (the metric interval), is hardly doable. Using a
ramdisk would certainly speed things up. I am tempted, though, to write
another tool that archives the data directly from the multicast channel,
but this would mean even more disk access (and load).
- Ramon.