Happy to report that the tmpfs solution works like a charm! Unfortunately, a couple of our subnets were taken down for upgrades yesterday, so I'm only watching about 2500 nodes instead of the full 3000, but my Ganglia server is cruising along...average load is about 0.5...highest peak has been 1.06. Memory and network aren't an issue at all. The full 3000 won't be a problem, and I'm sure we've even got lots of room to grow. Awesome!
Huge thanks to Matt Massie and Jason Smith for all their help!

Steve Gilbert
Unix Systems Administrator
[EMAIL PROTECTED]

-----Original Message-----
From: matt massie [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 17, 2003 5:02 PM
To: Steve Gilbert
Cc: '[email protected]'
Subject: Re: [Ganglia-general] Ganglia architecture and gmond load

steve-

the single biggest problem with scaling gmetad is disk i/o. what type of
filesystem are you writing the gmetad RRDs to? most people have had very
good luck using a RAM-based filesystem and then periodically syncing the
data to disk. for example, in linux:

% mount -t tmpfs tmpfs /mnt

now the /mnt directory is a RAM-backed filesystem. if the machine is
rebooted, however, all the data is lost, so you will need to write the
contents of that filesystem to disk every now and then.

-matt

Today, Steve Gilbert wrote forth saying...

> From: Steve Gilbert <[EMAIL PROTECTED]>
> To: "'[email protected]'" <[email protected]>
> Date: Wed, 17 Sep 2003 16:45:26 -0700
> Subject: [Ganglia-general] Ganglia architecture and gmond load
>
> Hi folks,
>
> I don't know if I'm just trying to push Ganglia to more than it can
> handle or if I'm doing something wrong, but no matter how I design my
> Ganglia structure, gmetad seems to always crush the machine where it
> runs. Here's an overview of my environment:
>
> Ganglia 2.5.4
> All hosts involved are running RedHat 7.2
> RRDtool version 1.0.45
>
> I have 16 subnets, each with 200 machines give or take a few. I
> estimate around 3000 nodes total. Some of these are dual P3, some are
> single P4, and a few are random Xeon and Itanium nodes. Every node is
> running gmond, and that's running fine.
>
> Each subnet has a "master" node that is a dual P3 1.3GHz. This box
> provides DNS, NIS, and static DHCP for the subnet. Normal load on
> these machines is very, very minimal.
>
> My first attempt was to set up a single dedicated Ganglia machine
> running gmetad, Apache, and the web frontend.
> In this machine's gmetad.conf file, I listed each of the "master"
> nodes in the subnets as data sources. I thought having one box collect
> all the data and store the RRD files would be great. Well, this was a
> bad idea...the box (a P4 with 2GB RAM) was absolutely crushed...load
> shot up to 8.5, and all the graphs continually had gaps in them.
>
> So my next attempt was to install gmetad on each of the "master"
> nodes. I would have this gmetad collect data for the subnet, and then
> run another gmetad on my Ganglia web machine to just talk to these 16
> other gmetads. I don't really like having to back up 16 machines now,
> but I've had problems before with trying to store RRD files on an NFS
> mount, so I decided not to try that. This isn't working all that
> great, either...the gmetad on these "master" nodes (collecting data
> from ~200 hosts each) is also causing a pretty high load...the boxes
> now stay around 2-3 load points all the time and sometimes slow down
> other operations on the box.
>
> Am I doing something wrong, or is gmetad really this much of a
> resource hog? Anyone else trying to use Ganglia to monitor 3000
> machines? Am I asking too much? Thanks for any insight.
>
> Steve Gilbert
> Unix Systems Administrator
> [EMAIL PROTECTED]
>
> _______________________________________________
> Ganglia-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
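For anyone finding this thread later, matt's tmpfs-plus-periodic-sync advice can be sketched as a small shell routine. All paths here are illustrative (the thread only mentions /mnt), and the mount itself needs root, so it's shown as a comment:

```shell
#!/bin/sh
# Sketch of the RAM-backed RRD setup matt describes; paths are
# hypothetical, not from the thread.  The mount would normally go in
# rc.local or an init script:
#
#   mount -t tmpfs -o size=512m tmpfs /mnt/rrds
#
# gmetad then writes its RRDs under the tmpfs mount.  What you run
# periodically (e.g. from cron every 15 minutes) is a plain copy of the
# RAM-backed tree onto real disk, so a reboot loses at most one
# interval of history.  For this demo we stand in temp directories for
# the two sides so the script runs unprivileged.
set -e
RAM_DIR=$(mktemp -d)              # stands in for the tmpfs mount
DISK_DIR=$(mktemp -d)/rrd-backup  # stands in for the on-disk copy

# simulate gmetad having written an RRD into the RAM side
echo "fake rrd" > "$RAM_DIR/node01.rrd"

# the periodic sync step; -a preserves timestamps and permissions
mkdir -p "$DISK_DIR"
cp -a "$RAM_DIR/." "$DISK_DIR/"

ls "$DISK_DIR"
```

On boot you would do the copy in the other direction (disk back into the fresh, empty tmpfs) before starting gmetad, since tmpfs always starts empty.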

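The two-tier layout Steve ended up with maps onto gmetad.conf roughly as below. Hostnames and cluster names are hypothetical; 8649 is gmond's default port and 8651 is the default gmetad xml_port:

```
# Tier 1 -- on each subnet "master" node: poll the local gmond cluster.
data_source "subnet01" master01.example.com:8649

# Tier 2 -- on the central web box: poll the 16 tier-1 gmetads on
# their xml_port instead of polling 3000 gmonds directly.
data_source "subnet01" master01.example.com:8651
data_source "subnet02" master02.example.com:8651
# ... one line per subnet master ...
```

The point of the hierarchy is that the central box parses 16 pre-aggregated XML streams rather than thousands, while each tier-1 gmetad's RRD writes stay local (and, per this thread, on tmpfs).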
