Joel,

I have replied to your questions below. Just a little background: my setup monitors almost 600 hosts. I also wanted to measure many metrics beyond the default 30 or so that are built into Ganglia, so I hacked the Ganglia source to compile in a few extras and added some cron jobs to report others. I estimate that I am monitoring about 85 metrics per host.
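
For what it's worth, the cron-based metrics are just gmetric calls. A crontab entry along these lines is all it takes (the metric shown here is purely made up):

    */5 * * * * gmetric --name=logged_in_users --value="$(who | wc -l)" --type=uint16 --units=users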


- Put gmetad rrd files on a ramdisk.


Definitely. When I originally set things up, I had the rrds on an ext3 filesystem. The load on the system was always around 4. I tweaked some of the VM settings in /proc and got the load down to around 3, but that was the best I could do. Turns out that the real killer was the journaling. I put the VM settings back to their default values and remounted the filesystem as ext2. The load dropped to about 0.5! Huge improvement.
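
If you go that route, the switch itself is trivial, since an ext3 filesystem can be mounted as ext2 (the journal is simply ignored). Roughly, with a made-up device and mount point:

    umount /ganglia/rrds
    mount -t ext2 /dev/sdb1 /ganglia/rrds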

But since we had plans to put even more stuff on the server, I decided to play it safe and write to a tmpfs filesystem. For my setup, the metrics use about 600 MB of RAM. I rsync it to disk every 10 minutes, and I modified the start/stop init scripts to restore/save the data. I think when I moved to a ram-based filesystem, the load from Ganglia dropped to something like 0.2 or 0.3. (I don't remember the exact numbers, but I do know that we have since added Nagios on the same system to monitor about 670 hosts and 850 services, and the average load on that system is still only around 0.5 or less.)
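
The pieces are nothing fancy; something like this (the size, paths, and interval are only examples, not exactly what I run):

    # /etc/fstab: keep the rrd directory in RAM
    tmpfs  /var/lib/ganglia/rrds  tmpfs  size=1024m  0 0

    # crontab: flush the rrds back to disk every 10 minutes
    */10 * * * *  rsync -a --delete /var/lib/ganglia/rrds/ /var/lib/ganglia/rrds.disk/

The init script then just rsyncs in the opposite direction on "start" and does one final copy out to disk on "stop".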


- Use TCP polling queries instead of UDP or Multicast push. (disable UDP/multicast pushing) I'd prefer to let gmetad poll instead of having 1000 UDP messages flying around on odd intervals. A good practice?


Since gmond can't poll, I assume you are talking about having gmetad query every gmond to gather the data? If so, I doubt that's a good idea.

First of all, every node would appear to be its own cluster, so you wouldn't be able to see graphs for all nodes on the same page or see nice summary graphs for the entire cluster.

Second, (assuming the default poll rate) gmetad would be trying to set up and tear down 1000 TCP connections every 15 secs. Sure, you could change that poll rate (more on that below), but I think UDP is a much better way to handle it. I would suggest, however, that you consider using UDP unicast instead of multicast. It would cut down on the number of packets on your network. Even though a single metric packet is only about 60-70 bytes, I don't see the need to send around any extra data that I don't have to. The multicast method does provide you with redundancy (since every node knows the state of every other node), but if that is a concern, just unicast the packets to two hosts instead of one.
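
In gmond.conf terms that is just two send channels per node. Assuming the Ganglia 3.x config syntax (the 2.5 series is different) and made-up hostnames, something like:

    udp_send_channel {
      host = ganglia1.example.com
      port = 8649
    }
    udp_send_channel {
      host = ganglia2.example.com
      port = 8649
    }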

Richard Grevis made a comment in a previous reply about possibly having the nodes unicast their metrics back to the server running gmetad so that the gmetad process only had to connect to the loopback interface to retrieve data. He said that he hadn't tried it so he wasn't sure if it would work well. Fortunately, I can tell you for certain that it does since that is how I run my setup. We have a couple of clusters with about 250 nodes each as well as several smaller ones. On our server, I run a separate gmond process for each cluster. The nodes report their metrics to these gmonds. The gmetad process then contacts them to get its data. It seems to work quite well.
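
In case it helps, the server side of that boils down to something like this (the paths and ports here are only examples, not my actual config):

    # one gmond per cluster on the server, each with its own config and port
    gmond -c /etc/ganglia/gmond-cluster-a.conf   # recv/accept channels on port 8650
    gmond -c /etc/ganglia/gmond-cluster-b.conf   # recv/accept channels on port 8655

    # gmetad.conf then only ever talks to the loopback interface
    data_source "cluster-a" 127.0.0.1:8650
    data_source "cluster-b" 127.0.0.1:8655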

(I have one other comment on using UDP. It is a rather long story, so I included it at the end.)



- Alter timers for lighter network load? examples? ideas? Was going to just go to 30 or 60s timers in gmetad.conf cluster definition to start.


That could definitely help. I still use gmetad's default 15 secs because I haven't had any problems with it so far. But I have done some experimenting to see how easily I could change that if it ever becomes an issue in the future.
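
For reference, the polling interval is just the optional number after the cluster name in the data_source line, so a 60 second poll would look roughly like this (host and port made up):

    data_source "my cluster" 60 127.0.0.1:8650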

Plus there is the fact that even if gmetad gets information every 15 secs, that doesn't mean the gmonds have to report it every 15 secs :-) You should certainly take a look at what metrics are being collected and decide which ones are truly useful to you. Keep those and disable the rest. And for those that you do keep, look at the default reporting intervals. You may find that you don't need to report the metrics as often, or perhaps the thresholds can be set higher so that only larger changes are reported.
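
The knobs for that live in gmond.conf. Assuming the Ganglia 3.x syntax again, a collection_group stanza along these lines is the general idea (the numbers are only illustrative):

    collection_group {
      collect_every = 60        # sample once a minute
      time_threshold = 300      # but always send at least every 5 minutes
      metric {
        name = "load_one"
        value_threshold = "1.0" # send sooner only if the value moves by more than this
      }
    }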


- Consider "federating"? Create groups of 100 gmond hosts managed by single gmetas, all linking up to a core gmetad.


That is certainly a possibility. Ganglia makes it easy to treat 1000 nodes as one cluster or as ten 100-node clusters. You don't really need each 100-node group to have its own separate gmetad, though; you can still just use a single one.
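
A single gmetad.conf can simply list every group as its own data_source; each entry just needs one or two reachable gmonds from that group (the hostnames below are made up):

    data_source "group-01" node101.example.com node102.example.com
    data_source "group-02" node201.example.com node202.example.com
    data_source "group-03" node301.example.com node302.example.com

Listing a second host per group gives gmetad a fallback source if the first one is down.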


I should mention one final thing about large setups using UDP. It is probably pretty rare, and you may never run across this problem, but it is certainly something to keep in mind. When I first set things up, I had several metrics being reported by hourly cron jobs. I noticed that every hour, a small number of nodes failed to report back a couple of those metrics, and the nodes that failed changed every hour. After some extensive testing, it seems that it was the result of lost UDP packets.

Part of the problem was that every node kicked off the cron job at the same time and then transmitted the metrics at roughly the same time. The other part appeared to be related to the interaction of the gmetad and gmond processes. I had these "cron metrics" sent to two gmond servers which I'll call gmond-A and gmond-B. When gmetad got its periodic info from gmond-A, there were a few nodes that showed up as not reporting metrics. But when I checked the info on gmond-B, their recent values were indeed there. So I pointed gmetad at gmond-B to gather info, thinking it was a problem with gmond-A. After that, gmetad still showed a few nodes as not reporting, but now gmond-A had their recent values. After some other tests, I came to the conclusion that when the gmetad process connected to the gmond process, it "engaged" the gmond long enough for the socket buffer to fill up and drop a few UDP packets.

So I added a random 10 second sleep to all my cron job metrics, and that seemed to spread out the load enough to make things OK.
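
The fix itself was nothing more than a short random delay at the top of the cron script, something like this (the metric name and value are made up):

    #!/bin/bash
    # spread the reports out so ~600 nodes don't all transmit in the same instant
    sleep $((RANDOM % 10))
    gmetric --name=my_hourly_metric --value=42 --type=uint32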


I've seen similar scaling questions asked, but not a lot of answers.


How's that for answers ;-)

-- Rick

--------------------------
Rick Mohr
Systems Developer
Ohio Supercomputer Center


