Joel,
I have replied to your questions below. Just a little background: My setup
monitors almost 600 hosts. I also wanted to measure many other metrics beyond
the default 30 or so that are built into Ganglia, so I hacked the Ganglia source
to compile in a few others and added some cron jobs to report other metrics too.
I estimate that I am monitoring about 85 metrics per host.
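(For the cron-job metrics, the easiest route is Ganglia's gmetric tool. Just as
a minimal sketch -- the metric name and script here are made up, not what I
actually report:

    #!/bin/sh
    # Hypothetical hourly cron script: push the number of logged-in
    # users into Ganglia as a custom metric.
    USERS=$(who | wc -l)
    gmetric --name=logged_in_users --value="$USERS" \
            --type=uint32 --units=users

Anything you can compute in a script can be fed in that way.)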
- Put gmetad rrd files on a ramdisk.
Definitely. When I originally set things up, I had the rrds on an ext3
filesystem. The load on the system was always around 4. I tweaked some of the
VM settings in /proc and got the load down to around 3, but that was the best I
could do. Turns out that the real killer was the journaling. I put the VM
settings back to their default values and remounted the filesystem as ext2.
The load dropped to about 0.5! Huge improvement.
But since we had plans to put even more stuff on the server, I decided to play
it safe and write to a tmpfs filesystem. For my setup, the metrics use about
600 MB of RAM. I rsync it to disk every 10 minutes, and I modified the
start/stop init scripts to restore/save data. I think when I moved to a
ram-based filesystem, the load from Ganglia dropped to something like 0.2 or
0.3. (I don't remember the exact numbers, but I do know that we have since
added Nagios on the same system to monitor about 670 hosts and 850 services.
And the average load on that system is still only around 0.5 or less.)
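If it helps, the basic recipe looks something like the sketch below. The paths,
the tmpfs size, and the on-disk copy location are just placeholders, not my
exact setup:

    # Mount a tmpfs filesystem over the rrd directory (size is an example).
    mount -t tmpfs -o size=1024m tmpfs /var/lib/ganglia/rrds

    # Cron entry: copy the in-memory rrds to disk every 10 minutes.
    */10 * * * * root rsync -a /var/lib/ganglia/rrds/ /var/lib/ganglia/rrds.disk/

    # And in the gmetad init script:
    #   on start: rsync -a /var/lib/ganglia/rrds.disk/ /var/lib/ganglia/rrds/
    #   on stop:  rsync -a /var/lib/ganglia/rrds/ /var/lib/ganglia/rrds.disk/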
- Use TCP polling queries instead of UDP or Multicast push. (disable
UDP/multicast pushing) I'd prefer to let gmetad poll instead of having 1000
UDP messages flying around on odd intervals. A good practice?
Since gmond can't poll, I assume that you are talking about having gmetad query
every gmond to gather the data? If so, I doubt it is a good idea.
First of all, every node would appear to be its own cluster, so you wouldn't be
able to see graphs for all nodes on the same page or see nice summary graphs for
the entire cluster.
Second, (assuming the default poll rate) gmetad would be trying to set up and
tear down 1000 TCP connections every 15 secs. Sure, you could change that poll
rate (more on that below), but I think UDP is a much better way to handle it. I
would suggest, however, that you consider using UDP unicast instead of multicast.
It would cut down on the number of packets on your network. Even though a
single metric packet is only about 60-70 bytes, I don't see the need to send
around any extra data that I don't have to. The multicast method does provide
you with redundancy (since every node knows the state of every other node), but
if that is a concern, just unicast the packets to two hosts instead of one.
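For what it's worth, in a 3.x-style gmond.conf the switch from multicast to
unicast is basically just replacing the mcast_join line with one or two host
lines. The host names below are made up:

    /* On every node: send metrics to two collector hosts for redundancy */
    udp_send_channel {
      host = ganglia-collector1.example.com
      port = 8649
    }
    udp_send_channel {
      host = ganglia-collector2.example.com
      port = 8649
    }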
Richard Grevis made a comment in a previous reply about possibly having the
nodes unicast their metrics back to the server running gmetad so that the gmetad
process only had to connect to the loopback interface to retrieve data. He said
that he hadn't tried it so he wasn't sure if it would work well. Fortunately, I
can tell you for certain that it does since that is how I run my setup. We have
a couple of clusters with about 250 nodes each as well as several smaller ones.
On our server, I run a separate gmond process for each cluster. The nodes
report their metrics to these gmonds. The gmetad process then contacts them to
get its data. It seems to work quite well.
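Roughly, the server side of that setup looks like the sketch below. The config
file names, cluster names, and ports are only illustrative:

    # One gmond per cluster on the gmetad server, each with its own
    # config file and its own listen port.
    gmond --conf=/etc/ganglia/gmond-clusterA.conf   # listens on port 8650
    gmond --conf=/etc/ganglia/gmond-clusterB.conf   # listens on port 8651

    # gmetad.conf: poll each local gmond over the loopback interface.
    data_source "clusterA" localhost:8650
    data_source "clusterB" localhost:8651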
(I have one other comment on using UDP. It is a rather long story, so I
included it at the end.)
- Alter timers for lighter network load? examples? ideas? Was going to just go
to 30 or 60s timers in gmetad.conf cluster definition to start.
That could definitely help. I still use gmetad's default 15 secs because I
haven't had any problems with it so far. But I have done some experimenting to
see how easily I could change that if it ever becomes an issue in the future.
Plus there is the fact that even if gmetad gets information every 15 secs, that
doesn't mean the gmonds have to report it every 15 secs :-) You should
certainly take a look at what metrics are being collected and decide which ones
are truly useful to you. Keep those and disable the rest. And for those that
you do keep, look at the default reporting intervals. You may find that you
don't need to report the metrics as often, or perhaps the thresholds can be set
higher so that only larger changes are reported.
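To be concrete about where those knobs live (the values below are only
examples): the reporting interval and change threshold sit in the
collection_group blocks of a 3.x gmond.conf, and the gmetad poll rate is the
optional number in the data_source line.

    /* gmond.conf: sample load_one every 60 secs, but only send it on
       if it changed by more than 1.0 or 5 minutes have passed. */
    collection_group {
      collect_every = 60
      time_threshold = 300
      metric {
        name = "load_one"
        value_threshold = "1.0"
      }
    }

    # gmetad.conf: poll this data source every 60 secs instead of 15.
    data_source "my cluster" 60 localhost:8649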
- Consider "federating"? Create groups of 100 gmond hosts managed by single
gmetas, all linking up to a core gmetad.
That is certainly a possibility. Ganglia makes it easy to treat 1000 nodes as
one cluster or as ten 100-node clusters. You don't really need to have each
100-node group managed by its own separate gmetad, though; you can still just
use a single one.
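In gmetad.conf that just means one data_source line per group, all handled by
the same gmetad (names and ports made up):

    data_source "group01" localhost:8650
    data_source "group02" localhost:8651
    # ... one line per 100-node group ...
    data_source "group10" localhost:8659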
I should mention one final thing about large setups using UDP. It is probably
pretty rare, and you may never run across this problem. But it is certainly
something to keep in mind. When I first set things up, I had several metrics
being reported by hourly cron jobs. I noticed that every hour, there were a
small number of nodes that failed to report back a couple of those metrics.
The nodes that failed to do so changed every hour. After some extensive
testing, it seems that it was the result of lost UDP packets. Part of the
problem was that every node kicked off the cron job at the same time, and then
transmitted the metrics at roughly the same time. The other part appeared to be
related to the interaction of the gmetad and gmond processes. I had these "cron
metrics" sent to two gmond servers which I'll call gmond-A and gmond-B. When
gmetad got its periodic info from gmond-A, there were a few nodes that showed up
as not reporting metrics. But when I checked the info on gmond-B, their recent
values were indeed there. So I pointed gmetad at gmond-B to gather info
thinking it was because of a problem w/ gmond-A. After that, gmetad still
showed a few nodes as not reporting, but now gmond-A had their recent values.
After some other tests, I came to the conclusion that when the gmetad process
connected to the gmond process, it "engaged" the gmond long enough to cause the
socket buffer to fill up and lose a few UDP packets. So I added a random 10 sec
sleep to all my cron job metrics, and that seemed to spread out the load enough
to make things OK.
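The fix itself was nothing fancy; the top of each cron script now does roughly
this (bash-specific, and the 10 sec window is just what worked for me):

    #!/bin/bash
    # Stagger the start so 600 nodes don't all hit the collectors
    # at the same instant.
    sleep $((RANDOM % 10))
    # ... then compute and send the metrics with gmetric as before ...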
I've seen similar scaling questions asked, but not a lot of answers.
How's that for answers ;-)
-- Rick
--------------------------
Rick Mohr
Systems Developer
Ohio Supercomputer Center