Joel,
I have replied to your questions below. Just a little background: My setup
monitors almost 600 hosts. I also wanted to measure many other metrics beyond
the default 30 or so that are built into Ganglia, so I hacked the Ganglia source
to compile in a few others and added some cron jobs to report other metrics too.
I estimate that I am monitoring about 85 metrics per host.
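(For the cron-job metrics, the easiest route is Ganglia's gmetric tool. Just as
a minimal sketch -- the metric name and script here are made up, not what I
actually report:

    #!/bin/sh
    # Hypothetical hourly cron script: push the number of logged-in
    # users into Ganglia as a custom metric.
    USERS=$(who | wc -l)
    gmetric --name=logged_in_users --value="$USERS" \
            --type=uint32 --units=users

Anything you can compute in a script can be fed in that way.)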
- Put gmetad rrd files on a ramdisk.
Definitely. When I originally set things up, I had the rrds on an ext3
filesystem. The load on the system was always around 4. I tweaked some of the
VM settings in /proc and got the load down to around 3, but that was the best I
could do. Turns out that the real killer was the journaling. I put the VM
settings back to their default values and remounted the filesystem as ext2.
The load dropped to about 0.5! Huge improvement.
But since we had plans to put even more stuff on the server, I decided to play
it safe and write to a tmpfs filesystem. For my setup, the metrics use about
600 MB of RAM. I rsync it to disk every 10 minutes, and I modified the
start/stop init scripts to restore/save data. I think when I moved to a
ram-based filesystem, the load from Ganglia dropped to something like 0.2 or
0.3. (I don't remember the exact numbers, but I do know that we have since
added Nagios on the same system to monitor about 670 hosts and 850 services.
And the average load on that system is still only around 0.5 or less.)
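If it helps, the basic recipe looks something like the sketch below. The paths,
the tmpfs size, and the on-disk copy location are just placeholders, not my
exact setup:

    # Mount a tmpfs filesystem over the rrd directory (size is an example).
    mount -t tmpfs -o size=1024m tmpfs /var/lib/ganglia/rrds

    # Cron entry: copy the in-memory rrds to disk every 10 minutes.
    */10 * * * * root rsync -a /var/lib/ganglia/rrds/ /var/lib/ganglia/rrds.disk/

    # And in the gmetad init script:
    #   on start: rsync -a /var/lib/ganglia/rrds.disk/ /var/lib/ganglia/rrds/
    #   on stop:  rsync -a /var/lib/ganglia/rrds/ /var/lib/ganglia/rrds.disk/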
- Use TCP polling queries instead of UDP or Multicast push. (disable
UDP/multicast pushing) I'd prefer to let gmetad poll instead of having 1000
UDP messages flying around on odd intervals. A good practice?
Since gmond can't poll, I assume that you are talking about having gmetad query
every gmond to gather the data? If so, I doubt it is a good idea.
First of all, every node would appear to be its own cluster, so you wouldn't be
able to see graphs for all nodes on the same page or see nice summary graphs for
the entire cluster.
Second, (assuming the default poll rate) gmetad would be trying to set up and
tear down 1000 TCP connections every 15 secs. Sure, you could change that poll
rate (more on that below), but I think UDP is a much better way to handle it. I
would suggest, however, that you consider using UDP unicast instead of multicast.
It would cut down on the number of packets on your network. Even though a
single metric packet is only about 60-70 bytes, I don't see the need to send
around any extra data that I don't have to. The multicast method does provide
you with redundancy (since every node knows the state of every other node), but
if that is a concern, just unicast the packets to two hosts instead of one.
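For what it's worth, in a 3.x-style gmond.conf the switch from multicast to
unicast is basically just replacing the mcast_join line with one or two host
lines. The host names below are made up:

    /* On every node: send metrics to two collector hosts for redundancy */
    udp_send_channel {
      host = ganglia-collector1.example.com
      port = 8649
    }
    udp_send_channel {
      host = ganglia-collector2.example.com
      port = 8649
    }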
Richard Grevis made a comment in a previous reply about possibly having the
nodes unicast their metrics back to the server running gmetad so that the gmetad
process only had to connect to the loopback interface to retrieve data. He said
that he hadn't tried it so he wasn't sure if it would work well. Fortunately, I
can tell you for certain that it does since that is how I run my setup. We have
a couple of clusters with about 250 nodes each as well as several smaller ones.
On our server, I run a separate gmond process for each cluster. The nodes
report their metrics to these gmonds. The gmetad process then contacts them to
get its data. It seems to work quite well.
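Roughly, the server side of that setup looks like the sketch below. The config
file names, cluster names, and ports are only illustrative:

    # One gmond per cluster on the gmetad server, each with its own
    # config file and its own listen port.
    gmond --conf=/etc/ganglia/gmond-clusterA.conf   # listens on port 8650
    gmond --conf=/etc/ganglia/gmond-clusterB.conf   # listens on port 8651

    # gmetad.conf: poll each local gmond over the loopback interface.
    data_source "clusterA" localhost:8650
    data_source "clusterB" localhost:8651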
(I have one other comment on using UDP. It is a rather long story, so I
included it at the end.)
- Alter timers for lighter network load? examples? ideas? Was going to just go
to 30 or 60s timers in gmetad.conf cluster definition to start.
That could definitely help. I still use gmetad's default 15 secs because I
haven't had any problems with it so far. But I have done some experimenting to
see how easily I could change that if it ever becomes an issue in the future.
Plus there is the fact that even if gmetad gets information every 15 secs, that
doesn't mean the gmonds have to report it every 15 secs :-) You should
certainly take a look at what metrics are being collected and decide which ones
are truly useful to you. Keep those and disable the rest. And for those that
you do keep, look at the default reporting intervals. You may find that you
don't need to report the metrics as often, or perhaps the thresholds can be set
higher so that only larger changes are reported.
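To be concrete about where those knobs live (the values below are only
examples): the reporting interval and change threshold sit in the
collection_group blocks of a 3.x gmond.conf, and the gmetad poll rate is the
optional number in the data_source line.

    /* gmond.conf: sample load_one every 60 secs, but only send it on
       if it changed by more than 1.0 or 5 minutes have passed. */
    collection_group {
      collect_every = 60
      time_threshold = 300
      metric {
        name = "load_one"
        value_threshold = "1.0"
      }
    }

    # gmetad.conf: poll this data source every 60 secs instead of 15.
    data_source "my cluster" 60 localhost:8649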
- Consider "federating"? Create groups of 100 gmond hosts managed by single
gmetas, all linking up to a core gmetad.
That is certainly a possibility. Ganglia makes it easy to treat 1000 nodes as
one cluster or as ten 100-node clusters. You don't really need to have each
100-node group managed by its own separate gmetad, though; you can still just
use a single one.
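In gmetad.conf that just means one data_source line per group, all handled by
the same gmetad (names and ports made up):

    data_source "group01" localhost:8650
    data_source "group02" localhost:8651
    # ... one line per 100-node group ...
    data_source "group10" localhost:8659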
I should mention one final thing about large setups using UDP. It is probably
pretty rare, and you may never run across this problem. But it is certainly
something to keep in mind. When I first set things up, I had several metrics
being reported by hourly cron jobs. I noticed that every hour, there were a
small number of nodes that failed to report back a couple of those metrics.
The nodes that failed to do so changed every hour. After some extensive
testing, it seems that it was the result of lost UDP packets. Part of the
problem was that every node kicked off the cron job at the same time, and then
transmitted the metrics at roughly the same time. The other part appeared to be
related to the interaction of the gmetad and gmond processes. I had these "cron
metrics" sent to two gmond servers which I'll call gmond-A and gmond-B. When
gmetad got its periodic info from gmond-A, there were a few nodes that showed up
as not reporting metrics. But when I checked the info on gmond-B, their recent
values were indeed there. So I pointed gmetad at gmond-B to gather info
thinking it was because of a problem w/ gmond-A. After that, gmetad still
showed a few nodes as not reporting, but now gmond-A had their recent values.
After some other tests, I came to the conclusion that when the gmetad process
connected to the gmond process, it "engaged" the gmond long enough to cause the
socket buffer to fill up and lose a few UDP packets. So I added a random 10 sec
sleep to all my cron job metrics, and that seemed to spread out the load enough
to make things OK.
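The fix itself was nothing fancy; the top of each cron script now does roughly
this (bash-specific, and the 10 sec window is just what worked for me):

    #!/bin/bash
    # Stagger the start so 600 nodes don't all hit the collectors
    # at the same instant.
    sleep $((RANDOM % 10))
    # ... then compute and send the metrics with gmetric as before ...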
I've seen similar scaling questions asked, but not a lot of answers.
How's that for answers ;-)
-- Rick
--------------------------
Rick Mohr
Systems Developer
Ohio Supercomputer Center