I have commented on this multicast question before, and I honestly do not understand the fear of using multicast. Like Rick said below, the packets are small (60 bytes) and each node sends only about 20 per minute on average. Do the math: even with a large cluster you are talking about a tiny fraction of the network's capacity. For us, redundancy is important because we have many clusters and do not want to single out one node in each cluster as a special monitoring node that needs to be up 24/7 in order to collect ganglia monitoring data. With multicast's redundancy, we don't have to worry about the one special node out of our 400+ node cluster crashing and taking ganglia down with it. If the node gmetad is currently getting data from has a problem, it will transparently switch to another.
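Back of the envelope: 400 nodes * 20 packets/minute * ~60 bytes works out to roughly 8 KB/sec for the whole cluster, or about 0.06 Mbit/s -- a rounding error even on a 100 Mbit network. The real measurements below bear that out.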
Our largest cluster has 427 Linux nodes with a basic gmond using all of the default metrics. I ran a tcpdump for about 10 minutes to capture all of the multicast data in that cluster, and here is what I found (*): 5,256,772 bytes collected in 628.45 seconds from 87,582 packets. In other words, the average rate was:

    139.362 packets/second
    8364.668 bytes/second
    0.066 Mbits/second

(*) Directly from ethereal's Summary window.

This is noise on a modern 1 Gbit network, or even on a 100 Mbit network.

~Jason

On Fri, 2006-01-27 at 11:26 -0500, Rick Mohr wrote:
> Joel,
>
> I have replied to your questions below. Just a little background: my setup monitors almost 600 hosts. I also wanted to measure many other metrics beyond the default 30 or so that are built into Ganglia, so I hacked the Ganglia source to compile in a few others and added some cron jobs to report other metrics too. I estimate that I am monitoring about 85 metrics per host.
>
> > - Put gmetad rrd files on a ramdisk.
>
> Definitely. When I originally set things up, I had the rrds on an ext3 filesystem. The load on the system was always around 4. I tweaked some of the VM settings in /proc and got the load down to around 3, but that was the best I could do. Turns out that the real killer was the journaling. I put the VM settings back to their default values and remounted the filesystem as ext2. The load dropped to about 0.5! Huge improvement.
>
> But since we had plans to put even more stuff on the server, I decided to play it safe and write to a tmpfs filesystem. For my setup, the metrics use about 600 MB of RAM. I rsync it to disk every 10 minutes, and I modified the start/stop init scripts to restore/save the data. I think when I moved to a RAM-based filesystem, the load from Ganglia dropped to something like 0.2 or 0.3. (I don't remember the exact numbers, but I do know that we have since added Nagios on the same system to monitor about 670 hosts and 850 services, and the average load on that system is still only around 0.5 or less.)
>
> > - Use TCP polling queries instead of UDP or Multicast push. (disable UDP/multicast pushing) I'd prefer to let gmetad poll instead of having 1000 UDP messages flying around on odd intervals. A good practice?
>
> Since gmond can't poll, I assume that you are talking about having gmetad query every gmond to gather the data? If so, I doubt it is a good idea.
>
> First of all, every node would appear to be its own cluster, so you wouldn't be able to see graphs for all nodes on the same page or see nice summary graphs for the entire cluster.
>
> Second, (assuming the default poll rate) gmetad would be trying to set up and tear down 1000 TCP connections every 15 secs. Sure, you could change that poll rate (more on that below), but I think UDP is a much better way to handle it. I would suggest, however, that you consider using UDP unicast instead of multicast. It would cut down on the number of packets on your network. Even though a single metric packet is only about 60-70 bytes, I don't see the need to send around any extra data that I don't have to. The multicast method does provide you with redundancy (since every node knows the state of every other node), but if that is a concern, just unicast the packets to two hosts instead of one.
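To make Rick's unicast suggestion concrete, here is a rough, untested sketch of what it could look like with the newer 3.x-style config files. The host names, port, and cluster name are placeholders, and if you are still on the 2.5.x config format the directives are different, so treat this purely as an illustration of the layout:

    # gmond.conf on every compute node: push metrics to two collector hosts
    udp_send_channel {
      host = "mon1.example.com"
      port = 8649
    }
    udp_send_channel {
      host = "mon2.example.com"
      port = 8649
    }

    # gmond.conf on the two collector hosts: accept the UDP metric packets
    # and answer gmetad's TCP polls
    udp_recv_channel {
      port = 8649
    }
    tcp_accept_channel {
      port = 8649
    }

    # gmetad.conf: list both collectors so gmetad can fail over if one dies
    data_source "my cluster" 15 mon1.example.com:8649 mon2.example.com:8649

With both collectors on the data_source line you keep the redundancy Rick mentions without the extra multicast traffic.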
> Richard Grevis made a comment in a previous reply about possibly having the nodes unicast their metrics back to the server running gmetad, so that the gmetad process only had to connect to the loopback interface to retrieve data. He said that he hadn't tried it, so he wasn't sure if it would work well. Fortunately, I can tell you for certain that it does, since that is how I run my setup. We have a couple of clusters with about 250 nodes each, as well as several smaller ones. On our server, I run a separate gmond process for each cluster. The nodes report their metrics to these gmonds, and the gmetad process then contacts them to get its data. It seems to work quite well.
>
> (I have one other comment on using UDP. It is a rather long story, so I included it at the end.)
>
> > - Alter timers for lighter network load? examples? ideas? Was going to just go to 30 or 60s timers in gmetad.conf cluster definition to start.
>
> That could definitely help. I still use gmetad's default 15 secs because I haven't had any problems with it so far. But I have done some experimenting to see how easily I could change that if it ever becomes an issue in the future.
>
> Plus there is the fact that even if gmetad gets information every 15 secs, that doesn't mean the gmonds have to report it every 15 secs :-)  You should certainly take a look at what metrics are being collected and decide which ones are truly useful to you. Keep those and disable the rest. And for those that you do keep, look at the default reporting intervals. You may find that you don't need to report the metrics as often, or perhaps the thresholds can be set higher so that only larger changes are reported.
>
> > - Consider "federating"? Create groups of 100 gmond hosts managed by single gmetas, all linking up to a core gmetad.
>
> That is certainly a possibility. Ganglia makes it easy to treat 1000 nodes as one cluster or as 10 100-node clusters. You don't really need to give each 100-node group its own separate gmetad, though; you can still just use a single one.
>
> I should mention one final thing about large setups using UDP. It is probably pretty rare, and you may never run across this problem, but it is certainly something to keep in mind. When I first set things up, I had several metrics being reported by hourly cron jobs. I noticed that every hour, a small number of nodes failed to report back a couple of those metrics, and the nodes that failed changed every hour. After some extensive testing, it seemed to be the result of lost UDP packets. Part of the problem was that every node kicked off the cron job at the same time, and then transmitted the metrics at roughly the same time. The other part appeared to be related to the interaction of the gmetad and gmond processes. I had these "cron metrics" sent to two gmond servers, which I'll call gmond-A and gmond-B. When gmetad got its periodic info from gmond-A, there were a few nodes that showed up as not reporting metrics. But when I checked the info on gmond-B, their recent values were indeed there. So I pointed gmetad at gmond-B to gather info, thinking it was a problem with gmond-A. After that, gmetad still showed a few nodes as not reporting, but now gmond-A had their recent values.
>
> After some other tests, I came to the conclusion that when the gmetad process connected to the gmond process, it "engaged" the gmond long enough to cause the socket buffer to fill up and lose a few UDP packets. So I added a random 10 sec sleep to all my cron job metrics, and that seemed to spread out the load enough to make things OK.
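The random sleep is easy to fold into the cron job itself. A small sketch, assuming bash as the script shell and a made-up gmetric metric (the metric name, path, and type are only illustrative):

    #!/bin/bash
    # Stagger the start by a random 0-9 seconds so that hundreds of nodes
    # do not all fire their gmetric packets at the collectors in the same
    # instant (possibly while gmetad has a gmond tied up in a TCP poll).
    sleep $((RANDOM % 10))
    # Illustrative metric: free space on /scratch, in KB.
    /usr/bin/gmetric --name=scratch_free_kb --units=KB --type=uint32 \
        --value=$(df -P /scratch | awk 'NR==2 {print $4}')

Run something like that from cron.hourly or an hourly crontab entry; anything that spreads the transmissions out over a few seconds should have the same effect as Rick's fix.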
> > I've seen similar scaling questions asked, but not a lot of answers.
>
> How's that for answers ;-)
>
> -- Rick
>
> --------------------------
> Rick Mohr
> Systems Developer
> Ohio Supercomputer Center

--
/------------------------------------------------------------------\
| Jason A. Smith                          Email: [EMAIL PROTECTED] |
| Atlas Computing Facility, Bldg. 510M    Phone: (631)344-4226     |
| Brookhaven National Lab, P.O. Box 5000  Fax: (631)344-7616       |
| Upton, NY 11973-5000                                             |
\------------------------------------------------------------------/

