How many nodes are involved here? I've seen gaps show up in the RRD graphs when the gmetad machine gets I/O bound and updates don't make it into the RRD databases. That alone shouldn't make a machine appear "down" (that state is held in memory by gmetad), but it may be related.
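One quick thing to verify, since gmetad keys each cluster off the port it polls: every data_source line in gmetad.conf should use a distinct port. A minimal sketch (cluster names and hostnames are made up; what matters is only that the ports differ):

  # gmetad.conf -- poll each cluster on its own port
  data_source "compute" 15 head1.example.com:8649
  data_source "storage" 15 head2.example.com:8650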
I've used both unicast and multicast configs on 1000+ machines and have never seen large swaths of nodes appear down unless the machine listed in gmetad.conf is down or there's a real network problem.

Have you modified "time_threshold" in gmond.conf on your cluster nodes? Machines may simply not be reporting their metrics often enough.

Last, ganglia has trouble with multiple data_source entries in gmetad.conf that use the same port. Because ganglia divides everything up by port, it will choose arbitrarily which cluster to update metrics for; if it favors one cluster over the other, a lot of machines in the neglected cluster can appear to be down.

Matthias Blankenhaus wrote:
> Sturgis,
>
> I have seen flaky behaviour when using the standard Ganglia configuration,
> which is based on multicasting. I recommend changing to unicast. Search
> the documentation and this list to find examples.
>
> Cheers,
> Matthias
>
> On Tue, 10 Jul 2007, Sturgis, Grant wrote:
>
>> Greetings list,
>>
>> I'm new to the list, have searched the archives, and read the docs. If
>> this is a dumb question, I apologize.
>>
>> Occasionally, when the master node gets a load average above 1, almost
>> all (sometimes all) of the nodes report down on the Ganglia web page.
>> The master node isn't totally in the weeds: the load average is rarely
>> above 2, and it remains very responsive to other requests.
>>
>> I have tried restarting gmond on the nodes, and sometimes that works.
>> Basically, I just have to wait and eventually everything comes back to
>> normal.
>>
>> Is this normal, and is it something I can fix? Any suggestions are most
>> appreciated.
>>
>> RHEL 3, ganglia-gmetad-3.0.1-1, ganglia-gmond-3.0.1-1, ganglia-web-3.0.1-1
>>
>> Thanks in advance,
>>
>> Grant

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general

