How many nodes are involved here? I've seen gaps show up in the RRD graphs when the gmetad machine gets I/O bound and updates don't make it into the RRD databases. That alone shouldn't make a machine appear "down" (that state is held in memory by gmetad), but it may be related.
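One quick thing to verify, since gmetad keys each cluster off the port it polls: every data_source line in gmetad.conf should use a distinct port. A minimal sketch (cluster names and hostnames are made up; what matters is only that the ports differ):

  # gmetad.conf -- poll each cluster on its own port
  data_source "compute" 15 head1.example.com:8649
  data_source "storage" 15 head2.example.com:8650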
I've used both unicast and multicast configs on 1000+ machines and have never seen large swaths of nodes appear down unless the machine listed in gmetad.conf is down or there's a real network problem.

Have you modified "time_threshold" in gmond.conf on your cluster nodes? Machines may simply not be reporting their metrics often enough.

Last, ganglia has trouble with multiple data_source entries in gmetad.conf that use the same port. Because ganglia divides everything up by port, it will choose arbitrarily which cluster to update metrics for; if it favors one cluster over the other, a lot of machines in the neglected cluster can appear to be down.

Matthias Blankenhaus wrote:
> Sturgis,
>
> I have seen flaky behaviour when using the standard Ganglia configuration,
> which is based on multicasting. I recommend changing to unicast. Search
> the documentation and this list to find examples.
>
> Cheers,
> Matthias
>
> On Tue, 10 Jul 2007, Sturgis, Grant wrote:
>
>> Greetings list,
>>
>> I'm new to the list, have searched the archives, and read the docs. If
>> this is a dumb question, I apologize.
>>
>> Occasionally, when the master node gets a load average above 1, almost
>> all (sometimes all) of the nodes report down on the Ganglia web page.
>> The master node isn't totally in the weeds: the load average is rarely
>> above 2, and it remains very responsive to other requests.
>>
>> I have tried restarting gmond on the nodes, and sometimes that works.
>> Basically, I just have to wait and eventually everything comes back to
>> normal.
>>
>> Is this normal, and is it something I can fix? Any suggestions are most
>> appreciated.
>>
>> RHEL 3, ganglia-gmetad-3.0.1-1, ganglia-gmond-3.0.1-1, ganglia-web-3.0.1-1
>>
>> Thanks in advance,
>>
>> Grant

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general

