RE: [Ganglia-general] Gmetad reporting nodes down when they are up

Utsav Agarwal Fri, 19 Aug 2005 08:30:30 -0700

Thanks for your explanation, Jason. That explains it. Before we settle on
a final solution, I would like to know whether the Ganglia developer
community has any 'BEST PRACTICES' for increasing the polling interval on
gmetad (to 5 minutes in our case). Should TMAX in gmond.c change? Or should
the host_alive function change? Does the heartbeat value in gmond.conf have
any effect on that relationship?

In other words, which change would best suit the Ganglia architecture and
carry the least risk of breaking other functions, if any?

Thanks,
Utsav.  

-----Original Message-----
From: Jason A. Smith [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 18, 2005 7:38 PM
To: Utsav Agarwal
Cc: [email protected];
[EMAIL PROTECTED]
Subject: Re: [Ganglia-general] Gmetad reporting nodes down when they are up

I believe that with a gmetad polling interval of 5 minutes you will
probably end up seeing a lot of your nodes as dead.  See the host_alive
function in the ganglia.php file.  The webfrontend will consider a host
alive as long as it was last heard from within the last 4*TMAX seconds,
and I believe that TMAX is set to 20 seconds in the gmond code.  Therefore
if you reload the webfrontend shortly before gmetad is about to get fresh
data, there is a good chance that most nodes will have TN greater than
4*TMAX.  It looks like ganglia3 has TMAX hard coded to 20 seconds for
hosts, see:

ganglia-3.0.1/gmond/gmond.c - line 960

I couldn't find it in the code for ganglia2, but with a running gmond it
appears to be set to 70 seconds.
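
To put rough numbers on this, here is a minimal Python sketch of the rule as described above. It is not the actual ganglia.php code. It assumes a host counts as alive while TN <= 4*TMAX, and that the TN the webfrontend sees grows by one second for every second since gmetad last polled. The names host_alive and alive_fraction, and the tn_at_poll parameter, are only illustrative.

```python
def host_alive(tn: float, tmax: float) -> bool:
    """Assumed frontend rule: a host is alive while TN <= 4 * TMAX."""
    return tn <= 4 * tmax


def alive_fraction(poll_interval: int, tmax: int, tn_at_poll: int = 0) -> float:
    """Fraction of one gmetad polling cycle during which a host looks alive.

    Assumption: the TN seen by the frontend is the TN at poll time plus the
    number of seconds elapsed since that poll.
    """
    alive_seconds = sum(
        host_alive(tn_at_poll + elapsed, tmax) for elapsed in range(poll_interval)
    )
    return alive_seconds / poll_interval


# Compare a short polling interval with the 120 s and 300 s intervals
# mentioned in this thread, using TMAX = 20 s.
for interval in (15, 120, 300):
    print(interval, alive_fraction(interval, tmax=20))
```

Under these assumptions, a host only looks alive for roughly the first 80 seconds after each poll, so both a 300 second and a 120 second polling interval leave a window in which every node appears down at once.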

~Jason

On Thu, 2005-08-18 at 18:43, Utsav Agarwal wrote:
> Hello all,
> 
>  
> 
> A quick response would help!
> 
>  
> 
> Our cluster nodes send udp unicast packets to a gmond 'collector'. The
> gmond.conf on all the nodes (compute and collector) has the following
> values: 
> 
> cleanup_threshold = 300 secs, heartbeat = 20 secs, collect_every = 300
> secs, time_threshold = 900 secs
> 
>  
> 
> Now, the gmetad server polls the gmond 'collector' every 300 secs (5
> minutes). What we see is that the nodes are sometimes shown as up and
> sometimes as down. They flap often. Generally, either all nodes are
> shown up or all nodes are shown down. Even while reporting the nodes as
> down, it also shows that it received a heartbeat within the last 20
> seconds.
> 
>  
> 
> We need to know the exact reason this is happening.
> 
>  
> 
> The gmetad.conf file has default values for rrd archives. Changing the
> gmetad server to poll every 120 seconds does not seem to solve the
> problem either.
> 
>  
> 
> Any suggestions or guidelines to follow for gmetad polling interval
> and gmond cleanup_threshold values will be appreciated.
> 
>  
> 
> Thanks,
> 
> ------------------------------------------------------------------------
> 
> Utsav Agarwal
> 
> Systems Analyst
> 
> ------------------------------------------------------------------------
> 
>  
> 
>
