Thanks for your explanation, Jason. That explains it. Before we settle on a final solution to the issue, I would like to know whether the Ganglia developer community has any 'BEST PRACTICES' for increasing the polling interval on gmetad (to 5 minutes in our case). Should TMAX in gmond.c change? Or should the host_alive function change? Does the heartbeat value in gmond.conf have any effect on that relationship?
i.e., what would be best suited to the Ganglia architecture and carry the least risk of breaking other functions, if any?

Thanks,
Utsav.

-----Original Message-----
From: Jason A. Smith [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 18, 2005 7:38 PM
To: Utsav Agarwal
Cc: [email protected]; [EMAIL PROTECTED]
Subject: Re: [Ganglia-general] Gmetad reporting nodes down when they are up

I believe that with a gmetad polling interval of 5 minutes you will probably end up seeing a lot of your nodes as dead. See the host_alive function in the ganglia.php file. The webfrontend will consider a host alive as long as it last heard from it in the last 4*TMAX seconds, and I believe that TMAX is set to 20 seconds in the gmond code. Therefore, if you reload the webfrontend shortly before gmetad is about to get fresh data, there is a good chance that most nodes will have TN greater than 4*TMAX.

It looks like ganglia3 has TMAX hard coded to 20 seconds for hosts, see:

ganglia-3.0.1/gmond/gmond.c - line 960

I couldn't find it in the code for ganglia2, but with a running gmond it appears to be set to 70 seconds.

~Jason

On Thu, 2005-08-18 at 18:43, Utsav Agarwal wrote:
> Hello all,
>
> A quick response would help!
>
> Our cluster nodes send udp unicast packets to a gmond 'collector'. The
> gmond.conf on all the nodes (compute and collector) has the following
> values:
>
> cleanup_threshold = 300 secs, heartbeat = 20 secs, collect_every = 300
> secs, time_threshold = 900 secs
>
> Now, the gmetad server polls the gmond 'collector' every 300 secs. (5
> minutes). What we see is that the nodes are shown up sometimes, and
> then down sometimes. They flap often. Generally, either all nodes are
> shown up or all nodes are shown down. While reporting the nodes are
> down, it also shows that it received a heartbeat within the last 20
> seconds.
>
> We need to know the exact reason this is happening.
>
> The gmetad.conf file has default values for rrd archives. Changing the
> gmetad server to poll every 120 seconds, does not seem to solve the
> problem either.
>
> Any suggestions or guidelines to follow for gmetad polling interval
> and gmond cleanup_threshold values will be appreciated.
>
> Thanks,
>
> ------------------------------------------------------------------------------------
> Utsav Agarwal
> Systems Analyst
> ------------------------------------------------------------------------------------
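
For illustration only, the liveness rule Jason describes (a host counts as alive while TN is at most 4*TMAX) can be sketched in a few lines of Python. This is not the code from ganglia.php. The function alive_fraction and the tn_at_poll parameter are hypothetical names, and the sketch assumes that the TN seen by the webfrontend is the TN at the last gmetad poll plus the seconds elapsed since that poll.

```python
def host_alive(tn: float, tmax: float) -> bool:
    """Rule as described in this thread: alive while TN <= 4 * TMAX."""
    return tn <= 4 * tmax


def alive_fraction(poll_interval: int, tmax: int, tn_at_poll: int = 0) -> float:
    """Fraction of one polling cycle during which a host would show as alive.

    Assumption (not taken from the Ganglia source): the TN visible to the
    frontend is tn_at_poll plus the whole seconds since gmetad last polled.
    """
    alive_seconds = sum(
        1
        for elapsed in range(poll_interval)
        if host_alive(tn_at_poll + elapsed, tmax)
    )
    return alive_seconds / poll_interval


if __name__ == "__main__":
    # Intervals mentioned in the thread (120 and 300), plus a short one for contrast.
    for interval in (15, 120, 300):
        print(f"poll every {interval}s, TMAX=20: "
              f"alive for {alive_fraction(interval, tmax=20):.0%} of the cycle")
```

Under these assumptions, with TMAX at 20 seconds a host stops looking alive about 80 seconds after each poll, so both the 300-second and the 120-second polling intervals leave a window in which every node appears down at once. That would be consistent with the flapping described in the original report.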

