Re: [Ganglia-general] replaced a host, new host not seen

Rick Cobb Fri, 26 Feb 2010 19:05:55 -0800

Well, one cause of the confusion is your /etc/ganglia/gmetad.conf data_source 
entry.  It should list *only* the addresses of gmonds that collect all metrics 
for a cluster, and only one of your gmonds is doing that.

The Ganglia architecture can be very confusing. A 'gmond' has 3 tasks, and all 
but one of yours are only doing one of them:
* Measure things about the local host and send them to the 'udp_send_channel'.
* Receive measurements from any gmond (even itself) or gmetric on the 
'udp_recv_channel' and put them in a local data structure, which is basically a 
set (hash) of hosts with a set of current metrics per host. This is the step 
that resolves addresses to names.
* Answer requests from gmetad for the whole cluster's metrics. (It does this on 
the tcp_accept_channel.)  Gmond just serializes the whole metrics data structure 
into an XML document as the reply.
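
You can watch that third task from the outside. Here is a minimal Python 
sketch, assuming the tcp_accept_channel is on the default port 8649 and that 
the reply contains HOST elements carrying NAME, IP and REPORTED attributes. 
The function names are made up for illustration; they are not part of Ganglia.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Connect to a gmond tcp_accept_channel and read the XML dump until EOF."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection once the dump is sent
                break
            chunks.append(data)
    # Return raw bytes so the parser can honor the encoding the document declares.
    return b"".join(chunks)


def hosts_in_gmond_xml(xml_data):
    """Return {host name: (ip, last-reported unix time)} for every HOST element."""
    root = ET.fromstring(xml_data)
    return {
        h.get("NAME"): (h.get("IP"), int(h.get("REPORTED", "0")))
        for h in root.iter("HOST")
    }
```

Calling hosts_in_gmond_xml(fetch_gmond_xml("collector.example.com")) against 
each gmond in turn (the hostname is a placeholder) shows which ones actually 
hold the cluster's hosts and which hold little or nothing.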

If you have all your gmonds sending to one unicast address, only one of your 
gmonds *has* all the metrics for that cluster.  That's what Martin called 
"designated as a collector."   In that case, your data_source line should only 
include that gmond (host). Adding the others can only cause problems -- if the 
first gmond fails, your gmetad will contact the second one in the list, and 
that won't actually have any metrics on it, since no one (including itself) is 
sending it any.  All your nodes will (gradually as timeouts expire) appear to 
be down.  'gmond' will expire hosts if your gmond.conf has a non-zero 
'host_dmax' entry (see http://linux.die.net/man/5/gmond.conf, among others).
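
To make the failover problem concrete, here is a toy Python model of it. It 
only mirrors the behaviour described above (gmetad moves on to the next 
address when the first one fails); it is not gmetad's real code, and the 
addresses and function names are invented for the example.

```python
def first_responding_source(sources, is_up):
    """Return the first address, in data_source order, that answers; else None."""
    for addr in sources:
        if is_up(addr):
            return addr
    return None


def visible_hosts(sources, is_up, metrics_held):
    """Hosts gmetad would see, given which hosts each gmond holds metrics for."""
    addr = first_responding_source(sources, is_up)
    return set() if addr is None else set(metrics_held.get(addr, ()))


# Hypothetical unicast setup: only 10.0.0.1 receives everyone's metrics.
held = {"10.0.0.1": {"a", "b", "c"}, "10.0.0.2": set()}
order = ["10.0.0.1", "10.0.0.2"]

all_up = visible_hosts(order, lambda addr: True, held)
collector_down = visible_hosts(order, lambda addr: addr != "10.0.0.1", held)
```

With everything up, gmetad sees a, b and c. With the collector down, it gets 
a valid but empty answer from 10.0.0.2, which is no better than having no 
second entry at all.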

'gmetad' is an entirely different beast from gmond; sometimes I think it was 
written by a completely different team.  It polls your gmonds, writes the 
numeric metric values to RRDtool files, and responds to queries for (subsets 
of) metrics so front-ends can present them.  It has *no* relationship with your 
udp_send_channel or udp_recv_channel; also, it has almost no (AFAIK) 
relationship to your network infrastructure -- it doesn't reverse-lookup 
addresses, for example.

On the other hand, it does combine all the metrics for a cluster into a 
long-term in-memory data structure, and then combines those into a single 
'grid-level' data structure.  In gmetad, metrics (including a last-heard-from 
metric ('REPORTED') for a host) can expire, but hosts just go 'down'; they 
never go away.

So: if you haven't set a host_dmax, you have to stop gmetad, restart every 
gmond that the gmetad could talk to (i.e., everything on the data_source line), 
and then start gmetad again.  In your case, there's only one gmond that gmetad 
should talk to, so simplify your life by removing the rest from your 
data_source line.  I'd set host_dmax, too, but that's a matter of taste.
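
For what host_dmax buys you, here is a rough Python sketch of the rule as I 
read the gmond.conf man page: the value is in seconds, 0 means never forget a 
host, and a positive value means drop a host once it has been silent that 
long. The function and its arguments are illustrative only; gmond's own code 
is C and I haven't checked how it treats the exact boundary.

```python
def expire_hosts(last_heard, now, host_dmax):
    """Return the hosts that would be kept.

    last_heard maps host name -> unix time of the last packet received.
    host_dmax is in seconds; 0 disables expiry entirely.
    """
    if host_dmax == 0:
        return dict(last_heard)
    return {
        host: seen
        for host, seen in last_heard.items()
        if now - seen <= host_dmax
    }
```

With host_dmax left at 0, a replaced host stays in the collector's table until 
that gmond is restarted, which is why the restart dance above is needed.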

-- ReC
On Feb 26, 2010, at 12:22 PM, Cameron Spitzer wrote:

> 
> I was able to remove the "dead" host (that isn't really dead) from the 
> overview display.
> I had to kill all gmonds everywhere, and the gmetad.
> Then I removed the rrd files for the "dead" host from gmetad's rrds directory,
> and the rrd directory itself.
> Then I removed the "dead" host's IP address from gmetad.conf.
> Then I brought up all the gmonds (except the "dead" one) and then the gmetad.
> Apparently, these steps will have to be added to our failover procedure.
> 
> 
> Martin Knoblauch wrote:
>> 
>>> ...
>> 
>>  Also, just to better understand the situation, what is the exact setup? Is 
>> one of the "gmond"s designated as a collector? Or do all "gmond"s carry all 
>> metrics from all hosts? Which "gmond" is queried by "gmetad" (snippet from 
>> config file)? You should telnet/nc to that "gmond" and check whether it has 
>> current metrics from "B".
>> 
>>   
> I don't know what "designated as a collector" means.
> Nor do I know how to control which gmonds carry all metrics from which hosts.
> There is only one udp_send_channel in gmond.conf, and the host in there is
> the one running gmetad.
> My /etc/ganglia/gmetad.conf file has only one line in it: data_source
> "clustername" followed by a list of IP addresses of all the gmond hosts.
> (My original understanding was that the gmetad queries each gmond, or the
> gmonds all report to the gmetad.  So I just listed all the IP addresses
> there.  But now it seems the flow is more complex than that.)
> I don't have a manpage for gmetad.conf, so I just guessed what to put in
> there from the sample file.
> 
> -Cameron
