---- Original Message ----

> From: Rick Cobb <rc...@quantcast.com>
> To: Cameron Spitzer <cspit...@nvidia.com>
> Cc: "ganglia-general@lists.sourceforge.net" 
> <ganglia-general@lists.sourceforge.net>
> Sent: Sat, February 27, 2010 4:03:05 AM
> Subject: Re: [Ganglia-general] replaced a host, new host not seen
> 
> Well, one cause of the confusion is your /etc/ganglia/gmetad.conf data_source 
> entry.  It should *only* list the addresses of gmonds that collect all metrics 
> for a cluster, and only one of your gmonds is doing that.
> 

 Correct, listing gmonds that do not have all the information is the road to 
disaster.
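
 For example, if two gmonds really do hold all metrics for a cluster (as in the 
two-headnode setup I describe below), a data_source line could look like this 
(cluster name, host names and the 15-second polling interval are only 
placeholders):

    # /etc/ganglia/gmetad.conf
    data_source "my cluster" 15 headnode1.example.com:8649 headnode2.example.com:8649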

> The Ganglia architecture can be very confusing. A 'gmond' has 3 tasks, and all 
> but one of yours are only doing one of them:
> * Measure things about the local host and send them to the 'udp_send_channel'.

 which, in the case of multicast, means "send it to every gmond that cares (is 
listening)". In the case of unicast, gmond sends a copy to *all* configured 
"udp_send_channel"s. This is what I usually do: two servers act as headnodes for 
the monitoring, and every monitored client has two "udp_send_channel" entries, 
sending its data to both headnodes.
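
 A sketch of the client-side gmond.conf for that layout (host names are 
placeholders, 8649 is the default port):

    # gmond.conf on every monitored node: unicast the local metrics to both headnodes
    udp_send_channel {
      host = headnode1.example.com
      port = 8649
    }
    udp_send_channel {
      host = headnode2.example.com
      port = 8649
    }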

 I call the gmonds doing this first task "collectors", as they collect the data 
in the first place. And I made a mistake in my earlier reply :-(
 
> * Receive measurements from any gmond (even itself) or gmetric on the 
> 'udp_recv_channel' and put them in a local data structure, which is basically a 
> set (hash) of hosts with a set of current metrics per host. This is the step 
> that resolves addresses to names.
> * Answer requests from gmetad for the whole cluster's metrics. (It does this on 
> the tcp_accept_channel).  Gmond just serializes the whole metrics data structure 
> into an XML document as the reply.
>

 In my usual setup, these two functionalities reside on the headnode gmonds, 
which I call "aggregators".
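
 On those headnodes the gmond.conf additionally carries something like this 
(again only a sketch, default port assumed):

    # gmond.conf on the headnode ("aggregator") gmonds
    udp_recv_channel {
      port = 8649
    }
    tcp_accept_channel {
      port = 8649
    }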

> If you have all your gmonds sending to one unicast address, only one of your 
> gmonds *has* all the metrics for that cluster.  That's what Martin called 
> "designated as a collector."   In that case, your data_source line should 
> only 

 Actually I wanted to write "aggregator" for these gmonds. 

> include that gmond (host). Adding the others can only cause problems -- if the 
> first gmond fails, your gmetad will contact the second one in the list, and that 
> won't actually have any metrics on it, since no one (including itself) is 
> sending it any.  All your nodes will (gradually as timeouts expire) appear to be 
> down.  'gmond' will expire hosts if your gmond.conf has a non-zero 'host_dmax' 
> entry (see http://linux.die.net/man/5/gmond.conf, among others).
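
 host_dmax lives in the globals section of gmond.conf; a sketch with an 
arbitrary one-hour timeout:

    globals {
      host_dmax = 3600   # seconds without data before gmond drops a host; 0 = keep forever
    }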
> 
> 'gmetad' is an entirely different beast from gmond; sometimes I think it was 
> written by a completely different team.  It polls your gmonds, writes the 
> numeric metric values to RRDtool files, and responds to queries for (subsets of)
> metrics so front-ends can present them.  It has *no* relationship with your 
> udp_send_channel or udp_receive_channel; also, it has almost no (AFAIK) 
> relationship to your network infrastructure -- it doesn't reverse-lookup 
> addresses, for example.
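
 (With a default rrd_rootdir those RRDtool files typically end up under a tree 
like the one below -- the exact path depends on your build/packaging:

    /var/lib/ganglia/rrds/<cluster name>/<host name>/<metric>.rrd

 which is also the directory Cameron cleaned out by hand further down.)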
> 
> On the other hand, it does combine all the metrics for a cluster into a 
> long-term in-memory data structure, and then combines those into a single 
> 'grid-level' data structure.  In gmetad, metrics (including a last-heard-from 
> metric ('RECORDED') for a host) can expire, but hosts just go 'down'; they never
> go away.
> 
> So: if you haven't set a host_dmax, you have to stop gmetad, restart every gmond 
> that the gmetad could talk to (i.e., everything on the data_source line), start
> gmetad.  In your case, there's only one gmond that gmetad should talk to, so 
> simplify your life by removing the rest from your data_source line.  I'd set 
> host_dmax, too, but that's a matter of taste.
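
 Spelled out, that sequence is roughly this (init-script names vary by 
distribution):

    # on the gmetad host
    /etc/init.d/gmetad stop
    # on every host listed in data_source (here: just the one aggregator gmond)
    /etc/init.d/gmond restart
    # back on the gmetad host
    /etc/init.d/gmetad start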
> 
> -- ReC
> On Feb 26, 2010, at 12:22 PM, Cameron Spitzer wrote:
> 
> > 
> > I was able to remove the "dead" host (that isn't really dead) from the 
> > overview display.
> > I had to kill all gmond's everywhere, and the gmetad.
> > Then I removed the rrd files for the "dead" host from gmetad's rrds 
> > directory,
> > and the rrd directory itself.
> > Then I removed the "dead" host's IP address from gmetad.conf.
> > Then I brought up all the gmonds (except the "dead" one) and then the 
> > gmetad.
> > Apparently, these steps will have to be added to our failover procedure.
> > 
> > 
> > Martin Knoblauch wrote:
> >> 
> >>> ...
> >> 
> >>  Also, just to better understand the situation, what is the exact setup? Is 
> >> one of the "gmond"s designated as a collector? Or do all "gmond"s carry all 
> >> metrics from all hosts? Which "gmond" is queried by "gmetad" (snippet from 
> >> config file)? You should telnet/nc to that "gmond" and check whether it has 
> >> current metrics from "B".
> >> 
> >>  
> > I don't know what "designated as a collector" means.

 s/collector/aggregator/ and see above.
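
 The check I suggested boils down to something like this ("B" stands for the 
host in question, the aggregator host name is a placeholder, 8649 is the default 
tcp_accept_channel port):

    nc aggregator-host 8649 | grep 'HOST NAME="B"'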

> > Nor do I know how to control which gmonds carry all metrics from which 
> > hosts.  There is only one udp_send_channel in gmond.conf, and the host in 
> > there is the one running gmetad.
> > My /etc/ganglia/gmetad.conf file has only one line in it: data_source 
> > "clustername" followed by a list of IP addresses of all the gmond hosts.

 That is most likely your problem: not all of your gmonds have all the data, 
only the one on your gmetad host does.
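
 Given that, your data_source line should probably list only that one gmond, 
e.g. (placeholder address, default port):

    data_source "clustername" 192.0.2.10:8649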

> > (My original understanding was the gmetad queries each gmond, or the gmonds 
> > all report to the gmetad.  So I just listed all the IP addresses there.  But 
> > now it seems the flow is more complex than that.)

 As Rick said: gmetad tries to query the first gmond in the list, and only if 
that one does not answer will it try the next one.

> > I don't have a manpage for gmetad.conf, so I just guessed what to put in 
> > there from the sample file.
> > 
