---- Original Message ----
> From: Rick Cobb <rc...@quantcast.com>
> To: Cameron Spitzer <cspit...@nvidia.com>
> Cc: "ganglia-general@lists.sourceforge.net" <ganglia-general@lists.sourceforge.net>
> Sent: Sat, February 27, 2010 4:03:05 AM
> Subject: Re: [Ganglia-general] replaced a host, new host not seen
>
> Well, one cause of the confusion is your /etc/ganglia/gmetad.conf
> data_source entry. It should *only* have the address of gmonds that
> collect all metrics for a cluster, and only one of your gmonds is doing
> that.
>
Correct, listing gmonds that do not have all the information is the way to
disaster.

> The Ganglia architecture can be very confusing. A 'gmond' has 3 tasks, and
> all but one of yours are only doing one of them:
> * Measure things about the local host and send them to the
>   'udp_send_channel'.

Which in the case of multicast means "send to every gmond that cares (is
listening)". In the case of unicast, it sends to *all* "udp_send_channels".
This is what I usually do: have two servers acting as headnodes for the
monitoring. All monitoring clients have two "udp_send_channels", sending
their data to the two headnodes. I call these gmonds "collectors", as they
collect the data in the first place. And I made a mistake in my reply :-(

> * Receive measurements from any gmond (even itself) or gmetric on the
>   'udp_recv_channel' and put them in a local data structure, which is
>   basically a set (hash) of hosts with a set of current metrics per host.
>   This is the step that resolves addresses to names.
> * Answer requests from gmetad for the whole cluster's metrics. (It does
>   this on the tcp_accept_channel.) Gmond just serializes the whole metrics
>   data structure into an XML document as the reply.

In my usual setup, these two functionalities reside on the headnode gmonds,
which I call "aggregators".

> If you have all your gmonds sending to one unicast address, only one of
> your gmonds *has* all the metrics for that cluster. That's what Martin
> called "designated as a collector." In that case, your data_source line
> should only

Actually I wanted to write "aggregator" for these gmonds.

> include that gmond (host). Adding the others can only cause problems -- if
> the first gmond fails, your gmetad will contact the second one in the
> list, and that won't actually have any metrics on it, since no one
> (including itself) is sending it any. All your nodes will (gradually as
> timeouts expire) appear to be down.
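The two-headnode setup described above can be sketched as gmond.conf
fragments. The hostnames (head1/head2.example.com) and the port are
illustrative assumptions, not taken from this thread:

```
/* gmond.conf on every monitored node: measure locally and send the
   metrics to BOTH headnodes (unicast). Hostnames are hypothetical;
   8649 is gmond's conventional port. */
udp_send_channel {
  host = head1.example.com
  port = 8649
}
udp_send_channel {
  host = head2.example.com
  port = 8649
}

/* gmond.conf on the two headnode "aggregators": receive metrics from
   every node and answer gmetad's TCP queries with the cluster XML. */
udp_recv_channel {
  port = 8649
}
tcp_accept_channel {
  port = 8649
}
```

With this layout either headnode can serve a gmetad query for the whole
cluster, so losing one aggregator does not blind the monitoring.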
> 'gmond' will expire hosts if your gmond.conf has a non-zero 'host_dmax'
> entry (see http://linux.die.net/man/5/gmond.conf, among others).
>
> 'gmetad' is an entirely different beast from gmond; sometimes I think it
> was written by a completely different team. It polls your gmonds, writes
> the numeric metric values to RRDtool files, and responds to queries for
> (subsets of) metrics so front-ends can present them. It has *no*
> relationship with your udp_send_channel or udp_recv_channel; also, it has
> almost no (AFAIK) relationship to your network infrastructure -- it
> doesn't reverse-lookup addresses, for example.
>
> On the other hand, it does combine all the metrics for a cluster into a
> long-term in-memory data structure, and then combines those into a single
> 'grid-level' data structure. In gmetad, metrics (including a
> last-heard-from metric ('RECORDED') for a host) can expire, but hosts just
> go 'down'; they never go away.
>
> So: if you haven't set a host_dmax, you have to stop gmetad, restart every
> gmond that the gmetad could talk to (i.e., everything on the data_source
> line), then start gmetad. In your case, there's only one gmond that gmetad
> should talk to, so simplify your life by removing the rest from your
> data_source line. I'd set host_dmax, too, but that's a matter of taste.
>
> -- ReC
>
> On Feb 26, 2010, at 12:22 PM, Cameron Spitzer wrote:
> >
> > I was able to remove the "dead" host (that isn't really dead) from the
> > overview display.
> > I had to kill all gmond's everywhere, and the gmetad.
> > Then I removed the rrd files for the "dead" host from gmetad's rrds
> > directory, and the rrd directory itself.
> > Then I removed the "dead" host's IP address from gmetad.conf.
> > Then I brought up all the gmonds (except the "dead" one) and then the
> > gmetad.
> > Apparently, these steps will have to be added to our failover procedure.
> >
> > Martin Knoblauch wrote:
> >>
> >>> ...
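For reference, host_dmax is set in the globals section of gmond.conf; a
minimal sketch (the one-hour value is an arbitrary example, not a
recommendation from this thread):

```
/* gmond.conf: drop a host from gmond's in-memory table after it has
   been silent for an hour. 0 (the default) means never expire. */
globals {
  host_dmax = 3600 /* seconds */
}
```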
> >> Also, just to better understand the situation, what is the exact setup?
> >> Is one of the "gmond"s designated as a collector? Or do all "gmond"s
> >> carry all metrics from all hosts? Which "gmond" is queried by "gmetad"
> >> (snippet from config file)? You should telnet/nc to that "gmond" and
> >> check whether it has current metrics from "B".
> >>
> > I don't know what "designated as a collector" means.

s/collector/aggregator/ and see above.

> > Nor do I know how to control which gmonds carry all metrics from which
> > hosts. There is only one udp_send_channel in gmond.conf, and the host in
> > there is the one running gmetad.
> > My /etc/ganglia/gmetad.conf file has only one line in it: data_source
> > "clustername" followed by a list of IP addresses of all the gmond hosts.

Most likely this is your problem, as not all of your gmonds have all the
data -- only the one on your gmetad host does.

> > (My original understanding was the gmetad queries each gmond, or the
> > gmonds all report to the gmetad.
> > So I just listed all the IP addresses there. But now it seems the flow
> > is more complex than that.)

As Rick said: gmetad tries to query the first host in the list. Only if
that one does not answer will it try the next.

> > I don't have a manpage for gmetad.conf, so I just guessed what to put
> > in there from the sample file.
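Following Rick's advice above, the fix is a data_source entry naming only
the gmond that actually holds all the cluster's metrics. A sketch, with a
hypothetical hostname:

```
# /etc/ganglia/gmetad.conf
# List only gmond(s) that hold the *full* metric set for the cluster.
# gmetad polls the first entry; later entries are used only as failover.
data_source "clustername" gmetad-host.example.com:8649
```

You can verify that this gmond really has every host's metrics by dumping
its XML, e.g. with `nc gmetad-host.example.com 8649`, as Martin suggests
above.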