Well, one cause of the confusion is your /etc/ganglia/gmetad.conf data_source entry. It should *only* have the addresses of gmonds that collect all metrics for a cluster, and only one of your gmonds is doing that.
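As a rough illustration, the trimmed-down entry might look like the fragment below. The cluster name "clustername" is the placeholder from the quoted message, 192.0.2.10 is a hypothetical address standing in for the one gmond that receives everything, and 8649 is assumed to be the port of its tcp_accept_channel (the usual default). Adjust all three to match your setup.

```
# /etc/ganglia/gmetad.conf
# One data_source line per cluster, listing only gmonds that hold
# the whole cluster's metrics. Address and port here are placeholders.
data_source "clustername" 192.0.2.10:8649
```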
The Ganglia architecture can be very confusing. A 'gmond' has 3 tasks, and all but one of yours are only doing one of them:

* Measure things about the local host and send them to the 'udp_send_channel'.
* Receive measurements from any gmond (even itself) or gmetric on the 'udp_recv_channel' and put them in a local data structure, which is basically a set (hash) of hosts with a set of current metrics per host. This is the step that resolves addresses to names.
* Answer requests from gmetad for the whole cluster's metrics (it does this on the 'tcp_accept_channel'). Gmond just serializes the whole metrics data structure into an XML document as the reply.

If you have all your gmonds sending to one unicast address, only one of your gmonds *has* all the metrics for that cluster. That's what Martin called "designated as a collector." In that case, your data_source line should only include that gmond (host). Adding the others can only cause problems -- if the first gmond fails, your gmetad will contact the second one in the list, and that won't actually have any metrics on it, since no one (including itself) is sending it any. All your nodes will (gradually, as timeouts expire) appear to be down.

'gmond' will expire hosts if your gmond.conf has a non-zero 'host_dmax' entry (see http://linux.die.net/man/5/gmond.conf, among others).

'gmetad' is an entirely different beast from gmond; sometimes I think it was written by a completely different team. It polls your gmonds, writes the numeric metric values to RRDtool files, and responds to queries for (subsets of) metrics so front-ends can present them. It has *no* relationship with your udp_send_channel or udp_recv_channel; also, it has almost no (AFAIK) relationship to your network infrastructure -- it doesn't reverse-lookup addresses, for example. On the other hand, it does combine all the metrics for a cluster into a long-term in-memory data structure, and then combines those into a single 'grid-level' data structure.
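To see which of your gmonds actually holds the cluster's metrics, you can read the XML reply described above and count what is in it. The following is only a minimal sketch: the function names `fetch_gmond_xml` and `metrics_per_host` are made up for illustration, port 8649 is assumed to be where the tcp_accept_channel listens, and the parsing assumes the reply uses HOST elements (with a NAME attribute) containing METRIC elements.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump a gmond writes on its tcp_accept_channel.

    The port is an assumption (8649 is the usual default); gmond is
    expected to close the connection once the dump has been sent.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # connection closed: the dump is complete
                break
            chunks.append(data)
    return b"".join(chunks)


def metrics_per_host(xml_bytes):
    """Map each HOST NAME in the dump to its number of METRIC elements."""
    root = ET.fromstring(xml_bytes)
    return {
        host.get("NAME"): len(host.findall("METRIC"))
        for host in root.iter("HOST")
    }
```

Calling `metrics_per_host(fetch_gmond_xml("192.0.2.10"))` (the address is a placeholder) against each gmond in turn should show the difference: the gmond that everyone sends to should list every host in the cluster, while a gmond that nobody sends to should come back with few or no hosts.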
In gmetad, metrics (including a last-heard-from metric ('RECORDED') for a host) can expire, but hosts just go 'down'; they never go away.

So: if you haven't set a host_dmax, you have to stop gmetad, restart every gmond that the gmetad could talk to (i.e., everything on the data_source line), and then start gmetad again. In your case, there's only one gmond that gmetad should talk to, so simplify your life by removing the rest from your data_source line. I'd set host_dmax, too, but that's a matter of taste.

-- ReC

On Feb 26, 2010, at 12:22 PM, Cameron Spitzer wrote:

> I was able to remove the "dead" host (that isn't really dead) from the
> overview display.
> I had to kill all gmond's everywhere, and the gmetad.
> Then I removed the rrd files for the "dead" host from gmetad's rrds directory,
> and the rrd directory itself.
> Then I removed the "dead" host's IP address from gmetad.conf.
> Then I brought up all the gmonds (except the "dead" one) and then the gmetad.
> Apparently, these steps will have to be added to our failover procedure.
>
> Martin Knoblauch wrote:
>>
>>> ...
>>
>> Also, just to better understand the situation, what is the exact setup? Is
>> one of the "gmond"s designated as a collector? Or do all "gmond"s carry all
>> metrics from all hosts? Which "gmond" is queried by "gmetad" (snippet from
>> config file)? You should telnet/nc to that "gmond" and check whether it has
>> current metrics from "B".
>>
> I don't know what "designated as a collector" means.
> Nor do I know how to control which gmonds carry all metrics from which hosts.
> There is only one udp_send_channel in gmond.conf, and the host in there is
> the one running gmetad.
> My /etc/ganglia/gmetad.conf file has only one line in it. data_source
> "clustername" followed by a list of IP addresses of all the gmond hosts.
> (My original understanding was the gmetad queries each gmond, or the gmonds
> all report to the gmetad.
> So I just listed all the IP addresses there.
> But now it seems the flow is more complex than that.)
> I don't have a manpage for gmetad.conf, so I just guessed what to put in
> there from the sample file.
>
> -Cameron

