Rick Cobb wrote:
> Well, one cause of the confusion is your /etc/ganglia/gmetad.conf data_source entry. It should *only* have the address of gmonds that collect all metrics for a cluster, and only one of your gmonds is doing that.
>
> The Ganglia architecture can be very confusing.

Not so much confusing, as undocumented in any useful way. I've been through the manpages and what installation instructions I could find, and the comments in the sample .conf files. I didn't find and never would have guessed the information in the two paragraphs below. Maybe it's in one of the research papers.

It reminds me of the situation with Drupal, which has tons of completely unorganized "hints and tips" documentation, and two or three failed attempts at an overview, but nothing you can just read and make sense of. Over time, the non-documentation becomes a kind of placeholder, preventing better docs from getting written, as the only people who know the product well enough to write anything better are also the only people who don't need anything better: deadlock.

> A 'gmond' has 3 tasks, and all but one of yours are only doing one of them:
>
> * Measure things about the local host and send them to the 'udp_send_channel'.
> * Receive measurements from any gmond (even itself) or gmetric on the 'udp_recv_channel' and put them in a local data structure, which is basically a set (hash) of hosts with a set of current metrics per host. This is the step that resolves addresses to names.
> * Answer requests from gmetad for the whole cluster's metrics. (It does this on the 'tcp_accept_channel'.) Gmond just serializes the whole metrics data structure into an XML document as the reply.
>
> If you have all your gmonds sending to one unicast address, only one of your gmonds *has* all the metrics for that cluster. That's what Martin called "designated as a collector." In that case, your data_source line should only include that gmond (host). Adding the others can only cause problems -- if the first gmond fails, your gmetad will contact the second one in the list, and that won't actually have any metrics on it, since no one (including itself) is sending it any. All your nodes will (gradually, as timeouts expire) appear to be down.
>
> 'gmond' will expire hosts if your gmond.conf has a non-zero 'host_dmax' entry (see http://linux.die.net/man/5/gmond.conf, among others).

Anyhow, I'll try designating one collector and see if things work better. Thanks.

-Cameron
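
For anyone finding this thread later, here is a toy Python model of the failover problem Rick describes. It is only a sketch: ToyGmond, poll, hosts_seen and the node names are made up for illustration, and the XML layout is a simplification, not gmond's real output format. It shows why a data_source list that includes a non-collector gmond reports no hosts once the collector goes away.

```python
import xml.etree.ElementTree as ET


class ToyGmond:
    """Toy stand-in for a gmond; names and structure are illustrative only."""

    def __init__(self, name):
        self.name = name
        self.hosts = {}  # host name -> {metric name: value}

    def receive(self, host, metric, value):
        # Models the udp_recv_channel: keep the latest value per host/metric.
        self.hosts.setdefault(host, {})[metric] = value

    def dump_xml(self):
        # Models the tcp_accept_channel reply: serialize everything held.
        cluster = ET.Element("CLUSTER")
        for host, metrics in sorted(self.hosts.items()):
            h = ET.SubElement(cluster, "HOST", NAME=host)
            for metric, value in sorted(metrics.items()):
                ET.SubElement(h, "METRIC", NAME=metric, VAL=str(value))
        return ET.tostring(cluster, encoding="unicode")


def poll(data_source, up):
    """Toy gmetad: ask the first gmond in data_source that is up."""
    for gmond in data_source:
        if up[gmond.name]:
            return gmond.dump_xml()
    return None


def hosts_seen(xml_text):
    # List the host names present in one XML reply.
    return [h.get("NAME") for h in ET.fromstring(xml_text).iter("HOST")]


collector = ToyGmond("node1")
other = ToyGmond("node2")

# Every node unicasts to node1 only, so node2 never receives anything.
for host in ("node1", "node2", "node3"):
    collector.receive(host, "load_one", 0.5)

data_source = [collector, other]
healthy = hosts_seen(poll(data_source, {"node1": True, "node2": True}))
failover = hosts_seen(poll(data_source, {"node1": False, "node2": True}))
print(healthy)
print(failover)
```

With node1 up, the poll sees all three hosts. With node1 down, the poll falls through to node2, which answers with an empty cluster, so every host looks like it has vanished.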
_______________________________________________ Ganglia-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/ganglia-general

