Rick Cobb wrote:
Well, one cause of the confusion is your /etc/ganglia/gmetad.conf data_source entry. It should list *only* the addresses of gmonds that collect all metrics for a cluster, and only one of your gmonds is doing that.

The Ganglia architecture can be very confusing.
Not so much confusing, as undocumented in any useful way.
I've been through the manpages and what installation instructions
I could find, and the comments in the sample .conf files.
I didn't find and never would have guessed the
information in the two paragraphs below.
Maybe it's in one of the research papers.

A 'gmond' has three tasks, and all but one of yours are doing only one of them:
* Measure things about the local host and send them to the 'udp_send_channel'.
* Receive measurements from any gmond (even itself) or gmetric on the 'udp_recv_channel' and put them in a local data structure, which is basically a set (hash) of hosts with a set of current metrics per host. This is the step that resolves addresses to names.
* Answer requests from gmetad for the whole cluster's metrics (it does this on the 'tcp_accept_channel'). Gmond just serializes the whole metrics data structure into an XML document as the reply.
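The second and third tasks can be pictured with a toy model. This is only a sketch of the idea, not gmond's actual code: the class name ToyGmond, its methods, and the host names are made up for illustration, and the XML is a heavily simplified stand-in for what a real gmond emits.

```python
from xml.sax.saxutils import quoteattr


class ToyGmond:
    """Toy stand-in for gmond's in-memory state (hypothetical, simplified)."""

    def __init__(self, cluster_name):
        self.cluster_name = cluster_name
        self.hosts = {}  # host name -> {metric name -> latest value}

    def receive(self, host, metric, value):
        # Task 2: store a measurement that arrived on the receive channel.
        self.hosts.setdefault(host, {})[metric] = value

    def to_xml(self):
        # Task 3: serialize everything this daemon knows into one XML reply.
        lines = [f"<CLUSTER NAME={quoteattr(self.cluster_name)}>"]
        for host in sorted(self.hosts):
            lines.append(f"  <HOST NAME={quoteattr(host)}>")
            for name, val in sorted(self.hosts[host].items()):
                lines.append(
                    f"    <METRIC NAME={quoteattr(name)} VAL={quoteattr(str(val))}/>"
                )
            lines.append("  </HOST>")
        lines.append("</CLUSTER>")
        return "\n".join(lines)


# Every node sends to the collector; nobody sends to the idle daemon.
collector = ToyGmond("demo")
idle = ToyGmond("demo")
collector.receive("node1", "load_one", 0.42)
collector.receive("node2", "load_one", 1.10)

print(collector.to_xml())
print(idle.to_xml())
```

The idle daemon answers with an empty cluster, which is what gmetad would be told if it polled a gmond that nobody sends to.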

If you have all your gmonds sending to one unicast address, only one of your gmonds *has* all the metrics for that cluster. That's what Martin called "designated as a collector." In that case, your data_source line should only include that gmond (host). Adding the others can only cause problems -- if the first gmond fails, your gmetad will contact the second one in the list, and that one won't actually have any metrics, since no one (including itself) is sending it any. All your nodes will (gradually, as timeouts expire) appear to be down. 'gmond' will expire hosts if your gmond.conf has a non-zero 'host_dmax' entry (see http://linux.die.net/man/5/gmond.conf, among others).
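One way to check which gmond actually holds the cluster's metrics is to ask each one for its XML and list the hosts it reports. The following is a minimal sketch, assuming each gmond accepts TCP connections on the default port 8649 and writes its whole XML document before closing the connection; the function names are mine, and the host names you pass in are whatever your nodes are called.

```python
import socket
import sys
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the full XML dump a gmond sends on its TCP accept channel."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closed the connection: dump is complete
                break
            chunks.append(data)
    return b"".join(chunks)


def host_names(xml_bytes):
    """Return the sorted NAME attributes of all HOST elements in the dump."""
    root = ET.fromstring(xml_bytes)
    return sorted(h.get("NAME") for h in root.iter("HOST"))


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_gmonds.py node1 node2
    for gmond_host in sys.argv[1:]:
        names = host_names(fetch_gmond_xml(gmond_host))
        print(f"{gmond_host}: {len(names)} hosts {names}")
```

A gmond that reports every node is a collector and belongs in data_source; one that reports no hosts (or only itself) does not. With a single collector the line would look something like data_source "my cluster" collector.example.com:8649, where both the cluster name and the host name are placeholders.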
  
It reminds me of the situation with Drupal, which has tons of
completely unorganized "hints and tips" documentation,
and two or three failed attempts at an overview, but nothing
you can just read and make sense of.  Over time, the
non-documentation becomes a kind of placeholder, preventing
better docs from getting written, as the only people who know
the product well enough to be able to write anything better
are also the only people who don't need anything better: deadlock.

Anyhow, I'll try designating one collector and see if things
work better. Thanks.

-Cameron





_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general
