Re: [Ganglia-general] Can't see the other gmonds.

Steven Wagner Mon, 25 Nov 2002 18:13:52 -0800

This might be FAQ-worthy, we seem to be getting variations on this question a lot. So I'll throw in some (nearly) useless or tangential information to this post, which I've added to throughout the day while working on other projects. BTW, that pic didn't help.


So, for the record:

Monitoring cores (gmonds) currently only transmit metrics generated locally. Therefore a monitoring core needs to be run on every system that you want to monitor. All monitoring cores on the same cluster should be functionally identical, so it shouldn't matter which one the metadaemon queries.

Monitoring cores ONLY "hear" other monitoring cores that use the same multicast destination address. Note that the default mcast_ttl value (the Time To Live) is set rather low - you may want to set it higher depending on your network topology. In other words, if the furthest route between your monitoring cores passes through three routers, you'd want to use an mcast_ttl value of 4 on all hosts in that cluster.
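
To make the TTL arithmetic concrete, here is a minimal Python sketch. The function name is made up for illustration; it only encodes the rule of thumb above (one more than the number of routers on the longest path between monitoring cores).

```python
def min_mcast_ttl(router_hops: int) -> int:
    """Smallest TTL that lets a multicast packet cross `router_hops` routers.

    Each router decrements the TTL by one and stops forwarding once it
    runs out, so the packet needs one more than the number of routers.
    """
    if router_hops < 0:
        raise ValueError("router_hops must be non-negative")
    return router_hops + 1


# The longest path in the example above crosses three routers.
print(min_mcast_ttl(3))  # prints 4
```

Whatever number you arrive at, use the same mcast_ttl on every host in the cluster.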

Network equipment also has to be configured to pass multicast traffic. Funny network stuff can result in a situation like this:

* Your network geek has added a rule to drop multicast traffic between subnets on your MegaRouter MPF-9000XL core router.
* Your cluster is spread evenly (let's say 150 hosts apiece) across three subnets, all hanging off that router.
* Your gmetad config points to one member of each subnet for your cluster's data source.


Here's what you would see on the web page:

150 hosts up (instead of 450). The hosts present will be from the first host to respond when gmetad polled the cluster data source.

Now let's say the first-responding monitoring core stops responding. Now you'll see something like this when gmetad finds another source (on another filtered subnet):

150 hosts up (instead of 450), and probably 150 hosts down (I can't remember how smart gmetad was when this happened to me - it may have just ignored the 150 old hosts). The 150 hosts up are the hosts responding on the second-polled monitoring core's subnet.

Fixes to this scenario are: Try increasing the TTL value, segment your cluster by subnet, or just fix the network equipment (details of which are beyond the scope of this document - hire one of the many unemployed CCNEs lining El Camino Real begging for change).
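
One way to spot this kind of filtering is to compare what each monitoring core reports. Below is a rough Python sketch, assuming you have already saved the XML dump from one gmond per subnet (for instance by telnetting to its port, as the original poster did on 8649) and that hosts show up as HOST elements with a NAME attribute. The function names and the sample data are invented for illustration.

```python
import xml.etree.ElementTree as ET


def hosts_seen(xml_text: str) -> set:
    """Return the set of host names listed in one XML dump."""
    root = ET.fromstring(xml_text)
    return {host.get("NAME") for host in root.iter("HOST")}


def compare_views(dumps: dict) -> dict:
    """Map each source to the hosts it is missing relative to all sources."""
    views = {src: hosts_seen(text) for src, text in dumps.items()}
    everyone = set().union(*views.values()) if views else set()
    return {src: everyone - seen for src, seen in views.items()}


# Hypothetical dumps from two subnets that cannot hear each other.
sample = {
    "subnet-a": '<GANGLIA_XML><CLUSTER NAME="c">'
                '<HOST NAME="a1"/><HOST NAME="a2"/>'
                '</CLUSTER></GANGLIA_XML>',
    "subnet-b": '<GANGLIA_XML><CLUSTER NAME="c">'
                '<HOST NAME="b1"/>'
                '</CLUSTER></GANGLIA_XML>',
}

for src, missing in sorted(compare_views(sample).items()):
    print(src, "is missing", sorted(missing))
```

If multicast is flowing properly, every source should be missing nothing. Non-empty sets that line up with subnet boundaries point at the network gear rather than at Ganglia.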


Moving on to web front-end problems:

Check all the paths *AND PERMISSIONS* in config.php.

Remember that the PHP scripts all query gmetad, so gmetad must be running. It listens on port 8651, so remember to test for that port if you run into weirdness.
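
If you'd rather script that port test than telnet by hand, something like the following Python sketch should do. It assumes only what is said above (gmetad listens on 8651) plus the assumption that it sends its data and then closes the connection; the function names are hypothetical.

```python
import socket


def fetch_xml(host: str, port: int = 8651, timeout: float = 5.0) -> bytes:
    """Connect, read until the peer closes the connection, return the bytes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def gmetad_answers(host: str, port: int = 8651) -> bool:
    """True if something on host:port accepts a connection and sends data."""
    try:
        return len(fetch_xml(host, port)) > 0
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling gmetad_answers("localhost") on the web server box tells you whether the PHP scripts have anything to talk to. Run it from the same machine the web front-end runs on, since that is the path that matters.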


Check all the paths *AND PERMISSIONS* in config.php.

gmetad is responsible for creating and updating the RRD files. The PHP scripts are responsible for reading and *displaying* them. If you can't see any graphs, first check to make sure they're being created and updated (the timestamps on them should be current as long as gmetad is running properly). If this is all true, then your PHP scripts (specifically graph.php) can't run the RRD generation routines (due to a bad path or a broken rrdtool installation) or can't access the RRD file (are the RRD files and directories readable by the web user/group?).

Finally - and I can't stress this enough - you absolutely must not forget to scan all parts of your network for feral monkeys. Monkeys are VERY DANGEROUS in data center environments.

Also, don't forget to check all the paths *AND PERMISSIONS* in config.php. gmetad should be depositing RRDs in the same location that the PHP scripts are reading them from. gmetad should be able to read/write to that directory tree, and the web server process should be able to read from it.
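
The timestamp and permission checks can be scripted too. Here is a minimal Python sketch; the directory you pass in is whatever your own config points at, and the 300-second staleness threshold is an arbitrary assumption of mine, not a Ganglia setting. Run it as the web server user so the readability test means something.

```python
import os
import time


def check_rrds(rrd_root: str, max_age_seconds: float = 300.0) -> list:
    """Return (path, problem) pairs for RRD files that look stale or unreadable."""
    problems = []
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(rrd_root):
        for name in filenames:
            if not name.endswith(".rrd"):
                continue
            path = os.path.join(dirpath, name)
            if not os.access(path, os.R_OK):
                # The web user cannot read this file at all.
                problems.append((path, "not readable by this user"))
            elif now - os.path.getmtime(path) > max_age_seconds:
                # gmetad has not touched this file for a while.
                problems.append((path, "not updated recently"))
    return problems
```

An empty list means the files are there, fresh, and readable, which points the finger at the PHP side (paths or the rrdtool installation). If no .rrd files are found at all the list is also empty, so check that the directory is the right one first.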

Hopefully this will solve a few people's problems, or at least help point them in the wrong direction, or at least provide some laughs.


Good luck! :)

Ollisl wrote:

Hi,

I can't see my Host A or Host B (look at the pic:
http://users.evitech.fi/~ollisl/ganglia/testing.bmp ) in the web-page
created by my monitoring computer(It has the gmond, gmetad and web-
frontend).

I've modified /etc/gmond.conf in all of the three computers to trust
both of the other two IP-addresses. Also /etc/gmetad.conf needed some
tweaking too.

Do I need to do something to /usr/local/apache/htdocs/ganglia-
webfrontend/conf.php ?

Also, I still can't see any pics on the web-page. All the RRD-files
seem to be in place. And I've tried telnetting the 8649, it works...

I'm pretty much out of ideas...

What I want is to see if I can monitor the whole beowulf cluster with
only the master node having gmond. But first I want to get the ganglia
to work exactly the way I want... I'm pretty much a newbie in Linux and
beowulf(I've only built three before) and never seen ganglia before, so
I knew there would be troubles, but I've to say that I've seen better
documents for software than Ganglia's...  Well, there is no spo... I
mean, free lunch ;)

Thanks
-Olli




