This might be FAQ-worthy, since we seem to be getting variations on this
question a lot. So I'll throw some (nearly) useless or tangential information into
this post, which I've added to throughout the day while working on other
projects. BTW, that pic didn't help.
So, for the record:
Monitoring cores (gmonds) currently only transmit metrics generated
locally. Therefore a monitoring core needs to be run on every system that
you want to monitor. All monitoring cores on the same cluster should be
functionally identical, so it shouldn't matter which one the metadaemon
queries.
Monitoring cores ONLY "hear" other monitoring cores that use the same
multicast destination address. Note that the default mcast_ttl value (the
Time To Live) is set rather low - you may want to set it higher depending
on your network topology. In other words, if the furthest route between
your monitoring cores passes through three routers, you'd want to use an
mcast_ttl value of 4 on all hosts in that cluster.
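To make the TTL arithmetic concrete, here's a minimal Python sketch (nothing that ships with Ganglia). Each router a multicast packet crosses knocks one off its TTL, and the packet is dropped once the TTL runs out, so you want one more than the number of routers on the worst-case path. The function names are made up for illustration, and the second function only shows how a sender would set that TTL on a plain UDP socket.

```python
import socket
import struct


def min_mcast_ttl(router_hops: int) -> int:
    """Smallest TTL that survives `router_hops` routers (each one decrements the TTL by 1)."""
    if router_hops < 0:
        raise ValueError("router_hops must be >= 0")
    return router_hops + 1


def open_mcast_sender(ttl: int) -> socket.socket:
    """Return a UDP socket whose outgoing multicast packets carry the given TTL."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # IP_MULTICAST_TTL applies to multicast destinations only.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("B", ttl))
    return sock
```

For the three-router example above, min_mcast_ttl(3) gives 4, which is the value you'd put in mcast_ttl on every host in the cluster.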
Network equipment also has to be configured to pass multicast traffic.
Funny network stuff can result in a situation like this:
* Your network geek has added a rule to drop multicast traffic between
subnets on your MegaRouter MPF-9000XL core router.
* Your cluster is spread evenly (let's say 150 hosts apiece) across three
subnets, all hanging off that router.
* Your gmetad config points to one member of each subnet for your
cluster's data source.
Here's what you would see on the web page:
150 hosts up (instead of 450). The hosts shown will be the ones on the
subnet of whichever host responded first when gmetad polled the cluster's
data source.
Now let's say the first-responding monitoring core doesn't respond. In that
case you'll see something like this when gmetad finds another source (on
another filtered subnet):
150 hosts up (instead of 450), and probably 150 hosts down (I can't
remember how smart gmetad was when this happened to me - it may have just
ignored the 150 old hosts). The 150 hosts up are the hosts responding on
the second-polled monitoring core's subnet.
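If the 150-versus-450 business is hard to picture, here's a toy model in Python. It is only a sketch of the reasoning, not how gmond or gmetad is actually written: every name in it is hypothetical, and it assumes gmetad simply takes the answer from the first data-source host that responds.

```python
from typing import Dict, FrozenSet, List, Set


def build_views(subnets: Dict[str, List[str]],
                multicast_crosses_subnets: bool) -> Dict[str, Set[str]]:
    """Map each host to the set of hosts its gmond has heard from."""
    everyone = {h for hosts in subnets.values() for h in hosts}
    views: Dict[str, Set[str]] = {}
    for hosts in subnets.values():
        # With multicast filtered at the router, a gmond only hears its own subnet.
        heard = everyone if multicast_crosses_subnets else set(hosts)
        for h in hosts:
            views[h] = set(heard)
    return views


def poll(data_source: List[str], views: Dict[str, Set[str]],
         down: FrozenSet[str] = frozenset()) -> Set[str]:
    """Toy gmetad: return the view of the first data-source host that answers."""
    for host in data_source:
        if host not in down:
            return views[host]
    return set()


if __name__ == "__main__":
    subnets = {s: [f"{s}-node{i}" for i in range(150)] for s in ("a", "b", "c")}
    source = ["a-node0", "b-node0", "c-node0"]
    filtered = build_views(subnets, multicast_crosses_subnets=False)
    print(len(poll(source, filtered)))                                # 150
    print(len(poll(source, build_views(subnets, True))))              # 450
    print(sorted(poll(source, filtered, frozenset({"a-node0"})))[0])  # b-node0
```

With multicast filtered, the poll only ever returns one subnet's worth of hosts, and which subnet you get depends on which data-source host answers.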
Fixes for this scenario: try increasing the TTL value, segment your
cluster by subnet, or just fix the network equipment (details of which are
beyond the scope of this document - hire one of the many unemployed CCNEs
lining El Camino Real begging for change).
Moving on to web front-end problems:
Check all the paths *AND PERMISSIONS* in conf.php.
Remember that the PHP scripts all query gmetad, so gmetad must be running.
It listens on port 8651, so remember to test for that port if you run
into weirdness.
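If you'd rather script the port test than telnet by hand, something like the following Python sketch should do. It assumes gmetad behaves the way gmond does when you telnet to 8649, i.e. it dumps its XML and closes the connection. The function name and the defaults other than the port number are my own.

```python
import socket


def fetch_gmetad_xml(host: str = "localhost", port: int = 8651,
                     timeout: float = 5.0) -> bytes:
    """Connect to the given port and return everything sent before the peer closes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # empty read means the other side closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch_gmetad_xml() on the web server box should hand back a blob of XML. If it raises an OSError (connection refused, timeout), gmetad isn't running or isn't reachable on that port, and the web front-end will have nothing to show.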
Check all the paths *AND PERMISSIONS* in conf.php.
gmetad is responsible for creating and updating the RRD files. The PHP
scripts are responsible for reading and *displaying* them. If you can't
see any graphs, first check to make sure they're being created and updated
(the timestamps on them should be current as long as gmetad is running
properly). If they are, then your PHP scripts (specifically graph.php)
either can't run rrdtool to generate the graphs (due to a bad path or a
broken rrdtool installation) or can't access the RRD files (are the RRD
files and directories readable by the web user/group?).
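The timestamp and readability checks are easy to automate. Here's a rough Python sketch, assuming your RRDs end in .rrd and live under one directory tree (use whatever path your gmetad and conf.php point at). The five-minute staleness threshold is an arbitrary guess on my part, and the function name is made up.

```python
import os
import time
from typing import List, Optional, Tuple


def check_rrds(root: str, max_age: float = 300.0,
               now: Optional[float] = None) -> List[Tuple[str, str]]:
    """Return (path, problem) pairs for .rrd files under root that look stale or unreadable."""
    now = time.time() if now is None else now
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".rrd"):
                continue
            path = os.path.join(dirpath, name)
            # os.access tests the user running this script, so run it as the web user.
            if not os.access(path, os.R_OK):
                problems.append((path, "not readable by this user"))
            age = now - os.stat(path).st_mtime
            if age > max_age:
                problems.append((path, f"not updated for {age:.0f}s"))
    return problems
```

Run it as the web server's user so the readability test means something. An empty list back means the files are fresh and readable, which points the finger at graph.php or rrdtool instead.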
Finally - and I can't stress this enough - you absolutely must not forget
to scan all parts of your network for feral monkeys. Monkeys are VERY
DANGEROUS in data center environments.
Also, don't forget to check all the paths *AND PERMISSIONS* in conf.php.
gmetad should be depositing RRDs in the same location that the PHP
scripts are reading them from. gmetad should be able to read/write to that
directory tree, and the web server process should be able to read from it.
Hopefully this will solve a few people's problems, or at least help point
them in the wrong direction, or at least provide some laughs.
Good luck! :)
Ollisl wrote:
Hi,
I can't see my Host A or Host B (look at the pic:
http://users.evitech.fi/~ollisl/ganglia/testing.bmp ) in the web-page
created by my monitoring computer (it has the gmond, gmetad and
web-frontend).
I've modified /etc/gmond.conf in all of the three computers to trust
both of the other two IP-addresses. Also /etc/gmetad.conf needed some
tweaking too.
Do I need to do something to
/usr/local/apache/htdocs/ganglia-webfrontend/conf.php ?
Also, I still can't see any pics on the web-page. All the RRD-files
seem to be in place. And I've tried telnetting the 8649, it works...
I'm pretty much out of ideas...
What I want is to see if I can monitor the whole beowulf cluster with
only the master node having gmond. But first I want to get the ganglia
to work exactly the way I want... I'm pretty much a newbie in Linux and
beowulf (I've only built three before) and have never seen ganglia before,
so I knew there would be troubles, but I have to say that I've seen better
documents for software than Ganglia's... Well, there is no spo... I
mean, free lunch ;)
Thanks
-Olli
_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general