Kevin James Flasch wrote:
*  Check some of the gmond-only nodes' XML port output.  How many nodes
do they see?  Do they see 289-295 nodes or just their own output?


I believe you're referring to the mcast_port (by default 8649). When I telnet
to it, I see what appears to be all/most of them.
(`telnet localhost 8649 | grep "<HOST " | wc -l`  gives me 300).

wc -l's a good start, but you should also check each host's REPORTED timestamp. If the timestamps are fairly close to one another and close to NOW(), then you know that the monitoring core you're polling is receiving packets from all 300 hosts often enough for them to be considered "up" - the REPORTED attribute is updated every time any metric is received from a given host.
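
If it helps, here's a rough sketch of that timestamp check (assuming nc is installed - piping telnet the same way also works; the host and port are just the defaults):

  # Print how many seconds ago each host last reported, oldest last.
  now=$(date +%s)
  nc localhost 8649 \
    | sed -n 's/.*REPORTED="\([0-9]*\)".*/\1/p' \
    | awk -v now="$now" '{ print now - $1 }' \
    | sort -n | tail

Any host that hasn't been heard from in a while will stick out at the bottom of that list.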

They are not all in the same subnet; the nodes are spread across two subnets.

They are all physically connected to the same switch.

There is no firewalling on the master that blocks ports or drops packets. The idea that there is something wrong with the network connection seems reasonable, but I can't see anything out of the ordinary about it, and there have been no other network problems with the connection.

So far so good...

*  Consider polling a different set of monitoring cores as your gmetad
cluster data source.


I'm not sure I follow. Can you explain or give an example, please?

Sure. gmetad has a configuration file, /etc/gmetad.conf by default, that specifies data sources. gmetad considers each of these data sources to be a different cluster. You can specify a polling frequency and a list of IP(:port) combos for each cluster. These will be checked from left to right.

Example:

data_source mycluster 15 10.0.0.2 10.0.0.3:2463 10.0.0.4 10.0.0.5
data_source anothercluster 60 192.168.7.15

In order to debug gmetad, it helps to "see what the killer sees" by telnetting to each of these sources in the same order from the node running the metadaemon. This should at least point you at the misbehaving monitoring core.
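
A quick way to walk the list in the same order gmetad does (purely illustrative - the addresses are the ones from the example above, and it assumes nc is installed):

  # Count how many <HOST> entries each data source hands back.
  for src in 10.0.0.2:8649 10.0.0.3:2463 10.0.0.4:8649 10.0.0.5:8649; do
      echo "== $src =="
      nc "${src%:*}" "${src#*:}" | grep -c "<HOST "
  done

A source that returns far fewer hosts than the others, or that times out, is the one to look at.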

It may well be that the local monitoring core on the front-end is the one that's misconfigured somehow.

*  Run a monitoring core in debug mode.  You will see what metrics it's
sending and what metrics it's hearing on the multicast channel.


Hmm.. I'm not sure what the output of that should look like on a node in a
functioning Ganglia environment. It seems like it's communicating somewhat
with the other nodes, but most of the entries seem to be about itself. One of
the entries mentioning another machine looks like this:
>
> Is that less data than typical?


On a 300-node Ganglia cluster you should be seeing at least load average metrics being multicast from every node every 15-60 seconds, plus the various other metrics according to their thresholds. Regardless, you should see more than a packet every few seconds.

In fact if you didn't find it necessary to redirect the debug output to a file, you're probably not getting all the packets. :)
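
For reference, this is roughly what I mean (the exact flag spelling and debug level may vary between versions - check gmond --help):

  # Run gmond with verbose debugging and capture everything it prints.
  gmond --debug=9 > /tmp/gmond.debug 2>&1 &
  sleep 60
  wc -l /tmp/gmond.debug    # a healthy 300-node cluster fills this quickly

If that file is only a handful of lines long after a minute, the core simply isn't hearing the multicast traffic it should be.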

*  tcpdump.  Limit it to just the multicast IP or port and you should be
able to get all Ganglia-related traffic that the running host can hear.


That's what I did before to check the frequency of Ganglia traffic. Most of
the traffic is the machine itself broadcasting 8-byte (occasionally 12-byte)
UDP packets on the multicast channel. Once in a while an 8-byte UDP packet
from another node comes in on the channel (roughly one for every 5-15 packets
this machine originates).

See above: you should be getting them far more often than once in a while. It would be interesting to check two monitoring cores to see whether they're receiving one another's packets, what the ratio of dropped packets to total packets sent is, and whether any of the packets that make it through have anything in common with one another. Might give you some clues if nothing else does.
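
For the two-node comparison, something along these lines on each node should do (the default port 8649 is assumed, and the interface name and 10.0.0.2 address are placeholders from the example above - substitute whatever your gmond configuration actually uses):

  # Capture 200 multicast packets that did NOT originate from this node.
  tcpdump -n -i eth0 -c 200 udp port 8649 and not src host 10.0.0.2

Comparing the source addresses captured on the two nodes should tell you whether they hear each other at all, and whether the drops correlate with particular senders.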

I know, it's not much, but it's something.


Thanks so much for your help. I suppose this only makes me think that there is
some networking issue, hardware or software, but I have no idea what it is at
this point.

Well, the only thing harder than troubleshooting your own hardware is troubleshooting someone else's. :)

