Kevin James Flasch wrote:
*  Check some of the gmond-only nodes' XML port output.  How many nodes
do they see?  Do they see 289-295 nodes or just their own output?


I believe you're referring to the mcast_port (by default 8649). When I telnet
to it, I see what appears to be all/most of them.
(`telnet localhost 8649 | grep "<HOST " | wc -l`  gives me 300).

wc -l's a good start, but you should also check each host's REPORTED timestamp. If the timestamps are fairly close to one another and close to NOW(), then you know that the monitoring core you're polling is receiving packets from all 300 hosts often enough for them to be considered "up" - the REPORTED attribute is updated every time any metric is received from a given host.
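
If it helps, here's a rough sketch of that timestamp check (assuming nc is installed - piping telnet the same way also works; the host and port are just the defaults):

  # Print how many seconds ago each host last reported, oldest last.
  now=$(date +%s)
  nc localhost 8649 \
    | sed -n 's/.*REPORTED="\([0-9]*\)".*/\1/p' \
    | awk -v now="$now" '{ print now - $1 }' \
    | sort -n | tail

Any host that hasn't been heard from in a while will stick out at the bottom of that list.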

They are not all in the same subnet; the nodes are spread across two subnets.

They are all physically connected to the same switch.

There is no firewalling on the master that blocks ports or drops packets. The idea that there is something wrong with the network connection seems reasonable, but I can't see anything out of the ordinary about it, and there have been no other network problems with the connection.

So far so good...

*  Consider polling a different set of monitoring cores as your gmetad
cluster data source.


I'm not sure I follow. Can you explain or give an example, please?

Sure. gmetad has a configuration file, /etc/gmetad.conf by default, that specifies data sources. gmetad considers each of these data sources to be a different cluster. You can specify a polling frequency and a list of IP(:port) combos for each cluster. These will be checked from left to right.

Example:

data_source mycluster 15 10.0.0.2 10.0.0.3:2463 10.0.0.4 10.0.0.5
data_source anothercluster 60 192.168.7.15

In order to debug gmetad, it helps to "see what the killer sees" by telnetting to each of these sources in the same order from the node running the metadaemon. This should at least point you at the misbehaving monitoring core.
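
A quick way to walk the list in the same order gmetad does (purely illustrative - the addresses are the ones from the example above, and it assumes nc is installed):

  # Count how many <HOST> entries each data source hands back.
  for src in 10.0.0.2:8649 10.0.0.3:2463 10.0.0.4:8649 10.0.0.5:8649; do
      echo "== $src =="
      nc "${src%:*}" "${src#*:}" | grep -c "<HOST "
  done

A source that returns far fewer hosts than the others, or that times out, is the one to look at.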

It may well be that the local monitoring core on the front-end is the one that's misconfigured somehow.

*  Run a monitoring core in debug mode.  You will see what metrics it's
sending and what metrics it's hearing on the multicast channel.


Hmm.. I'm not sure what the output of that should look like on a node in a
functioning Ganglia environment. It seems like it's communicating somewhat
with the other nodes, but most of the entries seem to be about itself. One of
the entries mentioning another machine looks like this:
>
> Is that less data than typical?


On a 300-node Ganglia cluster you should be seeing at least load average metrics being multicast from every node every 15-60 seconds, plus the various other metrics according to their thresholds. Regardless, you should see more than a packet every few seconds.

In fact if you didn't find it necessary to redirect the debug output to a file, you're probably not getting all the packets. :)
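
For reference, this is roughly what I mean (the exact flag spelling and debug level may vary between versions - check gmond --help):

  # Run gmond with verbose debugging and capture everything it prints.
  gmond --debug=9 > /tmp/gmond.debug 2>&1 &
  sleep 60
  wc -l /tmp/gmond.debug    # a healthy 300-node cluster fills this quickly

If that file is only a handful of lines long after a minute, the core simply isn't hearing the multicast traffic it should be.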

*  tcpdump.  Limit it to just the multicast IP or port and you should be
able to get all Ganglia-related traffic that the running host can hear.


That's what I did before to check the frequency of Ganglia traffic. Most of
the traffic is the machine itself broadcasting 8-byte (occasionally 12-byte)
UDP packets on the multicast channel. Once in a while an 8-byte UDP packet
from another node comes in on the channel (roughly one for every 5-15 packets
this machine originates).

See above: you should be getting them far more often than once in a while. It would be interesting to check two monitoring cores to see whether they're receiving one another's packets, what the ratio of dropped packets to total packets sent is, and whether any of the packets that make it through have anything in common with one another. Might give you some clues if nothing else does.
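
For the two-node comparison, something along these lines on each node should do (the default port 8649 is assumed, and the interface name and 10.0.0.2 address are placeholders from the example above - substitute whatever your gmond configuration actually uses):

  # Capture 200 multicast packets that did NOT originate from this node.
  tcpdump -n -i eth0 -c 200 udp port 8649 and not src host 10.0.0.2

Comparing the source addresses captured on the two nodes should tell you whether they hear each other at all, and whether the drops correlate with particular senders.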

I know, it's not much, but it's something.


Thanks so much for your help. I suppose this only makes me think that there is
some networking issue, hardware or software, but I have no idea what it is at
this point.

Well, the only thing harder than troubleshooting your own hardware is troubleshooting someone else's. :)

