On Mon, 30 Jun 2003, steven wagner wrote:

> Kevin James Flasch wrote:
> > Does anyone have any idea what might be the problem? Any ideas at all
> > would be very much appreciated.
>
> Hi Kevin,
>
> For what it's worth, here are some general (i.e. not
> network-appliance-specific) tips for troubleshooting Ganglia metric
> transmission issues:
>
> *  Make a note of the multicast IP that your monitoring cores are
> transmitting to (make sure they all use the same one, too :) ).

239.2.11.71 and yes, all of the hosts use the same one.

> *  Check some of the gmond-only nodes' XML port output.  How many nodes
> do they see?  Do they see 289-295 nodes or just their own output?

I believe you're referring to the xml_port (by default 8649, the same number
as mcast_port). When I telnet to it, I see what appears to be all/most of them.
(`telnet localhost 8649 | grep "<HOST " | wc -l`  gives me 300).
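
The same count can be repeated without telnet and grep. Below is a minimal
Python sketch, assuming gmond writes its XML dump and then closes the
connection when a client connects to the TCP port (the behaviour the telnet
test relies on). The function names and default arguments are illustrative
and not part of Ganglia.

```python
import socket


def fetch_gmond_xml(host="localhost", port=8649, timeout=10.0):
    """Read everything gmond writes on its XML port until it closes the socket."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # peer closed the connection: dump is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def count_hosts(xml_text):
    """Count opening <HOST ...> tags in the XML dump.

    grep | wc -l counts matching lines, so the two agree only when each
    <HOST element starts on its own line (assumed here).
    """
    return xml_text.count("<HOST ")
```

Calling `count_hosts(fetch_gmond_xml())` on a node running gmond should give
a number comparable to the shell pipeline above.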

> *  Depending on how many other nodes' traffic you see on the XML, this
> can help you track down the problem.  Are they all in the same subnet?
> Are they all physically connected to the same switch?  If everything is
> showing up, then perhaps your gmetad/gmond combo box's network
> connection or firewall configuration is to blame.

They are not all in the same subnet. There are two subnets that they reside
in.

They are all physically connected to the same switch.

There is no firewalling on the master of the sort that blocks ports or drops
packets. The idea that there is something wrong with the network connection
seems reasonable. I can't see anything unusual about it, however, and
there have been no other network problems with the connection.


> *  Make sure that the monitoring core versions don't differ too wildly
> (mixed-platform gmonds also yield inaccurate metrics due to the
> differing metric hash indices on different platforms).

They are the same all across the board. 2.5.3-1 on x86.

> *  Consider polling a different set of monitoring cores as your gmetad
> cluster data source.

I'm not sure I follow. Can you explain or give an example, please?

> *  Run a monitoring core in debug mode.  You will see what metrics it's
> sending and what metrics it's hearing on the multicast channel.

Hmm.. I'm not sure what the output of that should look like on a node in a
functioning Ganglia environment. It seems like it's communicating somewhat
with the other nodes, but most of the entries seem to be about itself. One of
the entries mentioning another machine looks like this:

1026 pre_process_node() remote_ip=129.89.200.61
pre_process_node() received a new node
pre_process_node() has set the timestamp
hash_create size = 36
hash->size is 37
pre_process_node() building custom metric hash size=16
hash_create size = 16
hash->size is 17
pre_process_node() initialized new hashes
pre_process_node() inserted data into cluster
pre_process_node() HOSTNAME =medusa-slave051.medusa.phys.uwm.edu
pre_process_node() TIMESTAMP=1057179215
pre_process_node() HASHP    =0x807dba0
pre_process_node() USER_HASHP=0x807dbb0
pre_process_node() received an old node
pre_process_node() updated the timestamp
mcast_value() mcasting heartbeat value
encoded 8 XDR bytes
mcast_listen_thread() got a 8 byte multicast message
mcast_listen_thread() received key 26
mcast_listen_thread() received metric data heartbeat

Is that less data than typical?
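
One rough way to approach that question is to tally which remote addresses
appear in the debug output over a few minutes. The sketch below is only
that: it assumes the `remote_ip=` line format shown in the excerpt above is
stable across the whole log, and the function name is made up for
illustration.

```python
import re
from collections import Counter

# Matches lines such as: "1026 pre_process_node() remote_ip=129.89.200.61"
REMOTE_IP = re.compile(
    r"pre_process_node\(\) remote_ip=(\d{1,3}(?:\.\d{1,3}){3})"
)


def count_remote_ips(debug_lines):
    """Tally how often each remote_ip appears in gmond debug output."""
    counts = Counter()
    for line in debug_lines:
        match = REMOTE_IP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the captured debug output (for example, the lines of a saved log
file) and comparing `len(counts)` against the expected number of hosts gives
a quick sense of how many peers this node is actually hearing.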

> *  tcpdump.  Limit it to just the multicast IP or port and you should be
> able to get all Ganglia-related traffic that the running host can hear.

That's what I did before to check the frequency of Ganglia traffic. Most of
the traffic is the machine itself sending 8-byte (occasionally 12-byte)
UDP packets on the multicast channel. Once in a while an 8-byte UDP packet
from another node arrives on the multicast channel (after every 5-15
packets originating from this machine).
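
The per-source counting done by eye with tcpdump can also be sketched in
Python. This is an illustration under assumptions, not a Ganglia tool: it
uses the group 239.2.11.71 and port 8649 mentioned earlier in this message,
joins on the default interface, and relies on SO_REUSEADDR to share the port
with a running gmond, which may not work on every platform. The function
names are hypothetical.

```python
import socket
import struct
import time
from collections import Counter


def open_multicast_listener(group="239.2.11.71", port=8649):
    """Join the multicast group on the default interface; return the socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Assumption: lets us bind alongside gmond if it also allows address reuse.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    mreq = struct.pack(
        "4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0")
    )
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


def tally_sources(sock, seconds=60.0):
    """Count received datagrams per source IP for roughly `seconds` seconds."""
    counts = Counter()
    deadline = time.monotonic() + seconds
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        sock.settimeout(remaining)
        try:
            _data, (src_ip, _src_port) = sock.recvfrom(2048)
        except socket.timeout:
            break
        counts[src_ip] += 1
    return counts
```

Running `tally_sources(open_multicast_listener())` for a minute on the
master and on an ordinary node, then comparing the number of distinct
sources, would show whether the master hears noticeably fewer peers.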

> I know, it's not much, but it's something.

Thanks so much for your help. I suppose this only makes me think that there is
some networking issue, hardware or software, but I have no idea what it is at
this point.

Kevin Flasch

>
> Good luck!


