Thanks Matt,

gstat works fine.  It is just that I see
my cluster nodes pop in and out.
tcpdump shows:

09:04:45.570837 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:04:45.570874 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:04:48.600781 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:04:52.640806 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:04:55.670867 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:04:56.680793 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:04:56.680845 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:02.579355 msc1.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:02.740780 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:02.740829 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:03.327730 msc2.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:04.290337 msc3.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:06.780893 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:08.532001 msc4.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:09.810775 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:09.810841 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:10.820780 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:11.830776 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:12.840790 ishow.cluster.32768 > 239.2.11.71.8649:  udp 12 (DF) [ttl 1]
09:05:12.840825 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:13.850769 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:13.850797 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:16.880785 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:23.950762 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:26.980842 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:26.980936 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:27.990801 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:29.000814 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:29.000842 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:30.010831 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:31.020794 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:33.040779 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:33.040822 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:33.040855 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:35.060797 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:37.080781 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:39.100805 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:40.110835 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:43.140786 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:44.150803 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:46.170827 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:47.180813 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:57.280914 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]


Where "ishow" is my head node, and
"msc[1234]" are the cluster nodes.
At 9:05 the cluster nodes could be
seen, but then were lost again.  I
recall once before having a switch
whose multicast settings were incorrect.
I will check that.
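
To make the "pop in and out" pattern easier to see, a capture like the one above can be summarized per host. The following is only a rough sketch: it assumes the old-style tcpdump line format shown in this message, and the names (summarize, quiet_hosts) and the 30-second threshold are made up for illustration, not anything ganglia itself uses.

```python
import re

# Matches lines in the format shown in the capture above, e.g.
# 09:05:02.579355 msc1.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
LINE_RE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2})\.(\d+)\s+(\S+)\.(\d+)\s+>\s+\S+:\s+udp\s+(\d+)"
)

# A few lines copied from the capture in this message.
SAMPLE = """\
09:05:02.579355 msc1.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:02.740780 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:02.740829 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:03.327730 msc2.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
09:05:57.280914 ishow.cluster.32768 > 239.2.11.71.8649:  udp 8 (DF) [ttl 1]
"""


def summarize(lines):
    """Return {host: {"count": n, "first": secs, "last": secs}} per sender."""
    stats = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank or unrecognized lines
        hh, mm, ss, frac, host, _port, _size = m.groups()
        t = int(hh) * 3600 + int(mm) * 60 + int(ss) + float("0." + frac)
        entry = stats.setdefault(host, {"count": 0, "first": t, "last": t})
        entry["count"] += 1
        entry["last"] = t
    return stats


def quiet_hosts(stats, threshold=30.0):
    """Hosts whose last packet is more than `threshold` seconds older than
    the newest packet in the capture (threshold is an arbitrary choice)."""
    if not stats:
        return []
    newest = max(entry["last"] for entry in stats.values())
    return sorted(h for h, e in stats.items() if newest - e["last"] > threshold)


if __name__ == "__main__":
    stats = summarize(SAMPLE.splitlines())
    for host, entry in sorted(stats.items()):
        print(host, entry["count"], "packets")
    print("quiet:", quiet_hosts(stats))
```

Feeding it the whole capture instead of SAMPLE would show how long each msc node has been silent relative to the head node.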

BTW - I wrote a webmin module for
ganglia that works on a Zaurus Palm pilot.
It starts up with the graph of the
cluster, and drop-down selection menus
for NODE and METRIC.  Once NODE and METRIC
are selected, that graph is shown.

Regards,
Joe

matt massie wrote:

joe-

gstat is a commandline tool to give you a quick look at the hosts you are monitoring. run gstat from the commandline prompt in order to see its output (gstat --help will give you more info).

as far as the graph is concerned.. it's hard to tell what is going on. the graphs are created by rrdtool from data that is stored by gmetad. if you don't see errors in the syslog for the host running gmetad, then i suspect it might be a problem with multicast support on the network. we can use the gstat commandline tool as a test.. on the host running gmetad (and gmond to listen to the multicast traffic).. run

% gstat --dead

to list all the dead hosts. if over time you see hosts pop in and out of this list then you know that multicast traffic is getting lost and ganglia thinks the host has died a horrible death.
tcpdump could also help.. running

% tcpdump net 239.2.11.71

(substitute 239.2.11.71 with whatever multicast address you are using.. 239.2.11.71 is the default)
tcpdump will list all the ganglia multicast traffic in real time.. you
should see every host you are monitoring in the list of machines.
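
where tcpdump isn't available, a similar check can be sketched as a small multicast listener. this is only an illustration, not part of ganglia: it assumes the default group 239.2.11.71 and the port 8649 seen in the tcpdump output, the function names are made up, and sharing the port with a running gmond relies on SO_REUSEADDR behaving as it does on Linux.

```python
import socket
import time

GROUP = "239.2.11.71"  # default multicast address mentioned above
PORT = 8649            # port seen in the tcpdump output (assumed here)


def membership_request(group, iface="0.0.0.0"):
    """Build the 8-byte ip_mreq value: group address + local interface."""
    return socket.inet_aton(group) + socket.inet_aton(iface)


def listen(group=GROUP, port=PORT):
    """Join the multicast group and print one line per datagram received."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Allow binding even if another process is already bound to the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(
        socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership_request(group)
    )
    while True:
        data, (addr, _src_port) = sock.recvfrom(1500)
        print(time.strftime("%H:%M:%S"), addr, len(data), "bytes")


if __name__ == "__main__":
    listen()
```

if a host never shows up in the output while it is known to be running gmond, that points at the network rather than at ganglia.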

i'm sure with a little work we'll find the source of the problem.

