Ganglia-General,

I'm trying to install Ganglia on a Linux cluster. Each node is a dual-CPU 1.4 GHz P3 system. The manager has local disks, runs the 2.4.17 kernel, and has two Ethernet interfaces (eth0 = 10.1.1.1, facing the compute nodes, and eth1 going to the outside). The compute nodes are diskless, run the 2.4.3 kernel, and each have a single Ethernet interface (10.1.1.*). The manager and compute nodes are connected by a 100baseT switch.

The installation on the manager went normally: gmond came up, and I can see the manager node with gstat (and with gmetad and the web frontend).
# gstat -a
CLUSTER INFORMATION
      Name: unspecified
     Hosts: 2
Gexec Hosts: 0
Dead Hosts: 0
 Localtime: Wed Mar 12 11:23:51 2003

CLUSTER HOSTS
Hostname                   LOAD                      CPU               Gexec
CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System, Idle]

batt001
   2 (    3/  119) [  0.86,  0.28,  0.31] [  13.8,   0.0,  12.1,  77.5] OFF


When I installed it on a compute node (batt016), gmond segfaulted on startup until I added this route,

route add -host 239.2.11.71 dev eth0

# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
239.2.11.71     *               255.255.255.255 UH    0      0        0 eth0
10.1.1.0        *               255.255.255.0   U     0      0        0 eth0
127.0.0.0       *               255.0.0.0       U     0      0        0 lo


and then gmond started up without crashing. Running gstat on the compute node now gives,

# gstat -a
CLUSTER INFORMATION
      Name: unspecified
     Hosts: 1
Gexec Hosts: 0
Dead Hosts: 1
 Localtime: Wed Mar 12 11:24:11 2003

CLUSTER HOSTS
Hostname                   LOAD                      CPU               Gexec
CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System, Idle]

batt016
   2 (    0/   33) [  0.13,  0.03,  0.00] [   0.6,   0.0,   0.0, 100.0] OFF
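
One thing I'm not sure how to rule out is whether gmond on the compute node is actually joining the multicast group. I assume I can check that with netstat -g, or from the kernel's own table, if those behave the same on the 2.4.3 kernel:

# netstat -g
# cat /proc/net/igmp

netstat -g should list the groups each interface has joined; /proc/net/igmp shows the same thing straight from the kernel (with the group addresses in hex, I think).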

Shortly after gmond starts on the compute node, that node shows up in the manager's gstat output, but with only partial information,

# gstat -a
CLUSTER INFORMATION
      Name: unspecified
     Hosts: 2
Gexec Hosts: 0
Dead Hosts: 0
 Localtime: Wed Mar 12 11:23:51 2003

CLUSTER HOSTS
Hostname                   LOAD                      CPU               Gexec
CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System, Idle]

batt001
   2 (    3/  119) [  0.86,  0.28,  0.31] [  13.8,   0.0,  12.1,  77.5] OFF
batt016
   0 (    0/    0) [  0.00,  0.00,  0.00] [   0.0,   0.0,   0.0,   0.0] OFF


but after a few minutes, it is declared dead,

# gstat -d
CLUSTER INFORMATION
      Name: unspecified
     Hosts: 1
Gexec Hosts: 0
Dead Hosts: 1
 Localtime: Wed Mar 12 11:26:39 2003

DEAD CLUSTER HOSTS
                       Hostname   Last Reported
                        batt016   Wed Mar 12 11:25:17 2003

On the compute node, gstat never shows any information about the manager node.
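
If it would help, I can also dump the raw XML each gmond is serving by connecting to its TCP port (assuming the default xml_port of 8649 is in effect):

# telnet localhost 8649

That should show exactly which hosts each gmond has actually heard from.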


When I ping the multicast address from the manager, I usually only get responses from the manager itself, though every once in a while a response from the compute node comes through:

# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 10.1.1.1 : 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=0 ttl=255 time=68 usec
64 bytes from 10.1.1.1: icmp_seq=1 ttl=255 time=26 usec
64 bytes from 10.1.1.1: icmp_seq=2 ttl=255 time=44 usec
64 bytes from 10.1.1.16: icmp_seq=2 ttl=255 time=166 usec (DUP!)

--- 239.2.11.71 ping statistics ---
3 packets transmitted, 3 packets received, +1 duplicates, 0% packet loss
round-trip min/avg/max/mdev = 0.026/0.076/0.166/0.054 ms

# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 10.1.1.1 : 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=0 ttl=255 time=57 usec
64 bytes from 10.1.1.1: icmp_seq=1 ttl=255 time=37 usec
64 bytes from 10.1.1.1: icmp_seq=2 ttl=255 time=33 usec
64 bytes from 10.1.1.1: icmp_seq=3 ttl=255 time=33 usec
64 bytes from 10.1.1.1: icmp_seq=4 ttl=255 time=36 usec
64 bytes from 10.1.1.1: icmp_seq=5 ttl=255 time=36 usec
64 bytes from 10.1.1.1: icmp_seq=6 ttl=255 time=31 usec
64 bytes from 10.1.1.1: icmp_seq=7 ttl=255 time=31 usec
64 bytes from 10.1.1.1: icmp_seq=8 ttl=255 time=28 usec
64 bytes from 10.1.1.1: icmp_seq=9 ttl=255 time=44 usec
64 bytes from 10.1.1.1: icmp_seq=10 ttl=255 time=30 usec
64 bytes from 10.1.1.1: icmp_seq=11 ttl=255 time=31 usec
64 bytes from 10.1.1.1: icmp_seq=12 ttl=255 time=30 usec


Here is the routing table on the manager:

# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
239.2.11.71     *               255.255.255.255 UH    0      0        0 eth0
10.1.1.0        *               255.255.255.0   U     0      0        0 eth0
192.168.0.0     *               255.255.240.0   U     0      0        0 eth1
127.0.0.0       *               255.0.0.0       U     0      0        0 lo
224.0.0.0       *               240.0.0.0       U     0      0        0 eth0
default         gtwy            0.0.0.0         UG    0      0        0 eth1


Pinging the multicast address from the compute node only ever gets responses from that node itself,

# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 10.1.1.16 : 56(84) bytes of data.
64 bytes from batt016 (10.1.1.16): icmp_seq=0 ttl=255 time=45 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=1 ttl=255 time=13 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=2 ttl=255 time=9 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=3 ttl=255 time=8 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=4 ttl=255 time=7 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=5 ttl=255 time=7 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=6 ttl=255 time=8 usec
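
The next thing I plan to try is watching the wire with tcpdump on both ends, to see whether the gmond packets even leave the compute node and whether they arrive on the manager's eth0 (assuming tcpdump is available in the diskless image), something like:

# tcpdump -i eth0 -n host 239.2.11.71
# tcpdump -i eth0 -n udp port 8649

(8649 being the default mcast_port, unless I've misread gmond.conf.)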

When I ping the all-hosts multicast group (224.0.0.1) from the manager, I see responses from all of the nodes,

# ping 224.0.0.1
PING 224.0.0.1 (224.0.0.1) from 10.1.1.1 : 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=0 ttl=255 time=68 usec
64 bytes from 10.1.1.14: icmp_seq=0 ttl=255 time=181 usec (DUP!)
64 bytes from 10.1.1.12: icmp_seq=0 ttl=255 time=183 usec (DUP!)
64 bytes from 10.1.1.10: icmp_seq=0 ttl=255 time=196 usec (DUP!)
64 bytes from 10.1.1.7: icmp_seq=0 ttl=255 time=207 usec (DUP!)
64 bytes from 10.1.1.6: icmp_seq=0 ttl=255 time=210 usec (DUP!)
64 bytes from 10.1.1.9: icmp_seq=0 ttl=255 time=225 usec (DUP!)
64 bytes from 10.1.1.11: icmp_seq=0 ttl=255 time=234 usec (DUP!)
64 bytes from 10.1.1.13: icmp_seq=0 ttl=255 time=244 usec (DUP!)
64 bytes from 10.1.1.8: icmp_seq=0 ttl=255 time=254 usec (DUP!)
64 bytes from 10.1.1.15: icmp_seq=0 ttl=255 time=263 usec (DUP!)
64 bytes from 10.1.1.16: icmp_seq=0 ttl=255 time=273 usec (DUP!)
64 bytes from 10.1.1.2: icmp_seq=0 ttl=255 time=283 usec (DUP!)
64 bytes from 10.1.1.4: icmp_seq=0 ttl=255 time=293 usec (DUP!)
64 bytes from 10.1.1.5: icmp_seq=0 ttl=255 time=302 usec (DUP!)
64 bytes from 10.1.1.3: icmp_seq=0 ttl=255 time=312 usec (DUP!)
64 bytes from 10.1.1.251: icmp_seq=0 ttl=255 time=530 usec (DUP!)


On the compute node, I can't ping the all-hosts group at all,

# ping 224.0.0.1
connect: Network is unreachable

unless I first add a route covering the multicast range,


# route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0
# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
224.0.0.0       *               255.255.255.255 UH    0      0        0 eth0
239.2.11.71     *               255.255.255.255 UH    0      0        0 eth0
10.1.1.0        *               255.255.255.0   U     0      0        0 eth0
127.0.0.0       *               255.0.0.0       U     0      0        0 lo
224.0.0.0       *               240.0.0.0       U     0      0        0 eth0

then it can see all the hosts,

# ping 224.0.0.1
PING 224.0.0.1 (224.0.0.1) from 10.1.1.16 : 56(84) bytes of data.
64 bytes from batt016 (10.1.1.16): icmp_seq=0 ttl=255 time=49 usec
64 bytes from batt001 (10.1.1.1): icmp_seq=0 ttl=255 time=270 usec (DUP!)
64 bytes from batt008 (10.1.1.8): icmp_seq=0 ttl=255 time=308 usec (DUP!)
64 bytes from batt006 (10.1.1.6): icmp_seq=0 ttl=255 time=321 usec (DUP!)
64 bytes from batt010 (10.1.1.10): icmp_seq=0 ttl=255 time=351 usec (DUP!)
64 bytes from batt009 (10.1.1.9): icmp_seq=0 ttl=255 time=369 usec (DUP!)
64 bytes from batt015 (10.1.1.15): icmp_seq=0 ttl=255 time=390 usec (DUP!)
64 bytes from batt014 (10.1.1.14): icmp_seq=0 ttl=255 time=412 usec (DUP!)
64 bytes from batt013 (10.1.1.13): icmp_seq=0 ttl=255 time=420 usec (DUP!)
64 bytes from batt011 (10.1.1.11): icmp_seq=0 ttl=255 time=431 usec (DUP!)
64 bytes from batt002 (10.1.1.2): icmp_seq=0 ttl=255 time=456 usec (DUP!)
64 bytes from batt007 (10.1.1.7): icmp_seq=0 ttl=255 time=471 usec (DUP!)
64 bytes from batt012 (10.1.1.12): icmp_seq=0 ttl=255 time=513 usec (DUP!)
64 bytes from batt003 (10.1.1.3): icmp_seq=0 ttl=255 time=523 usec (DUP!)
64 bytes from batt005 (10.1.1.5): icmp_seq=0 ttl=255 time=532 usec (DUP!)
64 bytes from batt004 (10.1.1.4): icmp_seq=0 ttl=255 time=548 usec (DUP!)
64 bytes from batt251 (10.1.1.251): icmp_seq=0 ttl=255 time=619 usec (DUP!)
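
Right now I'm adding those routes by hand after every boot; eventually I'd put something like this in whatever startup script the NFS-root image runs on the compute nodes, though that shouldn't change the behavior I'm seeing:

route add -host 239.2.11.71 dev eth0
route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0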

I have a second cluster where I've installed ROCKS, and gmond works fine there: each node can see all the others with gstat, and pinging 239.2.11.71 from any node gets responses from all nodes. I can't figure out why one system works and the other doesn't. The routing tables are the same. The kernels are different because I need nfs-root support for my diskless compute nodes, but I don't see why that should matter.

It seems like there's something intermittent about the multicast connection from the compute node to the manager: some partial data gets through on the first attempt, but after that nothing does, and nothing ever makes it from the manager to the compute node. Have I made a mistake in my multicast configuration?

I've tried several versions of gmond (2.5.1-1, 2.5.1-3, and 2.5.3-1), and they all show the same behavior. I've also run gmond in debug mode, but that hasn't turned up any smoking guns.
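
For reference, the multicast-related settings I know of in gmond.conf look like this (2.5.x names, left at their defaults, if I'm reading the sample config correctly):

mcast_channel  239.2.11.71
mcast_port     8649
mcast_ttl      1
# mcast_if  eth0

The commented mcast_if line is the one I'm wondering about: since the manager is dual-homed, do I need to force it to eth0 there?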

Can anyone suggest what's going wrong and how to fix it?

Thanks!

