Ganglia-General,

I'm trying to install Ganglia on a Linux cluster. Each node is a dual-CPU 1.4 GHz P3 system. The manager has local disks, runs the 2.4.17 kernel, and has two Ethernet interfaces (eth0 = 10.1.1.1, facing the compute nodes, and eth1 going to the outside). The compute nodes are diskless, run the 2.4.3 kernel, and each have a single Ethernet interface (10.1.1.*). The manager and compute nodes are connected by a 100baseT switch.

The installation on the manager went normally: gmond came up, and I can see the manager node with gstat (and with gmetad and the web frontend).
# gstat -a
CLUSTER INFORMATION
      Name: unspecified
     Hosts: 2
Gexec Hosts: 0
Dead Hosts: 0
 Localtime: Wed Mar 12 11:23:51 2003

CLUSTER HOSTS
Hostname                   LOAD                      CPU               Gexec
CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System, Idle]

batt001
   2 (    3/  119) [  0.86,  0.28,  0.31] [  13.8,   0.0,  12.1,  77.5] OFF


When I installed it on a compute node (batt016), gmond segfaulted on startup until I added this route,

route add -host 239.2.11.71 dev eth0

# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
239.2.11.71     *               255.255.255.255 UH    0      0        0 eth0
10.1.1.0        *               255.255.255.0   U     0      0        0 eth0
127.0.0.0       *               255.0.0.0       U     0      0        0 lo


and then gmond started up without crashing. Running gstat on the compute node now gives,

# gstat -a
CLUSTER INFORMATION
      Name: unspecified
     Hosts: 1
Gexec Hosts: 0
Dead Hosts: 1
 Localtime: Wed Mar 12 11:24:11 2003

CLUSTER HOSTS
Hostname                   LOAD                      CPU               Gexec
CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System, Idle]

batt016
   2 (    0/   33) [  0.13,  0.03,  0.00] [   0.6,   0.0,   0.0, 100.0] OFF
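
One thing I'm not sure how to rule out is whether gmond on the compute node is actually joining the multicast group. I assume I can check that with netstat -g, or from the kernel's own table, if those behave the same on the 2.4.3 kernel:

# netstat -g
# cat /proc/net/igmp

netstat -g should list the groups each interface has joined; /proc/net/igmp shows the same thing straight from the kernel (with the group addresses in hex, I think).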

Shortly after gmond starts on the compute node, that node shows up in the manager's gstat output, but with only partial information,

# gstat -a
CLUSTER INFORMATION
      Name: unspecified
     Hosts: 2
Gexec Hosts: 0
Dead Hosts: 0
 Localtime: Wed Mar 12 11:23:51 2003

CLUSTER HOSTS
Hostname                   LOAD                      CPU               Gexec
CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System, Idle]

batt001
   2 (    3/  119) [  0.86,  0.28,  0.31] [  13.8,   0.0,  12.1,  77.5] OFF
batt016
   0 (    0/    0) [  0.00,  0.00,  0.00] [   0.0,   0.0,   0.0,   0.0] OFF


but after a few minutes, it is declared dead,

# gstat -d
CLUSTER INFORMATION
      Name: unspecified
     Hosts: 1
Gexec Hosts: 0
Dead Hosts: 1
 Localtime: Wed Mar 12 11:26:39 2003

DEAD CLUSTER HOSTS
                       Hostname   Last Reported
                        batt016   Wed Mar 12 11:25:17 2003

On the compute node, gstat never shows any information about the manager node.
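
If it would help, I can also dump the raw XML each gmond is serving by connecting to its TCP port (assuming the default xml_port of 8649 is in effect):

# telnet localhost 8649

That should show exactly which hosts each gmond has actually heard from.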


When I ping the multicast address from the manager, I usually only get responses from the manager itself, though every once in a while a response from the compute node comes through:

# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 10.1.1.1 : 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=0 ttl=255 time=68 usec
64 bytes from 10.1.1.1: icmp_seq=1 ttl=255 time=26 usec
64 bytes from 10.1.1.1: icmp_seq=2 ttl=255 time=44 usec
64 bytes from 10.1.1.16: icmp_seq=2 ttl=255 time=166 usec (DUP!)

--- 239.2.11.71 ping statistics ---
3 packets transmitted, 3 packets received, +1 duplicates, 0% packet loss
round-trip min/avg/max/mdev = 0.026/0.076/0.166/0.054 ms

# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 10.1.1.1 : 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=0 ttl=255 time=57 usec
64 bytes from 10.1.1.1: icmp_seq=1 ttl=255 time=37 usec
64 bytes from 10.1.1.1: icmp_seq=2 ttl=255 time=33 usec
64 bytes from 10.1.1.1: icmp_seq=3 ttl=255 time=33 usec
64 bytes from 10.1.1.1: icmp_seq=4 ttl=255 time=36 usec
64 bytes from 10.1.1.1: icmp_seq=5 ttl=255 time=36 usec
64 bytes from 10.1.1.1: icmp_seq=6 ttl=255 time=31 usec
64 bytes from 10.1.1.1: icmp_seq=7 ttl=255 time=31 usec
64 bytes from 10.1.1.1: icmp_seq=8 ttl=255 time=28 usec
64 bytes from 10.1.1.1: icmp_seq=9 ttl=255 time=44 usec
64 bytes from 10.1.1.1: icmp_seq=10 ttl=255 time=30 usec
64 bytes from 10.1.1.1: icmp_seq=11 ttl=255 time=31 usec
64 bytes from 10.1.1.1: icmp_seq=12 ttl=255 time=30 usec


Here is the routing table on the manager:

# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
239.2.11.71     *               255.255.255.255 UH    0      0        0 eth0
10.1.1.0        *               255.255.255.0   U     0      0        0 eth0
192.168.0.0     *               255.255.240.0   U     0      0        0 eth1
127.0.0.0       *               255.0.0.0       U     0      0        0 lo
224.0.0.0       *               240.0.0.0       U     0      0        0 eth0
default         gtwy            0.0.0.0         UG    0      0        0 eth1


Pinging the multicast address from the compute node only ever gets responses from that node itself,

# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 10.1.1.16 : 56(84) bytes of data.
64 bytes from batt016 (10.1.1.16): icmp_seq=0 ttl=255 time=45 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=1 ttl=255 time=13 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=2 ttl=255 time=9 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=3 ttl=255 time=8 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=4 ttl=255 time=7 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=5 ttl=255 time=7 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=6 ttl=255 time=8 usec
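
The next thing I plan to try is watching the wire with tcpdump on both ends, to see whether the gmond packets even leave the compute node and whether they arrive on the manager's eth0 (assuming tcpdump is available in the diskless image), something like:

# tcpdump -i eth0 -n host 239.2.11.71
# tcpdump -i eth0 -n udp port 8649

(8649 being the default mcast_port, unless I've misread gmond.conf.)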

When I ping the all-hosts multicast group (224.0.0.1) from the manager, I see responses from all of the nodes,

# ping 224.0.0.1
PING 224.0.0.1 (224.0.0.1) from 10.1.1.1 : 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=0 ttl=255 time=68 usec
64 bytes from 10.1.1.14: icmp_seq=0 ttl=255 time=181 usec (DUP!)
64 bytes from 10.1.1.12: icmp_seq=0 ttl=255 time=183 usec (DUP!)
64 bytes from 10.1.1.10: icmp_seq=0 ttl=255 time=196 usec (DUP!)
64 bytes from 10.1.1.7: icmp_seq=0 ttl=255 time=207 usec (DUP!)
64 bytes from 10.1.1.6: icmp_seq=0 ttl=255 time=210 usec (DUP!)
64 bytes from 10.1.1.9: icmp_seq=0 ttl=255 time=225 usec (DUP!)
64 bytes from 10.1.1.11: icmp_seq=0 ttl=255 time=234 usec (DUP!)
64 bytes from 10.1.1.13: icmp_seq=0 ttl=255 time=244 usec (DUP!)
64 bytes from 10.1.1.8: icmp_seq=0 ttl=255 time=254 usec (DUP!)
64 bytes from 10.1.1.15: icmp_seq=0 ttl=255 time=263 usec (DUP!)
64 bytes from 10.1.1.16: icmp_seq=0 ttl=255 time=273 usec (DUP!)
64 bytes from 10.1.1.2: icmp_seq=0 ttl=255 time=283 usec (DUP!)
64 bytes from 10.1.1.4: icmp_seq=0 ttl=255 time=293 usec (DUP!)
64 bytes from 10.1.1.5: icmp_seq=0 ttl=255 time=302 usec (DUP!)
64 bytes from 10.1.1.3: icmp_seq=0 ttl=255 time=312 usec (DUP!)
64 bytes from 10.1.1.251: icmp_seq=0 ttl=255 time=530 usec (DUP!)


On the compute node, I can't ping the all-hosts group at all,

# ping 224.0.0.1
connect: Network is unreachable

unless I first add a route covering the multicast range,


# route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0
# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
224.0.0.0       *               255.255.255.255 UH    0      0        0 eth0
239.2.11.71     *               255.255.255.255 UH    0      0        0 eth0
10.1.1.0        *               255.255.255.0   U     0      0        0 eth0
127.0.0.0       *               255.0.0.0       U     0      0        0 lo
224.0.0.0       *               240.0.0.0       U     0      0        0 eth0

then it can see all the hosts,

# ping 224.0.0.1
PING 224.0.0.1 (224.0.0.1) from 10.1.1.16 : 56(84) bytes of data.
64 bytes from batt016 (10.1.1.16): icmp_seq=0 ttl=255 time=49 usec
64 bytes from batt001 (10.1.1.1): icmp_seq=0 ttl=255 time=270 usec (DUP!)
64 bytes from batt008 (10.1.1.8): icmp_seq=0 ttl=255 time=308 usec (DUP!)
64 bytes from batt006 (10.1.1.6): icmp_seq=0 ttl=255 time=321 usec (DUP!)
64 bytes from batt010 (10.1.1.10): icmp_seq=0 ttl=255 time=351 usec (DUP!)
64 bytes from batt009 (10.1.1.9): icmp_seq=0 ttl=255 time=369 usec (DUP!)
64 bytes from batt015 (10.1.1.15): icmp_seq=0 ttl=255 time=390 usec (DUP!)
64 bytes from batt014 (10.1.1.14): icmp_seq=0 ttl=255 time=412 usec (DUP!)
64 bytes from batt013 (10.1.1.13): icmp_seq=0 ttl=255 time=420 usec (DUP!)
64 bytes from batt011 (10.1.1.11): icmp_seq=0 ttl=255 time=431 usec (DUP!)
64 bytes from batt002 (10.1.1.2): icmp_seq=0 ttl=255 time=456 usec (DUP!)
64 bytes from batt007 (10.1.1.7): icmp_seq=0 ttl=255 time=471 usec (DUP!)
64 bytes from batt012 (10.1.1.12): icmp_seq=0 ttl=255 time=513 usec (DUP!)
64 bytes from batt003 (10.1.1.3): icmp_seq=0 ttl=255 time=523 usec (DUP!)
64 bytes from batt005 (10.1.1.5): icmp_seq=0 ttl=255 time=532 usec (DUP!)
64 bytes from batt004 (10.1.1.4): icmp_seq=0 ttl=255 time=548 usec (DUP!)
64 bytes from batt251 (10.1.1.251): icmp_seq=0 ttl=255 time=619 usec (DUP!)
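
Right now I'm adding those routes by hand after every boot; eventually I'd put something like this in whatever startup script the NFS-root image runs on the compute nodes, though that shouldn't change the behavior I'm seeing:

route add -host 239.2.11.71 dev eth0
route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0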

I have a second cluster where I've installed ROCKS, and gmond works fine there: each node can see all the others with gstat, and pinging 239.2.11.71 from any node gets responses from all nodes. I can't figure out why one system works and the other doesn't. The routing tables are the same. The kernels are different because I need nfs-root support for my diskless compute nodes, but I don't see why that should matter.

It seems like there's something intermittent about the multicast connection from the compute node to the manager: some partial data gets through on the first attempt, but after that nothing does, and nothing ever makes it from the manager to the compute node. Have I made a mistake in my multicast configuration?

I've tried several versions of gmond (2.5.1-1, 2.5.1-3, and 2.5.3-1), and they all show the same behavior. I've also run gmond in debug mode, but that hasn't turned up any smoking guns.
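
For reference, the multicast-related settings I know of in gmond.conf look like this (2.5.x names, left at their defaults, if I'm reading the sample config correctly):

mcast_channel  239.2.11.71
mcast_port     8649
mcast_ttl      1
# mcast_if  eth0

The commented mcast_if line is the one I'm wondering about: since the manager is dual-homed, do I need to force it to eth0 there?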

Can anyone suggest what's going wrong and how to fix it?

Thanks!

