Ganglia-General,
I'm trying to install Ganglia on a Linux cluster. Each node is a
dual-CPU 1.4 GHz P3 system. The manager has local disks, runs the
2.4.17 kernel, and has two Ethernet interfaces (eth0 = 10.1.1.1, facing
the compute nodes, and eth1 facing the outside). The compute nodes are
diskless, run the 2.4.3 kernel, and each has a single Ethernet
interface (10.1.1.*). The manager and compute nodes are connected by a
100Base-T switch.
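(In case it matters: I'm using the default multicast settings in
gmond.conf on every node. As I understand the 2.5.x sample config, the
relevant directives are roughly the following; mcast_if in particular is
something I wonder about, since the manager is dual-homed:

  # default multicast channel and port gmond sends/listens on
  mcast_channel  239.2.11.71
  mcast_port     8649
  # on a multi-homed host this would pin multicast to one interface;
  # I have left it commented out, which is the default
  # mcast_if     eth0
)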
The installation on the manager went normally: gmond came up, and I can
see the manager node with gstat (and with gmetad and the web frontend).
# gstat -a
CLUSTER INFORMATION
Name: unspecified
Hosts: 2
Gexec Hosts: 0
Dead Hosts: 0
Localtime: Wed Mar 12 11:23:51 2003
CLUSTER HOSTS
Hostname LOAD CPU Gexec
CPUs (Procs/Total) [ 1, 5, 15min] [ User, Nice, System, Idle]
batt001
2 ( 3/ 119) [ 0.86, 0.28, 0.31] [ 13.8, 0.0, 12.1, 77.5] OFF
When I installed gmond on a compute node (batt016), it segfaulted until
I added this route:
route add -host 239.2.11.71 dev eth0
[EMAIL PROTECTED] ~]# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
239.2.11.71 * 255.255.255.255 UH 0 0 0 eth0
10.1.1.0 * 255.255.255.0 U 0 0 0 eth0
127.0.0.0 * 255.0.0.0 U 0 0 0 lo
and after that it came up OK. gstat on the compute node shows:
# gstat -a
CLUSTER INFORMATION
Name: unspecified
Hosts: 1
Gexec Hosts: 0
Dead Hosts: 1
Localtime: Wed Mar 12 11:24:11 2003
CLUSTER HOSTS
Hostname LOAD CPU Gexec
CPUs (Procs/Total) [ 1, 5, 15min] [ User, Nice, System, Idle]
batt016
2 ( 0/ 33) [ 0.13, 0.03, 0.00] [ 0.6, 0.0, 0.0, 100.0] OFF
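(One thing I still need to double-check is whether eth0 on the manager
and on the compute nodes has actually joined the 239.2.11.71 group once
gmond is running. I assume something like this would show the
memberships:

  # list multicast group memberships per interface
  /sbin/netstat -g
  cat /proc/net/igmp
)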
Shortly after gmond starts on the compute node, the node appears in the
manager's gstat output, but with only partial information:
# gstat -a
CLUSTER INFORMATION
Name: unspecified
Hosts: 2
Gexec Hosts: 0
Dead Hosts: 0
Localtime: Wed Mar 12 11:23:51 2003
CLUSTER HOSTS
Hostname LOAD CPU Gexec
CPUs (Procs/Total) [ 1, 5, 15min] [ User, Nice, System, Idle]
batt001
2 ( 3/ 119) [ 0.86, 0.28, 0.31] [ 13.8, 0.0, 12.1, 77.5] OFF
batt016
0 ( 0/ 0) [ 0.00, 0.00, 0.00] [ 0.0, 0.0, 0.0, 0.0] OFF
but after a few minutes it is declared dead:
# gstat -d
CLUSTER INFORMATION
Name: unspecified
Hosts: 1
Gexec Hosts: 0
Dead Hosts: 1
Localtime: Wed Mar 12 11:26:39 2003
DEAD CLUSTER HOSTS
Hostname Last Reported
batt016 Wed Mar 12 11:25:17 2003
On the compute node, gstat never shows any information about the manager
node.
When I ping the multicast address from the manager, I usually get replies
only from the manager itself, though every once in a while a reply from
the compute node comes through:
# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 10.1.1.1 : 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=0 ttl=255 time=68 usec
64 bytes from 10.1.1.1: icmp_seq=1 ttl=255 time=26 usec
64 bytes from 10.1.1.1: icmp_seq=2 ttl=255 time=44 usec
64 bytes from 10.1.1.16: icmp_seq=2 ttl=255 time=166 usec (DUP!)
--- 239.2.11.71 ping statistics ---
3 packets transmitted, 3 packets received, +1 duplicates, 0% packet loss
round-trip min/avg/max/mdev = 0.026/0.076/0.166/0.054 ms
# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 10.1.1.1 : 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=0 ttl=255 time=57 usec
64 bytes from 10.1.1.1: icmp_seq=1 ttl=255 time=37 usec
64 bytes from 10.1.1.1: icmp_seq=2 ttl=255 time=33 usec
64 bytes from 10.1.1.1: icmp_seq=3 ttl=255 time=33 usec
64 bytes from 10.1.1.1: icmp_seq=4 ttl=255 time=36 usec
64 bytes from 10.1.1.1: icmp_seq=5 ttl=255 time=36 usec
64 bytes from 10.1.1.1: icmp_seq=6 ttl=255 time=31 usec
64 bytes from 10.1.1.1: icmp_seq=7 ttl=255 time=31 usec
64 bytes from 10.1.1.1: icmp_seq=8 ttl=255 time=28 usec
64 bytes from 10.1.1.1: icmp_seq=9 ttl=255 time=44 usec
64 bytes from 10.1.1.1: icmp_seq=10 ttl=255 time=30 usec
64 bytes from 10.1.1.1: icmp_seq=11 ttl=255 time=31 usec
64 bytes from 10.1.1.1: icmp_seq=12 ttl=255 time=30 usec
Here is the routing table on the manager:
[EMAIL PROTECTED] /]# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
239.2.11.71 * 255.255.255.255 UH 0 0 0 eth0
10.1.1.0 * 255.255.255.0 U 0 0 0 eth0
192.168.0.0 * 255.255.240.0 U 0 0 0 eth1
127.0.0.0 * 255.0.0.0 U 0 0 0 lo
224.0.0.0 * 240.0.0.0 U 0 0 0 eth0
default gtwy 0.0.0.0 UG 0 0 0 eth1
Pinging the multicast address from the compute node gets replies only
from the compute node itself:
[EMAIL PROTECTED] ~]# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 10.1.1.16 : 56(84) bytes of data.
64 bytes from batt016 (10.1.1.16): icmp_seq=0 ttl=255 time=45 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=1 ttl=255 time=13 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=2 ttl=255 time=9 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=3 ttl=255 time=8 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=4 ttl=255 time=7 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=5 ttl=255 time=7 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=6 ttl=255 time=8 usec
When I ping the all-hosts multicast group (224.0.0.1) from the manager,
I see responses from all nodes:
[EMAIL PROTECTED] /]# ping 224.0.0.1
PING 224.0.0.1 (224.0.0.1) from 10.1.1.1 : 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=0 ttl=255 time=68 usec
64 bytes from 10.1.1.14: icmp_seq=0 ttl=255 time=181 usec (DUP!)
64 bytes from 10.1.1.12: icmp_seq=0 ttl=255 time=183 usec (DUP!)
64 bytes from 10.1.1.10: icmp_seq=0 ttl=255 time=196 usec (DUP!)
64 bytes from 10.1.1.7: icmp_seq=0 ttl=255 time=207 usec (DUP!)
64 bytes from 10.1.1.6: icmp_seq=0 ttl=255 time=210 usec (DUP!)
64 bytes from 10.1.1.9: icmp_seq=0 ttl=255 time=225 usec (DUP!)
64 bytes from 10.1.1.11: icmp_seq=0 ttl=255 time=234 usec (DUP!)
64 bytes from 10.1.1.13: icmp_seq=0 ttl=255 time=244 usec (DUP!)
64 bytes from 10.1.1.8: icmp_seq=0 ttl=255 time=254 usec (DUP!)
64 bytes from 10.1.1.15: icmp_seq=0 ttl=255 time=263 usec (DUP!)
64 bytes from 10.1.1.16: icmp_seq=0 ttl=255 time=273 usec (DUP!)
64 bytes from 10.1.1.2: icmp_seq=0 ttl=255 time=283 usec (DUP!)
64 bytes from 10.1.1.4: icmp_seq=0 ttl=255 time=293 usec (DUP!)
64 bytes from 10.1.1.5: icmp_seq=0 ttl=255 time=302 usec (DUP!)
64 bytes from 10.1.1.3: icmp_seq=0 ttl=255 time=312 usec (DUP!)
64 bytes from 10.1.1.251: icmp_seq=0 ttl=255 time=530 usec (DUP!)
On the compute node, I can't ping the all-hosts group at all,
[EMAIL PROTECTED] ~]# ping 224.0.0.1
connect: Network is unreachable
unless I first add a route for the multicast range:
[EMAIL PROTECTED] ~]# route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0
[EMAIL PROTECTED] ~]# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
224.0.0.0 * 255.255.255.255 UH 0 0 0 eth0
239.2.11.71 * 255.255.255.255 UH 0 0 0 eth0
10.1.1.0 * 255.255.255.0 U 0 0 0 eth0
127.0.0.0 * 255.0.0.0 U 0 0 0 lo
224.0.0.0 * 240.0.0.0 U 0 0 0 eth0
After adding it, the ping gets replies from all the hosts:
[EMAIL PROTECTED] ~]# ping 224.0.0.1
PING 224.0.0.1 (224.0.0.1) from 10.1.1.16 : 56(84) bytes of data.
64 bytes from batt016 (10.1.1.16): icmp_seq=0 ttl=255 time=49 usec
64 bytes from batt001 (10.1.1.1): icmp_seq=0 ttl=255 time=270 usec (DUP!)
64 bytes from batt008 (10.1.1.8): icmp_seq=0 ttl=255 time=308 usec (DUP!)
64 bytes from batt006 (10.1.1.6): icmp_seq=0 ttl=255 time=321 usec (DUP!)
64 bytes from batt010 (10.1.1.10): icmp_seq=0 ttl=255 time=351 usec (DUP!)
64 bytes from batt009 (10.1.1.9): icmp_seq=0 ttl=255 time=369 usec (DUP!)
64 bytes from batt015 (10.1.1.15): icmp_seq=0 ttl=255 time=390 usec (DUP!)
64 bytes from batt014 (10.1.1.14): icmp_seq=0 ttl=255 time=412 usec (DUP!)
64 bytes from batt013 (10.1.1.13): icmp_seq=0 ttl=255 time=420 usec (DUP!)
64 bytes from batt011 (10.1.1.11): icmp_seq=0 ttl=255 time=431 usec (DUP!)
64 bytes from batt002 (10.1.1.2): icmp_seq=0 ttl=255 time=456 usec (DUP!)
64 bytes from batt007 (10.1.1.7): icmp_seq=0 ttl=255 time=471 usec (DUP!)
64 bytes from batt012 (10.1.1.12): icmp_seq=0 ttl=255 time=513 usec (DUP!)
64 bytes from batt003 (10.1.1.3): icmp_seq=0 ttl=255 time=523 usec (DUP!)
64 bytes from batt005 (10.1.1.5): icmp_seq=0 ttl=255 time=532 usec (DUP!)
64 bytes from batt004 (10.1.1.4): icmp_seq=0 ttl=255 time=548 usec (DUP!)
64 bytes from batt251 (10.1.1.251): icmp_seq=0 ttl=255 time=619 usec (DUP!)
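(If that extra 224.0.0.0/4 route turns out to be part of the fix, I'll
add it to the diskless nodes' startup, e.g. something like this in
rc.local; so far I've only been adding it by hand for testing:

  # route the whole multicast range out the cluster interface
  /sbin/route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0
)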
I have a second cluster on which I've installed ROCKS, and gmond works
fine there: each node can see all the others with gstat, and pinging
239.2.11.71 from any node gets responses from all nodes. I can't figure
out why one system works and the other doesn't. The routing tables are
the same. The kernels are different because I need NFS-root support for
my diskless compute nodes, but I don't see why that should matter.
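(One thing I haven't verified directly is that the compute nodes' 2.4.3
kernel was built with multicast support, i.e. CONFIG_IP_MULTICAST=y, and
that eth0 reports the MULTICAST flag. I assume it's fine, since the nodes
do answer pings to 224.0.0.1 once the route is in place, but for
completeness:

  # on a compute node: the interface flags should include MULTICAST
  /sbin/ifconfig eth0
  # on the box where the compute-node kernel was built
  # (assuming the source tree is in /usr/src/linux)
  grep CONFIG_IP_MULTICAST /usr/src/linux/.config
)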
It seems like there is something intermittent about the multicast path
from the compute nodes to the manager: some partial data gets through on
the first attempt, but after that nothing does. And nothing ever gets
from the manager to the compute nodes. Have I made a mistake in my
multicast configuration?
I've tried several versions of gmond (2.5.1-1, 2.5.1-3, and 2.5.3-1),
and they all behave the same way. I've also tried running gmond in debug
mode, but that hasn't turned up any smoking guns.
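My next step is probably to watch the wire on both sides with tcpdump to
see whether the gmond multicast traffic (UDP port 8649, as far as I know)
actually crosses the switch in each direction, something like:

  # on the manager's cluster-facing interface
  tcpdump -i eth0 -n host 239.2.11.71 and udp port 8649

but I haven't done that systematically yet.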
Can anyone suggest what's going wrong and how to fix it?
Thanks!