Hi everybody,
I'm not completely sure whether this is a ganglia or a bonding issue,
so I'm sending the following to both lists.
I'm having some problems installing the ganglia cluster toolkit on a
cluster of computers interconnected with bonded ethernet through two
3com switches. Every machine has two NICs, each connected to a
different switch.
Ganglia makes use of multicasting for distributing the status of each
node among all the nodes. It works perfectly with 8 of the computers,
which are the computing nodes (two processors, memory, two NICs, no
HD).
Here you can see the contents of /proc/net/igmp once I start ganglia
with "gmond -i bond0" on one of these nodes:
1    lo      :     0  V2
                               010000E0      1  0:FB5C0BCB  0
2    eth0    :     1  V2
                               010000E0      1  0:FB5C0BCB  0
3    eth1    :     1  V2
                               010000E0      1  0:FB5C0BCB  0
4    bond0   :     2  V2
                               470B02EF      1  0:FEF49B60  0
                               010000E0      1  0:FB5C0BCB  0
Note that ganglia has correctly registered on bond0 (the extra group
470B02EF is listed there).
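In case the hex group fields are hard to read, here is a small Python sketch that decodes them. It assumes a little-endian machine such as x86, where the kernel prints the address bytes reversed; the function names are just mine. With that assumption, 010000E0 is 224.0.0.1 (all-hosts) and 470B02EF is 239.2.11.71, which I believe is ganglia's default multicast channel.

```python
import re
import socket
import struct

def decode_group(hex_group):
    """Turn a /proc/net/igmp group field such as '470B02EF' into a dotted quad."""
    # The file shows the address as a raw 32-bit value; on a little-endian
    # machine (assumed here) the bytes therefore appear in reverse order.
    return socket.inet_ntoa(struct.pack("<I", int(hex_group, 16)))

def groups_by_device(igmp_text):
    """Map each device name to the list of multicast groups joined on it."""
    result = {}
    device = None
    for line in igmp_text.splitlines():
        # Device lines look like: "4 bond0 : 2 V2"
        dev = re.match(r"\s*\d+\s+(\S+)\s*:\s*\d+\s+V\d", line)
        if dev:
            device = dev.group(1)
            result[device] = []
            continue
        # Group lines start with eight hex digits: "470B02EF 1 0:FEF49B60 0"
        grp = re.match(r"\s*([0-9A-Fa-f]{8})\s", line)
        if grp and device is not None:
            result[device].append(decode_group(grp.group(1)))
    return result

# Two entries copied from the listing of the working node.
SAMPLE = """\
3 eth1 : 1 V2
010000E0 1 0:FB5C0BCB 0
4 bond0 : 2 V2
470B02EF 1 0:FEF49B60 0
010000E0 1 0:FB5C0BCB 0
"""

for dev, groups in groups_by_device(SAMPLE).items():
    print(dev, groups)
```

On a live node the same function can be fed the contents of /proc/net/igmp directly.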
I have another node that is connected to the cluster and to the
external network, as well. This one has 4 NICs, 2 of which (eth0 and
eth1) are bonded, eth2 goes to the external network and eth3 connects
to one of the switches and is used for booting the nodes (DHCP/TFTP).
This node basically provides NFS service and acts as a gateway for the
rest of the nodes. It should also be used to report ganglia-monitored
metrics (that is, nifty statistics and plots about the status of the
cluster) to the external world, through HTTP.
It's this node that I'm having problems with. After starting
'gmond -i bond0', /proc/net/igmp looks like this:
Idx  Device  : Count  Querier  Group     Users  Timer       Reporter
1    lo      :     0  V2
                               010000E0      1  0:FFFAA0D3  0
2    eth0    :     2  V2
                               470B02EF      1  0:FFFF6F8E  1
                               010000E0      1  0:FFFAA0D3  0
3    eth1    :     1  V2
                               010000E0      1  0:DFDA80B2  0
4    eth2    :     1  V2
                               010000E0      1  0:DFDA80B2  0
5    bond0   :     1  V2
                               010000E0      1  0:DFDA80B2  0
6    eth3    :     1  V2
                               010000E0      1  0:DFDA80B2  0
Here we can see that ganglia has registered on eth0 instead of
bond0! As a result, ganglia on this node can't communicate with
ganglia on the other nodes.
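For reference, here is a minimal Python sketch of the two ways I know of for a program on Linux to ask for a multicast membership: by local address (struct ip_mreq) or by interface index (struct ip_mreqn). I don't know which one gmond uses, so this is only an illustration. As far as I understand, when the request is made by address the kernel has to find a device that owns that address, and on this node bond0, eth0 and eth1 all report 192.168.128.1. The helper names below are hypothetical.

```python
import socket
import struct

def mreq_by_address(group, local_addr):
    """Build a struct ip_mreq: multicast group + local interface address."""
    # The interface is identified only by its IP address, which is
    # ambiguous if several devices carry the same address.
    return socket.inet_aton(group) + socket.inet_aton(local_addr)

def mreqn_by_index(group, ifindex):
    """Build a struct ip_mreqn: group + unspecified address + interface index."""
    # The interface index names exactly one device (Linux-specific layout).
    return struct.pack("4s4si",
                       socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"),
                       ifindex)

def join_on_device(sock, group, ifname):
    """Join `group` on the device called `ifname`, e.g. 'bond0'."""
    ifindex = socket.if_nametoindex(ifname)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    mreqn_by_index(group, ifindex))
```

A quick experiment would be to open a UDP socket, call join_on_device(sock, "239.2.11.71", "bond0"), and check under which device the group then appears in /proc/net/igmp.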
I'm using kernel 2.4.18 with the bonding patch + the bonding-multicast
patch by Mark Smith and the appropriate ifenslave. The output of
ifconfig looks like this:
bond0     Link encap:Ethernet  HWaddr 00:04:E2:07:9A:F6
          inet addr:192.168.128.1  Bcast:192.168.128.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:31687 errors:0 dropped:0 overruns:0 frame:0
          TX packets:31943 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:4906926 (4.6 Mb)  TX bytes:5296714 (5.0 Mb)

eth0      Link encap:Ethernet  HWaddr 00:04:E2:07:9A:F6  Media:unknown
          inet addr:192.168.128.1  Bcast:192.168.128.255  Mask:255.255.255.0
          UP BROADCAST DEBUG RUNNING NOARP PROMISC SLAVE DYNAMIC  MTU:1500  Metric:1
          RX packets:16122 errors:0 dropped:0 overruns:0 frame:0
          TX packets:16303 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:2470826 (2.3 Mb)  TX bytes:2666706 (2.5 Mb)
          Interrupt:11 Base address:0xa000

eth1      Link encap:Ethernet  HWaddr 00:04:E2:07:9A:F6
          inet addr:192.168.128.1  Bcast:192.168.128.255  Mask:255.255.255.0
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:15565 errors:0 dropped:0 overruns:0 frame:0
          TX packets:15640 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:2436100 (2.3 Mb)  TX bytes:2630008 (2.5 Mb)
          Interrupt:12 Base address:0xc000

eth2      Link encap:Ethernet  HWaddr 00:04:E2:07:9B:46
          inet addr:193.144.17.59  Bcast:193.144.17.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:41000 errors:0 dropped:0 overruns:0 frame:0
          TX packets:691 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:14338143 (13.6 Mb)  TX bytes:80149 (78.2 Kb)
          Interrupt:10 Base address:0xe000

eth3      Link encap:Ethernet  HWaddr 00:04:75:7E:BF:49
          inet addr:192.168.127.1  Bcast:192.168.127.255  Mask:255.255.255.0
          UP BROADCAST RUNNING NOARP MULTICAST  MTU:1500  Metric:1
          RX packets:569 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:60840 (59.4 Kb)  TX bytes:0 (0.0 b)
          Interrupt:11 Base address:0xb800
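One thing that stands out in the ifconfig output is that the two slaves do not report the same flags. Here is a rough Python sketch for diffing them; it assumes the classic net-tools ifconfig layout with every interface UP, and the function name is mine. Whether the difference has anything to do with the problem, I can't say.

```python
import re

def interface_flags(ifconfig_text):
    """Map interface name -> set of flag words (UP, BROADCAST, ...)."""
    # Each interface block starts with "<name> Link encap:", and the flag
    # words sit between "UP" and "MTU:". Line wrapping is tolerated.
    pattern = re.compile(
        r"^(\S+)\s+Link encap:.*?\b(UP(?:\s+[A-Z]+)*)\s+MTU:",
        re.S | re.M,
    )
    return {m.group(1): set(m.group(2).split())
            for m in pattern.finditer(ifconfig_text)}

# The two slave entries, trimmed to the lines that matter here.
SAMPLE = """\
eth0 Link encap:Ethernet HWaddr 00:04:E2:07:9A:F6 Media:unknown
inet addr:192.168.128.1 Bcast:192.168.128.255
Mask:255.255.255.0 UP BROADCAST DEBUG RUNNING NOARP PROMISC
SLAVE DYNAMIC MTU:1500 Metric:1 RX packets:16122 errors:0
eth1 Link encap:Ethernet HWaddr 00:04:E2:07:9A:F6
inet addr:192.168.128.1 Bcast:192.168.128.255
Mask:255.255.255.0 UP BROADCAST RUNNING SLAVE MULTICAST
MTU:1500 Metric:1 RX packets:15565 errors:0 dropped:0
"""

flags = interface_flags(SAMPLE)
print("only on eth0:", sorted(flags["eth0"] - flags["eth1"]))
print("only on eth1:", sorted(flags["eth1"] - flags["eth0"]))
```

For the output above this reports DEBUG, DYNAMIC, NOARP and PROMISC only on eth0, and MULTICAST only on eth1.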
Any suggestions? Has anyone experienced a similar problem?
Kind regards,