I'm having a very strange problem trying to get a cluster running. I have a cluster of three nodes, each running in its own VirtualBox Fedora 11 guest on a Fedora 11 host.
Heartbeat appears to start just fine on all nodes, but none of them see each other. I have run Wireshark on all three machines, and see heartbeat packets arriving from all three of my nodes on their correct multicast address. It's just as if heartbeat is not listening where it's supposed to be.

I installed heartbeat from yum as seen here:

[r...@ct02 ha.d]# yum list heartbeat
Loaded plugins: refresh-packagekit
Installed Packages
heartbeat.x86_64    2.1.4-6.fc11    @fedora

My ha.cf is as follows:

logfacility local0
deadtime 60
warntime 20
initdead 120
mcast eth0 225.0.0.1 694 1 0
auto_failback on
node ct01.xyz.com
node ct02.xyz.com
node ct03.xyz.com

My haresources is as follows:

ct01.xyz.com 192.168.101.210 httpd smb

As you can see from the following log, the other nodes are immediately seen as dead, and no traffic is ever reported for them. The logs are just about identical on all three nodes. Is this something to do with VirtualBox? I wouldn't think so because Wireshark has no problem observing the traffic. I feel like I must be missing something stupid here. Please help!

Apr 14 16:13:34 ct01 heartbeat: [8077]: info: heartbeat: version 2.1.4
Apr 14 16:13:34 ct01 heartbeat: [8077]: info: Heartbeat generation: 1271278771
Apr 14 16:13:34 ct01 heartbeat: [8077]: info: glib: UDP multicast heartbeat started for group 225.0.0.1 port 694 interface eth0 (ttl=1 loop=0)
Apr 14 16:13:34 ct01 heartbeat: [8077]: info: G_main_add_TriggerHandler: Added signal manual handler
Apr 14 16:13:34 ct01 heartbeat: [8077]: info: G_main_add_TriggerHandler: Added signal manual handler
Apr 14 16:13:34 ct01 heartbeat: [8077]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Apr 14 16:13:34 ct01 heartbeat: [8077]: info: Local status now set to: 'up'
Apr 14 16:15:34 ct01 heartbeat: [8077]: WARN: node ct02.xyz.com: is dead
Apr 14 16:15:34 ct01 heartbeat: [8077]: WARN: No STONITH device configured.
Apr 14 16:15:34 ct01 heartbeat: [8077]: WARN: Shared disks are not protected.
Apr 14 16:15:34 ct01 heartbeat: [8077]: info: Resources being acquired from ct02.xyz.com.
Apr 14 16:15:34 ct01 heartbeat: [8077]: WARN: node ct03.xyz.com: is dead
Apr 14 16:15:34 ct01 heartbeat: [8077]: info: Comm_now_up(): updating status to active
Apr 14 16:15:34 ct01 heartbeat: [8077]: info: Local status now set to: 'active'
Apr 14 16:15:34 ct01 heartbeat: [8077]: WARN: No STONITH device configured.
Apr 14 16:15:34 ct01 heartbeat: [8077]: WARN: Shared disks are not protected.
Apr 14 16:15:34 ct01 heartbeat: [8077]: info: Resources being acquired from ct03.xyz.com.
Apr 14 16:15:34 ct01 harc[8084]: info: Running /etc/ha.d/rc.d/status status
Apr 14 16:15:34 ct01 mach_down[8151]: info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
Apr 14 16:15:34 ct01 mach_down[8151]: info: mach_down takeover complete for node ct02.xyz.com.
Apr 14 16:15:34 ct01 heartbeat: [8077]: info: mach_down takeover complete.
Apr 14 16:15:34 ct01 heartbeat: [8077]: info: Initial resource acquisition complete (mach_down)
Apr 14 16:15:34 ct01 IPaddr[8183]: INFO: Resource is stopped
Apr 14 16:15:34 ct01 IPaddr[8186]: INFO: Resource is stopped
Apr 14 16:15:34 ct01 heartbeat: [8086]: info: Local Resource acquisition completed.
Apr 14 16:15:34 ct01 heartbeat: [8085]: info: Local Resource acquisition completed.
Apr 14 16:15:34 ct01 harc[8303]: info: Running /etc/ha.d/rc.d/status status
Apr 14 16:15:34 ct01 mach_down[8319]: info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
Apr 14 16:15:34 ct01 mach_down[8319]: info: mach_down takeover complete for node ct03.xyz.com.
Apr 14 16:15:34 ct01 heartbeat: [8077]: info: mach_down takeover complete.
Apr 14 16:15:34 ct01 harc[8353]: info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
Apr 14 16:15:34 ct01 ip-request-resp[8353]: received ip-request-resp 192.168.101.210/24 OK yes
Apr 14 16:15:34 ct01 ResourceManager[8374]: info: Acquiring resource group: ct01.xyz.com 192.168.101.210/24 httpd smb
Apr 14 16:15:34 ct01 IPaddr[8401]: INFO: Resource is stopped
Apr 14 16:15:35 ct01 ResourceManager[8374]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.101.210/24 start
Apr 14 16:15:35 ct01 IPaddr[8495]: INFO: Using calculated nic for 192.168.101.210: eth0
Apr 14 16:15:35 ct01 IPaddr[8495]: INFO: Using calculated netmask for 192.168.101.210: 255.255.255.0
Apr 14 16:15:35 ct01 IPaddr[8495]: INFO: eval ifconfig eth0:0 192.168.101.210 netmask 255.255.255.0 broadcast 192.168.101.255
Apr 14 16:15:35 ct01 avahi-daemon[1074]: Registering new address record for 192.168.101.210 on eth0.IPv4.
Apr 14 16:15:35 ct01 IPaddr[8469]: INFO: Success
Apr 14 16:15:35 ct01 ResourceManager[8374]: info: Running /etc/init.d/httpd start
Apr 14 16:15:35 ct01 ResourceManager[8374]: info: Running /etc/init.d/smb start
Apr 14 16:15:35 ct01 harc[8671]: info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
Apr 14 16:15:35 ct01 ip-request-resp[8671]: received ip-request-resp 192.168.101.210/24 OK yes
Apr 14 16:15:35 ct01 ResourceManager[8692]: info: Acquiring resource group: ct01.xyz.com 192.168.101.210/24 httpd smb
Apr 14 16:15:35 ct01 IPaddr[8719]: INFO: Running OK
Apr 14 16:15:35 ct01 ResourceManager[8692]: info: Running /etc/init.d/httpd start
Apr 14 16:15:36 ct01 ResourceManager[8692]: ERROR: Return code 1 from /etc/init.d/httpd
Apr 14 16:15:36 ct01 ResourceManager[8692]: CRIT: Giving up resources due to failure of httpd
Apr 14 16:15:36 ct01 ResourceManager[8692]: info: Releasing resource group: ct01.xyz.com 192.168.101.210/24 httpd smb
Apr 14 16:15:36 ct01 ResourceManager[8692]: info: Running /etc/init.d/smb stop
Apr 14 16:15:36 ct01 smbd[8666]: [2010/04/14 16:15:36, 0] smbd/server.c:457(smbd_open_one_socket)
Apr 14 16:15:36 ct01 smbd[8666]: smbd_open_once_socket: open_socket_in: Address already in use
Apr 14 16:15:36 ct01 smbd[8666]: [2010/04/14 16:15:36, 0] smbd/server.c:457(smbd_open_one_socket)
Apr 14 16:15:36 ct01 smbd[8666]: smbd_open_once_socket: open_socket_in: Address already in use
Apr 14 16:15:36 ct01 ntpd[1368]: Listening on interface #7 eth0:0, 192.168.101.210#123 Enabled
Apr 14 16:15:37 ct01 ResourceManager[8692]: info: Running /etc/init.d/httpd stop
Apr 14 16:15:37 ct01 ResourceManager[8692]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.101.210/24 stop
Apr 14 16:15:37 ct01 IPaddr[8924]: INFO: ifconfig eth0:0 down
Apr 14 16:15:37 ct01 avahi-daemon[1074]: Withdrawing address record for 192.168.101.210 on eth0.
Apr 14 16:15:37 ct01 IPaddr[8898]: INFO: Success
Apr 14 16:15:38 ct01 ntpd[1368]: Deleting interface #7 eth0:0, 192.168.101.210#123, interface stats: received=0, sent=0, dropped=0, active_time=2 secs
Apr 14 16:15:45 ct01 heartbeat: [8077]: info: Local Resource acquisition completed. (none)
Apr 14 16:15:45 ct01 heartbeat: [8077]: info: local resource transition completed.
Apr 14 16:16:07 ct01 hb_standby[8969]: Going standby [foreign].

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
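
A minimal sketch, not part of the original setup, of one way to check whether an ordinary user-space socket on a guest can receive the multicast heartbeats that Wireshark shows. It assumes the mcast line in ha.cf has the field order device, group, port, ttl, loop, which matches the startup log line above (group 225.0.0.1 port 694 interface eth0, ttl=1 loop=0). The names parse_mcast and open_listener are hypothetical helpers, not heartbeat APIs, and the listener joins the group on the default interface rather than explicitly on eth0.

```python
import socket
import struct


def parse_mcast(line):
    """Split an ha.cf 'mcast' directive into named fields.

    Assumed field order: mcast <dev> <group> <port> <ttl> <loop>.
    """
    fields = line.split()
    if len(fields) != 6 or fields[0] != "mcast":
        raise ValueError("not an mcast directive: %r" % line)
    _, dev, group, port, ttl, loop = fields
    return {
        "dev": dev,
        "group": group,
        "port": int(port),
        "ttl": int(ttl),
        "loop": int(loop),
    }


def open_listener(group, port):
    """Return a UDP socket bound to the port and joined to the group.

    Assumption: the group is joined on the default interface
    (INADDR_ANY), which may or may not be eth0 on a given guest.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # struct ip_mreq: multicast group address followed by local interface address.
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


cfg = parse_mcast("mcast eth0 225.0.0.1 694 1 0")
# To listen (as root, since port 694 is below 1024, and with heartbeat
# stopped on this node so the port is free):
#   sock = open_listener(cfg["group"], cfg["port"])
#   data, sender = sock.recvfrom(4096)
#   print(sender, len(data))
```

If datagrams from the other two nodes show up here, multicast delivery to user space works on that guest. If nothing arrives while Wireshark still shows the packets, something between the capture point and the socket (a local firewall rule, for example) may be dropping them, since packet capture sees traffic before the host's input filtering is applied.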
