[Linux-HA] Heartbeat doesn't see other nodes in cluster

Stephen Punak Wed, 14 Apr 2010 16:24:29 -0700

I'm having a very strange problem trying to get a cluster running. 

I have a cluster of three nodes, each running in its own VirtualBox
Fedora 11 guest on a Fedora 11 host.


Heartbeat appears to start just fine on all nodes, but none of them sees
the others.

I have run Wireshark on all three machines, and see heartbeat packets
arriving from all three of my nodes on their correct multicast address.
It's just as if heartbeat is not listening where it's supposed to be.
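
One way to check this independently of heartbeat is a tiny listener that joins the same multicast group. The sketch below is only an illustration, not anything heartbeat ships: the function names are made up, the group and port are copied from my ha.cf, and it assumes it is run as root (port 694 is below 1024). As far as I understand, a packet capture sees frames before the host's packet filter does, so a capture showing traffic does not by itself prove that a socket on the same machine receives it.

```python
import socket

GROUP = "225.0.0.1"   # multicast group taken from the mcast line in ha.cf
PORT = 694            # heartbeat port from ha.cf; binding below 1024 needs root


def build_mreq(group, iface_ip="0.0.0.0"):
    """Pack an ip_mreq: 4-byte group address followed by 4-byte interface address."""
    return socket.inet_aton(group) + socket.inet_aton(iface_ip)


def open_listener(group=GROUP, port=PORT, iface_ip="0.0.0.0"):
    """Return a UDP socket bound to `port` that has joined `group`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    build_mreq(group, iface_ip))
    return sock


def listen_once(timeout=10.0):
    """Wait for one datagram and report who sent it, or report a timeout."""
    sock = open_listener()
    sock.settimeout(timeout)
    try:
        data, sender = sock.recvfrom(65535)
        print("received %d bytes from %s" % (len(data), sender[0]))
    except socket.timeout:
        print("nothing delivered to the socket within %.0f seconds" % timeout)
    finally:
        sock.close()
```

Calling listen_once() on one node while heartbeat runs on the others should show whether the datagrams actually reach userspace. Passing the node's own eth0 address as iface_ip to open_listener() pins the group membership to that interface instead of letting the kernel choose.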

I installed heartbeat from yum as seen here:

  [r...@ct02 ha.d]# yum list heartbeat
  Loaded plugins: refresh-packagekit
  Installed Packages
  heartbeat.x86_64               2.1.4-6.fc11       @fedora

My ha.cf is as follows:

  logfacility     local0
  deadtime 60
  warntime 20
  initdead 120
  mcast eth0 225.0.0.1 694 1 0
  auto_failback on
  node ct01.xyz.com
  node ct02.xyz.com
  node ct03.xyz.com
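
For reference, the fields of the mcast line are device, group, port, ttl and loop, which matches the "group 225.0.0.1 port 694 interface eth0 (ttl=1 loop=0)" line heartbeat logs at startup. Heartbeat also expects each node entry to match what `uname -n` prints on that machine. The following is a rough sketch for checking both; parse_ha_cf and local_node_listed are hypothetical helper names, and the exact string comparison is an assumption about how strictly names are matched.

```python
import os

# The ha.cf shown above, embedded so the sketch is self-contained.
HA_CF = """\
logfacility     local0
deadtime 60
warntime 20
initdead 120
mcast eth0 225.0.0.1 694 1 0
auto_failback on
node ct01.xyz.com
node ct02.xyz.com
node ct03.xyz.com
"""


def parse_ha_cf(text):
    """Collect the node names and the fields of the mcast directive."""
    nodes, mcast = [], None
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if not fields:
            continue
        if fields[0] == "node":
            nodes.extend(fields[1:])
        elif fields[0] == "mcast" and len(fields) == 6:
            dev, group, port, ttl, loop = fields[1:]
            mcast = {"dev": dev, "group": group, "port": int(port),
                     "ttl": int(ttl), "loop": int(loop)}
    return nodes, mcast


def local_node_listed(nodes, local_name=None):
    """True if this machine's `uname -n` name is among the node entries."""
    if local_name is None:
        local_name = os.uname().nodename
    return local_name in nodes
```

Running local_node_listed(parse_ha_cf(HA_CF)[0]) on each guest would show whether the name from `uname -n` is the fully qualified one used in the node lines or only the short host name.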

My haresources is as follows:
  
  ct01.xyz.com 192.168.101.210 httpd smb

As you can see from the following log, the other nodes are immediately
seen as dead, and no traffic is ever reported for them. The logs are just
about identical on all three nodes.

Is this something to do with VirtualBox? I wouldn't think so, because
Wireshark has no problem observing the traffic. I feel like I must be
missing something stupid here.

Please help!

  
Apr 14 16:13:34 ct01 heartbeat: [8077]: info: heartbeat: version 2.1.4
Apr 14 16:13:34 ct01 heartbeat: [8077]: info: Heartbeat generation: 1271278771
Apr 14 16:13:34 ct01 heartbeat: [8077]: info: glib: UDP multicast heartbeat started for group 225.0.0.1 port 694 interface eth0 (ttl=1 loop=0)
Apr 14 16:13:34 ct01 heartbeat: [8077]: info: G_main_add_TriggerHandler: Added signal manual handler
Apr 14 16:13:34 ct01 heartbeat: [8077]: info: G_main_add_TriggerHandler: Added signal manual handler
Apr 14 16:13:34 ct01 heartbeat: [8077]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Apr 14 16:13:34 ct01 heartbeat: [8077]: info: Local status now set to: 'up'
Apr 14 16:15:34 ct01 heartbeat: [8077]: WARN: node ct02.xyz.com: is dead
Apr 14 16:15:34 ct01 heartbeat: [8077]: WARN: No STONITH device configured.
Apr 14 16:15:34 ct01 heartbeat: [8077]: WARN: Shared disks are not protected.
Apr 14 16:15:34 ct01 heartbeat: [8077]: info: Resources being acquired from ct02.xyz.com.
Apr 14 16:15:34 ct01 heartbeat: [8077]: WARN: node ct03.xyz.com: is dead
Apr 14 16:15:34 ct01 heartbeat: [8077]: info: Comm_now_up(): updating status to active
Apr 14 16:15:34 ct01 heartbeat: [8077]: info: Local status now set to: 'active'
Apr 14 16:15:34 ct01 heartbeat: [8077]: WARN: No STONITH device configured.
Apr 14 16:15:34 ct01 heartbeat: [8077]: WARN: Shared disks are not protected.
Apr 14 16:15:34 ct01 heartbeat: [8077]: info: Resources being acquired from ct03.xyz.com.
Apr 14 16:15:34 ct01 harc[8084]: info: Running /etc/ha.d/rc.d/status status
Apr 14 16:15:34 ct01 mach_down[8151]: info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
Apr 14 16:15:34 ct01 mach_down[8151]: info: mach_down takeover complete for node ct02.xyz.com.
Apr 14 16:15:34 ct01 heartbeat: [8077]: info: mach_down takeover complete.
Apr 14 16:15:34 ct01 heartbeat: [8077]: info: Initial resource acquisition complete (mach_down)
Apr 14 16:15:34 ct01 IPaddr[8183]: INFO:  Resource is stopped
Apr 14 16:15:34 ct01 IPaddr[8186]: INFO:  Resource is stopped
Apr 14 16:15:34 ct01 heartbeat: [8086]: info: Local Resource acquisition completed.
Apr 14 16:15:34 ct01 heartbeat: [8085]: info: Local Resource acquisition completed.
Apr 14 16:15:34 ct01 harc[8303]: info: Running /etc/ha.d/rc.d/status status
Apr 14 16:15:34 ct01 mach_down[8319]: info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
Apr 14 16:15:34 ct01 mach_down[8319]: info: mach_down takeover complete for node ct03.xyz.com.
Apr 14 16:15:34 ct01 heartbeat: [8077]: info: mach_down takeover complete.
Apr 14 16:15:34 ct01 harc[8353]: info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
Apr 14 16:15:34 ct01 ip-request-resp[8353]: received ip-request-resp 192.168.101.210/24 OK yes
Apr 14 16:15:34 ct01 ResourceManager[8374]: info: Acquiring resource group: ct01.xyz.com 192.168.101.210/24 httpd smb
Apr 14 16:15:34 ct01 IPaddr[8401]: INFO:  Resource is stopped
Apr 14 16:15:35 ct01 ResourceManager[8374]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.101.210/24 start
Apr 14 16:15:35 ct01 IPaddr[8495]: INFO: Using calculated nic for 192.168.101.210: eth0
Apr 14 16:15:35 ct01 IPaddr[8495]: INFO: Using calculated netmask for 192.168.101.210: 255.255.255.0
Apr 14 16:15:35 ct01 IPaddr[8495]: INFO: eval ifconfig eth0:0 192.168.101.210 netmask 255.255.255.0 broadcast 192.168.101.255
Apr 14 16:15:35 ct01 avahi-daemon[1074]: Registering new address record for 192.168.101.210 on eth0.IPv4.
Apr 14 16:15:35 ct01 IPaddr[8469]: INFO:  Success
Apr 14 16:15:35 ct01 ResourceManager[8374]: info: Running /etc/init.d/httpd  start
Apr 14 16:15:35 ct01 ResourceManager[8374]: info: Running /etc/init.d/smb  start
Apr 14 16:15:35 ct01 harc[8671]: info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
Apr 14 16:15:35 ct01 ip-request-resp[8671]: received ip-request-resp 192.168.101.210/24 OK yes
Apr 14 16:15:35 ct01 ResourceManager[8692]: info: Acquiring resource group: ct01.xyz.com 192.168.101.210/24 httpd smb
Apr 14 16:15:35 ct01 IPaddr[8719]: INFO:  Running OK
Apr 14 16:15:35 ct01 ResourceManager[8692]: info: Running /etc/init.d/httpd  start
Apr 14 16:15:36 ct01 ResourceManager[8692]: ERROR: Return code 1 from /etc/init.d/httpd
Apr 14 16:15:36 ct01 ResourceManager[8692]: CRIT: Giving up resources due to failure of httpd
Apr 14 16:15:36 ct01 ResourceManager[8692]: info: Releasing resource group: ct01.xyz.com 192.168.101.210/24 httpd smb
Apr 14 16:15:36 ct01 ResourceManager[8692]: info: Running /etc/init.d/smb  stop
Apr 14 16:15:36 ct01 smbd[8666]: [2010/04/14 16:15:36,  0] smbd/server.c:457(smbd_open_one_socket)
Apr 14 16:15:36 ct01 smbd[8666]:   smbd_open_once_socket: open_socket_in: Address already in use
Apr 14 16:15:36 ct01 smbd[8666]: [2010/04/14 16:15:36,  0] smbd/server.c:457(smbd_open_one_socket)
Apr 14 16:15:36 ct01 smbd[8666]:   smbd_open_once_socket: open_socket_in: Address already in use
Apr 14 16:15:36 ct01 ntpd[1368]: Listening on interface #7 eth0:0, 192.168.101.210#123 Enabled
Apr 14 16:15:37 ct01 ResourceManager[8692]: info: Running /etc/init.d/httpd  stop
Apr 14 16:15:37 ct01 ResourceManager[8692]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.101.210/24 stop
Apr 14 16:15:37 ct01 IPaddr[8924]: INFO: ifconfig eth0:0 down
Apr 14 16:15:37 ct01 avahi-daemon[1074]: Withdrawing address record for 192.168.101.210 on eth0.
Apr 14 16:15:37 ct01 IPaddr[8898]: INFO:  Success
Apr 14 16:15:38 ct01 ntpd[1368]: Deleting interface #7 eth0:0, 192.168.101.210#123, interface stats: received=0, sent=0, dropped=0, active_time=2 secs
Apr 14 16:15:45 ct01 heartbeat: [8077]: info: Local Resource acquisition completed. (none)
Apr 14 16:15:45 ct01 heartbeat: [8077]: info: local resource transition completed.
Apr 14 16:16:07 ct01 hb_standby[8969]: Going standby [foreign].
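
A note on the timing in the log above: the gap between "Local status now set to: 'up'" (16:13:34) and the first "is dead" warning (16:15:34) can be worked out with a few lines of Python. This is only a sketch; seconds_between is a made-up helper, and the year is assumed to be 2010 because syslog timestamps do not carry one.

```python
from datetime import datetime


def seconds_between(first, second, year=2010):
    """Seconds elapsed between two syslog-style timestamps such as 'Apr 14 16:13:34'."""
    fmt = "%Y %b %d %H:%M:%S"  # %b expects English month abbreviations (C locale)
    start = datetime.strptime("%d %s" % (year, first), fmt)
    end = datetime.strptime("%d %s" % (year, second), fmt)
    return int((end - start).total_seconds())


# Timestamps copied from the 'up' line and the first 'is dead' line.
gap = seconds_between("Apr 14 16:13:34", "Apr 14 16:15:34")
print(gap)
```

The result is 120 seconds, the same as the initdead value in ha.cf, so the other nodes appear to be declared dead as soon as the initial dead-time window runs out, with nothing having been heard from them in between.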
