Forgive me if this is a lengthy email; this is my first HA issue, and I've included some logs at the end.
For the sake of privacy, I've used dummy IPs here. My master is 192.168.1.101, the slave is 192.168.1.102, and the shared IP is 192.168.1.103. My servers are at Rackspace, and the configuration was done according to this guide:

http://cloudservers.rackspacecloud.com/index.php/IP_Failover_-_Setup_and_Installing_Heartbeat

As of now I can replicate the following behavior (with heartbeat configured to start with the server automatically):

1.) I shut down the slave node completely (server name failover at 192.168.1.102)
2.) I then reboot the master node (server name app1 at 192.168.1.101)
3.) After the reboot is finished, the master begins responding to requests on the shared IP
4.) I then start up the slave
5.) The master continues to respond to requests on the shared IP
6.) I reboot the master
7.) The slave immediately begins responding to requests on the shared IP while the master is rebooting (failover seems to work)
8.) The master finishes rebooting
9.) Requests / pings to the shared IP begin to time out; neither server responds
10.) I reboot the master again
11.) The slave begins responding to requests on the shared IP again while the master is rebooting
12.) When the master finishes rebooting, requests / pings to the shared IP begin to time out again

I can repeat steps 10-12 over and over. Each time the master comes back online, the shared IP stops responding completely.

However, this happens only if heartbeat is set to start up with the server using "chkconfig heartbeat on". If I have "chkconfig heartbeat off", then when the master finishes rebooting, the slave continues responding to requests on the shared IP. It is not until I start heartbeat that the shared IP stops responding. However, a RESTART of heartbeat (either after starting it manually if chkconfig is off, or just restarting it after it has started with the server automatically if chkconfig is on) causes the master to begin responding to requests on the shared IP.
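
To put numbers on the timeouts described above (for example, how long the shared IP stays silent after step 9 or step 12), a small polling script could record when replies stop and resume. The following is only a sketch in Python, which is not part of the setup described here. The function names (ping_once, watch, outages) are made up for illustration, and it assumes a Linux ping that accepts -c and -W.

```python
import subprocess
import time
from typing import Iterable, List, Tuple

SHARED_IP = "192.168.1.103"  # the dummy shared IP used in this email


def ping_once(ip: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo request; True if it was answered.

    Assumes Linux iputils ping (-c count, -W reply timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watch(ip: str, duration_s: float, interval_s: float = 1.0) -> List[Tuple[float, bool]]:
    """Poll `ip` for `duration_s` seconds and return (timestamp, answered) samples."""
    samples = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        samples.append((time.time(), ping_once(ip)))
        time.sleep(interval_s)
    return samples


def outages(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Collapse samples into (start, end) intervals during which nothing answered.

    An outage still open at the last sample is closed at that sample's timestamp.
    """
    intervals = []
    start = None
    last_ts = None
    for ts, answered in samples:
        last_ts = ts
        if not answered and start is None:
            start = ts  # first missed reply: an outage begins
        elif answered and start is not None:
            intervals.append((start, ts))  # replies resumed: the outage ends
            start = None
    if start is not None:
        intervals.append((start, last_ts))
    return intervals

# Example (run from a third machine while rebooting the master):
#   for start, end in outages(watch(SHARED_IP, duration_s=600)):
#       print(f"no reply for {end - start:.0f}s starting at {time.ctime(start)}")
```

Run from a machine that is neither the master nor the slave, this would list each silent period with its length, which makes it easier to compare the near-instant failover to the slave with the multi-minute wait before the master answers again.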

So, the first time heartbeat starts, either with the server or when I start it manually, it causes the shared IP to stop responding. A RESTART of heartbeat after the initial start causes the master to begin responding on the IP again.

Rebooting or shutting down the slave when the IP stops responding does not cause the master to start responding immediately. Instead, the shared IP continues to time out. If I wait for a few minutes after shutting down the slave, the master eventually begins to respond on the shared IP again. However, it does not begin to respond immediately the way the slave does when the master goes offline. Once the master begins responding again after I've shut down the slave, I can then restart the slave, and the master continues to respond to requests until I reboot it, at which point the slave begins to respond. However, as soon as the master finishes rebooting, the shared IP becomes unresponsive again the first time heartbeat starts. Restarting heartbeat causes the IP to begin to respond again on the master. In contrast, shutting down the slave when the IP is not responding does not cause the master to start responding again. Requests continue to time out until the slave is restarted and starts responding again, or until I wait long enough for the master to begin responding again while the slave is shut down.

I'm wondering, why is there such a delay for the master to begin responding again when the slave goes offline? It seems the only way to give the master control again after a failover to the slave is to shut down the slave completely, reboot the master and... wait (or restart heartbeat). Rebooting the master after shutting down the slave completely still does not cause the master to start responding to requests upon reboot. I still have to just wait until it begins to respond again.

So, it seems that the issue is that when the master starts heartbeat the first time, something happens that makes the shared IP stop responding.
Stopping heartbeat on the master causes the slave to take over, and then restarting heartbeat on the master gives control back to the master.

I also wanted to include what the logs show when I perform a heartbeat restart on the master after a reboot. If I set chkconfig heartbeat off and reboot the master, the following is added to /var/log/ha-log on the master as the server reboots:

heartbeat[3259]: 2010/08/06_01:31:41 info: Heartbeat shutdown in progress. (3259)
heartbeat[3640]: 2010/08/06_01:31:41 info: Giving up all HA resources.
ResourceManager[3653]: 2010/08/06_01:31:41 info: Releasing resource group: app1 192.168.1.103/24
ResourceManager[3653]: 2010/08/06_01:31:41 info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.103/24 stop
IPaddr[3719]: 2010/08/06_01:31:41 INFO: ifconfig eth0:0 down
IPaddr[3693]: 2010/08/06_01:31:41 INFO: Success
heartbeat[3640]: 2010/08/06_01:31:41 info: All HA resources relinquished.
heartbeat[3259]: 2010/08/06_01:31:42 WARN: 1 lost packet(s) for [failover] [7412:7414]
heartbeat[3259]: 2010/08/06_01:31:42 info: No pkts missing from failover!
heartbeat[3259]: 2010/08/06_01:31:42 info: killing /usr/lib64/heartbeat/ipfail process group 3282 with signal 15
heartbeat[3259]: 2010/08/06_01:31:43 info: killing HBFIFO process 3261 with signal 15
heartbeat[3259]: 2010/08/06_01:31:43 info: killing HBWRITE process 3262 with signal 15
heartbeat[3259]: 2010/08/06_01:31:43 info: killing HBREAD process 3263 with signal 15
heartbeat[3259]: 2010/08/06_01:31:43 info: Core process 3261 exited. 3 remaining
heartbeat[3259]: 2010/08/06_01:31:43 info: Core process 3262 exited. 2 remaining
heartbeat[3259]: 2010/08/06_01:31:43 info: Core process 3263 exited. 1 remaining
heartbeat[3259]: 2010/08/06_01:31:43 info: app1 Heartbeat shutdown complete.

This is added to /var/log/ha-log on the slave:

heartbeat[2682]: 2010/08/06_01:31:58 info: Received shutdown notice from 'app1'.
heartbeat[2682]: 2010/08/06_01:31:58 info: Resources being acquired from app1.
heartbeat[9783]: 2010/08/06_01:31:58 info: acquire local HA resources (standby).
heartbeat[9783]: 2010/08/06_01:31:58 info: local HA resource acquisition completed (standby).
heartbeat[2682]: 2010/08/06_01:31:58 info: Standby resource acquisition done [foreign].
heartbeat[9784]: 2010/08/06_01:31:58 info: No local resources [/usr/share/heartbeat/ResourceManager listkeys failover] to acquire.
harc[9809]: 2010/08/06_01:31:58 info: Running /etc/ha.d/rc.d/status status
mach_down[9825]: 2010/08/06_01:31:58 info: Taking over resource group 192.168.1.103/24
ResourceManager[9851]: 2010/08/06_01:31:58 info: Acquiring resource group: app1 192.168.1.103/24
IPaddr[9878]: 2010/08/06_01:31:58 INFO: Resource is stopped
ResourceManager[9851]: 2010/08/06_01:31:58 info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.103/24 start
IPaddr[9972]: 2010/08/06_01:31:58 INFO: Using calculated nic for 192.168.1.103: eth0
IPaddr[9972]: 2010/08/06_01:31:58 INFO: Using calculated netmask for 192.168.1.103: 255.255.255.0
IPaddr[9972]: 2010/08/06_01:31:59 INFO: eval ifconfig eth0:0 192.168.1.103 netmask 255.255.255.0 broadcast 192.168.1.255
IPaddr[9946]: 2010/08/06_01:31:59 INFO: Success
mach_down[9825]: 2010/08/06_01:31:59 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[9825]: 2010/08/06_01:31:59 info: mach_down takeover complete for node app1.
heartbeat[2682]: 2010/08/06_01:31:59 info: mach_down takeover complete.
heartbeat[2682]: 2010/08/06_01:32:15 WARN: node app1: is dead
heartbeat[2682]: 2010/08/06_01:32:15 info: Dead node app1 gave up resources.
ipfail[2747]: 2010/08/06_01:32:15 info: Status update: Node app1 now has status dead
heartbeat[2682]: 2010/08/06_01:32:15 info: Link app1:eth1 dead.
ipfail[2747]: 2010/08/06_01:32:15 info: NS: We are dead. :<
ipfail[2747]: 2010/08/06_01:32:16 info: Link Status update: Link app1/eth1 now has status dead
ipfail[2747]: 2010/08/06_01:32:17 info: We are dead. :<
ipfail[2747]: 2010/08/06_01:32:17 info: Asking other side for ping node count.

When I try the first start of heartbeat on master after reboot (chkconfig for heartbeat off), the following is shown:

Starting High-Availability services:
2010/08/06_01:36:05 INFO: Running OK
2010/08/06_01:36:05 CRITICAL: Resource 192.168.1.103/24 is active, and should not be!
2010/08/06_01:36:05 CRITICAL: Non-idle resources can affect data integrity!
2010/08/06_01:36:05 info: If you don't know what this means, then get help!
2010/08/06_01:36:05 info: Read the docs and/or source to /usr/share/heartbeat/ResourceManager for more details.
CRITICAL: Resource 192.168.1.103/24 is active, and should not be!
CRITICAL: Non-idle resources can affect data integrity!
info: If you don't know what this means, then get help!
info: Read the docs and/or the source to /usr/share/heartbeat/ResourceManager for more details.
2010/08/06_01:36:05 CRITICAL: Non-idle resources will affect resource takeback!
2010/08/06_01:36:05 CRITICAL: Non-idle resources may affect data integrity!
[ OK ]

And the following is added to /var/log/ha-log on master:

heartbeat[2879]: 2010/08/06_01:36:05 WARN: Logging daemon is disabled --enabling logging daemon is recommended
heartbeat[2879]: 2010/08/06_01:36:05 info: Version 2 support: false
heartbeat[2879]: 2010/08/06_01:36:05 WARN: Logging daemon is disabled --enabling logging daemon is recommended
heartbeat[2879]: 2010/08/06_01:36:05 info: **************************
heartbeat[2879]: 2010/08/06_01:36:05 info: Configuration validated. Starting heartbeat 2.1.3
heartbeat[2880]: 2010/08/06_01:36:05 info: heartbeat: version 2.1.3
heartbeat[2880]: 2010/08/06_01:36:05 info: Heartbeat generation: 1280995153
heartbeat[2880]: 2010/08/06_01:36:05 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
heartbeat[2880]: 2010/08/06_01:36:05 info: glib: ucast: bound send socket to device: eth1
heartbeat[2880]: 2010/08/06_01:36:05 info: glib: ucast: bound receive socket to device: eth1
heartbeat[2880]: 2010/08/06_01:36:05 info: glib: ucast: started on port 694 interface eth1 to 10.179.80.55
heartbeat[2880]: 2010/08/06_01:36:05 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[2880]: 2010/08/06_01:36:05 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[2880]: 2010/08/06_01:36:05 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[2880]: 2010/08/06_01:36:05 info: Local status now set to: 'up'
heartbeat[2880]: 2010/08/06_01:36:06 info: Link failover:eth1 up.
heartbeat[2880]: 2010/08/06_01:36:06 info: Comm_now_up(): updating status to active
heartbeat[2880]: 2010/08/06_01:36:06 info: Local status now set to: 'active'
heartbeat[2880]: 2010/08/06_01:36:06 info: Starting child client "/usr/lib64/heartbeat/ipfail" (498,496)
heartbeat[2888]: 2010/08/06_01:36:06 info: Starting "/usr/lib64/heartbeat/ipfail" as uid 498 gid 496 (pid 2888)
heartbeat[2880]: 2010/08/06_01:36:07 info: Status update for node failover: status active
heartbeat[2880]: 2010/08/06_01:36:07 info: remote resource transition completed.
heartbeat[2880]: 2010/08/06_01:36:07 info: remote resource transition completed.
heartbeat[2880]: 2010/08/06_01:36:07 info: Local Resource acquisition completed. (none)
harc[2891]: 2010/08/06_01:36:07 info: Running /etc/ha.d/rc.d/status status
heartbeat[2880]: 2010/08/06_01:36:07 info: failover wants to go standby [foreign]
heartbeat[2880]: 2010/08/06_01:36:08 info: standby: acquire [foreign] resources from failover
heartbeat[2907]: 2010/08/06_01:36:08 info: acquire local HA resources (standby).
ResourceManager[2920]: 2010/08/06_01:36:08 info: Acquiring resource group: app1 192.168.1.103/24
IPaddr[2947]: 2010/08/06_01:36:09 INFO: Running OK
heartbeat[2907]: 2010/08/06_01:36:09 info: local HA resource acquisition completed (standby).
heartbeat[2880]: 2010/08/06_01:36:09 info: Standby resource acquisition done [foreign].
heartbeat[2880]: 2010/08/06_01:36:09 info: Initial resource acquisition complete (auto_failback)
heartbeat[2880]: 2010/08/06_01:36:09 info: remote resource transition completed.
ipfail[2888]: 2010/08/06_01:36:10 info: Status update: Node failover now has status active
ipfail[2888]: 2010/08/06_01:36:14 info: Ping node count is balanced.
ipfail[2888]: 2010/08/06_01:36:14 info: Giving up foreign resources (auto_failback).
ipfail[2888]: 2010/08/06_01:36:14 info: Delayed giveup in 4 seconds.
ipfail[2888]: 2010/08/06_01:36:18 info: giveup() called (timeout worked)
heartbeat[2880]: 2010/08/06_01:36:19 info: app1 wants to go standby [foreign]
heartbeat[2880]: 2010/08/06_01:36:19 info: standby: failover can take our foreign resources
heartbeat[2992]: 2010/08/06_01:36:19 info: give up foreign HA resources (standby).
heartbeat[2992]: 2010/08/06_01:36:19 info: foreign HA resource release completed (standby).
heartbeat[2880]: 2010/08/06_01:36:19 info: Local standby process completed [foreign].
heartbeat[2880]: 2010/08/06_01:36:20 WARN: 1 lost packet(s) for [failover] [7566:7568]
heartbeat[2880]: 2010/08/06_01:36:20 info: remote resource transition completed.
heartbeat[2880]: 2010/08/06_01:36:20 info: No pkts missing from failover!
heartbeat[2880]: 2010/08/06_01:36:20 info: Other node completed standby takeover of foreign resources.

The following is added to /var/log/ha-log on slave:

heartbeat[2682]: 2010/08/06_01:36:22 info: Heartbeat restart on node app1
heartbeat[2682]: 2010/08/06_01:36:22 info: Link app1:eth1 up.
heartbeat[2682]: 2010/08/06_01:36:22 info: Status update for node app1: status init
ipfail[2747]: 2010/08/06_01:36:22 info: Link Status update: Link app1/eth1 now has status up
heartbeat[2682]: 2010/08/06_01:36:22 info: Status update for node app1: status up
ipfail[2747]: 2010/08/06_01:36:22 info: Status update: Node app1 now has status init
ipfail[2747]: 2010/08/06_01:36:22 info: Status update: Node app1 now has status up
harc[10090]: 2010/08/06_01:36:22 info: Running /etc/ha.d/rc.d/status status
harc[10106]: 2010/08/06_01:36:22 info: Running /etc/ha.d/rc.d/status status
heartbeat[2682]: 2010/08/06_01:36:23 info: Status update for node app1: status active
ipfail[2747]: 2010/08/06_01:36:23 info: Status update: Node app1 now has status active
harc[10122]: 2010/08/06_01:36:23 info: Running /etc/ha.d/rc.d/status status
heartbeat[2682]: 2010/08/06_01:36:23 info: remote resource transition completed.
heartbeat[2682]: 2010/08/06_01:36:23 info: failover wants to go standby [foreign]
heartbeat[2682]: 2010/08/06_01:36:24 info: standby: app1 can take our foreign resources
heartbeat[10138]: 2010/08/06_01:36:24 info: give up foreign HA resources (standby).
ipfail[2747]: 2010/08/06_01:36:24 info: Asking other side for ping node count.
ResourceManager[10151]: 2010/08/06_01:36:24 info: Releasing resource group: app1 192.168.1.103/24
ResourceManager[10151]: 2010/08/06_01:36:24 info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.103/24 stop
IPaddr[10217]: 2010/08/06_01:36:24 INFO: ifconfig eth0:0 down
IPaddr[10191]: 2010/08/06_01:36:24 INFO: Success
heartbeat[10138]: 2010/08/06_01:36:24 info: foreign HA resource release completed (standby).
heartbeat[2682]: 2010/08/06_01:36:24 info: Local standby process completed [foreign].
heartbeat[2682]: 2010/08/06_01:36:25 WARN: 1 lost packet(s) for [app1] [11:13]
heartbeat[2682]: 2010/08/06_01:36:25 info: remote resource transition completed.
heartbeat[2682]: 2010/08/06_01:36:25 info: No pkts missing from app1!
heartbeat[2682]: 2010/08/06_01:36:25 info: Other node completed standby takeover of foreign resources.
ipfail[2747]: 2010/08/06_01:36:31 info: No giveup timer to abort.
heartbeat[2682]: 2010/08/06_01:36:36 info: app1 wants to go standby [foreign]
heartbeat[2682]: 2010/08/06_01:36:36 info: standby: acquire [foreign] resources from app1
heartbeat[10247]: 2010/08/06_01:36:36 info: acquire local HA resources (standby).
heartbeat[10247]: 2010/08/06_01:36:36 info: local HA resource acquisition completed (standby).
heartbeat[2682]: 2010/08/06_01:36:36 info: Standby resource acquisition done [foreign].
heartbeat[2682]: 2010/08/06_01:36:37 info: remote resource transition completed.

When RESTARTING heartbeat on the master, the master takes over and begins responding to requests. The following is added to the /var/log/ha-log on the master:

heartbeat[2880]: 2010/08/06_01:36:20 info: Other node completed standby takeover of foreign resources.
heartbeat[2880]: 2010/08/06_01:40:19 info: Heartbeat shutdown in progress. (2880)
heartbeat[3052]: 2010/08/06_01:40:19 info: Giving up all HA resources.
ResourceManager[3065]: 2010/08/06_01:40:19 info: Releasing resource group: app1 192.168.1.103/24
ResourceManager[3065]: 2010/08/06_01:40:19 info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.103/24 stop
IPaddr[3131]: 2010/08/06_01:40:19 INFO: ifconfig eth0:1 down
IPaddr[3105]: 2010/08/06_01:40:19 INFO: Success
heartbeat[3052]: 2010/08/06_01:40:19 info: All HA resources relinquished.
heartbeat[2880]: 2010/08/06_01:40:20 WARN: 1 lost packet(s) for [failover] [7688:7690]
heartbeat[2880]: 2010/08/06_01:40:20 info: No pkts missing from failover!
heartbeat[2880]: 2010/08/06_01:40:20 info: killing /usr/lib64/heartbeat/ipfail process group 2888 with signal 15
heartbeat[2880]: 2010/08/06_01:40:22 info: killing HBFIFO process 2883 with signal 15
heartbeat[2880]: 2010/08/06_01:40:22 info: killing HBWRITE process 2884 with signal 15
heartbeat[2880]: 2010/08/06_01:40:22 info: killing HBREAD process 2885 with signal 15
heartbeat[2880]: 2010/08/06_01:40:22 info: Core process 2884 exited. 3 remaining
heartbeat[2880]: 2010/08/06_01:40:22 info: Core process 2883 exited. 2 remaining
heartbeat[2880]: 2010/08/06_01:40:22 info: Core process 2885 exited. 1 remaining
heartbeat[2880]: 2010/08/06_01:40:22 info: app1 Heartbeat shutdown complete.
heartbeat[3250]: 2010/08/06_01:40:48 WARN: Logging daemon is disabled --enabling logging daemon is recommended
heartbeat[3250]: 2010/08/06_01:40:48 info: Version 2 support: false
heartbeat[3250]: 2010/08/06_01:40:48 WARN: Logging daemon is disabled --enabling logging daemon is recommended
heartbeat[3250]: 2010/08/06_01:40:48 info: **************************
heartbeat[3250]: 2010/08/06_01:40:48 info: Configuration validated. Starting heartbeat 2.1.3
heartbeat[3251]: 2010/08/06_01:40:48 info: heartbeat: version 2.1.3
heartbeat[3251]: 2010/08/06_01:40:48 info: Heartbeat generation: 1280995154
heartbeat[3251]: 2010/08/06_01:40:48 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth1
heartbeat[3251]: 2010/08/06_01:40:48 info: glib: ucast: bound send socket to device: eth1
heartbeat[3251]: 2010/08/06_01:40:48 info: glib: ucast: bound receive socket to device: eth1
heartbeat[3251]: 2010/08/06_01:40:48 info: glib: ucast: started on port 694 interface eth1 to 10.179.80.55
heartbeat[3251]: 2010/08/06_01:40:48 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[3251]: 2010/08/06_01:40:48 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[3251]: 2010/08/06_01:40:48 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[3251]: 2010/08/06_01:40:48 info: Local status now set to: 'up'
heartbeat[3251]: 2010/08/06_01:40:49 info: Link failover:eth1 up.
heartbeat[3251]: 2010/08/06_01:40:49 info: Comm_now_up(): updating status to active
heartbeat[3251]: 2010/08/06_01:40:49 info: Local status now set to: 'active'
heartbeat[3251]: 2010/08/06_01:40:49 info: Starting child client "/usr/lib64/heartbeat/ipfail" (498,496)
heartbeat[3258]: 2010/08/06_01:40:49 info: Starting "/usr/lib64/heartbeat/ipfail" as uid 498 gid 496 (pid 3258)
heartbeat[3251]: 2010/08/06_01:40:50 info: remote resource transition completed.
heartbeat[3251]: 2010/08/06_01:40:50 info: remote resource transition completed.
heartbeat[3251]: 2010/08/06_01:40:50 info: Local Resource acquisition completed. (none)
heartbeat[3251]: 2010/08/06_01:40:50 info: Status update for node failover: status active
heartbeat[3251]: 2010/08/06_01:40:50 info: failover wants to go standby [foreign]
harc[3261]: 2010/08/06_01:40:50 info: Running /etc/ha.d/rc.d/status status
heartbeat[3251]: 2010/08/06_01:40:51 info: standby: acquire [foreign] resources from failover
heartbeat[3277]: 2010/08/06_01:40:51 info: acquire local HA resources (standby).
ResourceManager[3290]: 2010/08/06_01:40:51 info: Acquiring resource group: app1 192.168.1.103/24
IPaddr[3317]: 2010/08/06_01:40:51 INFO: Resource is stopped
ResourceManager[3290]: 2010/08/06_01:40:51 info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.103/24 start
IPaddr[3411]: 2010/08/06_01:40:52 INFO: Using calculated nic for 192.168.1.103: eth0
IPaddr[3411]: 2010/08/06_01:40:52 INFO: Using calculated netmask for 192.168.1.103: 255.255.255.0
IPaddr[3411]: 2010/08/06_01:40:52 INFO: eval ifconfig eth0:0 192.168.1.103 netmask 255.255.255.0 broadcast 192.168.1.255
IPaddr[3385]: 2010/08/06_01:40:52 INFO: Success
heartbeat[3277]: 2010/08/06_01:40:52 info: local HA resource acquisition completed (standby).
heartbeat[3251]: 2010/08/06_01:40:52 info: Standby resource acquisition done [foreign].
heartbeat[3251]: 2010/08/06_01:40:52 info: Initial resource acquisition complete (auto_failback)
heartbeat[3251]: 2010/08/06_01:40:52 info: remote resource transition completed.
ipfail[3258]: 2010/08/06_01:40:53 info: Status update: Node failover now has status active
ipfail[3258]: 2010/08/06_01:40:56 info: Ping node count is balanced.
ipfail[3258]: 2010/08/06_01:40:56 info: Giving up foreign resources (auto_failback).
ipfail[3258]: 2010/08/06_01:40:56 info: Delayed giveup in 4 seconds.
ipfail[3258]: 2010/08/06_01:41:00 info: giveup() called (timeout worked)
heartbeat[3251]: 2010/08/06_01:41:01 info: app1 wants to go standby [foreign]
heartbeat[3251]: 2010/08/06_01:41:01 info: standby: failover can take our foreign resources
heartbeat[3512]: 2010/08/06_01:41:01 info: give up foreign HA resources (standby).
heartbeat[3512]: 2010/08/06_01:41:01 info: foreign HA resource release completed (standby).
heartbeat[3251]: 2010/08/06_01:41:01 info: Local standby process completed [foreign].
heartbeat[3251]: 2010/08/06_01:41:02 WARN: 1 lost packet(s) for [failover] [7725:7727]
heartbeat[3251]: 2010/08/06_01:41:02 info: remote resource transition completed.
heartbeat[3251]: 2010/08/06_01:41:02 info: No pkts missing from failover!
heartbeat[3251]: 2010/08/06_01:41:02 info: Other node completed standby takeover of foreign resources.

The /var/log/ha-log on the slave gets:

heartbeat[2682]: 2010/08/06_01:40:36 info: Received shutdown notice from 'app1'.
heartbeat[2682]: 2010/08/06_01:40:36 info: Resources being acquired from app1.
heartbeat[10278]: 2010/08/06_01:40:36 info: acquire local HA resources (standby).
heartbeat[10278]: 2010/08/06_01:40:36 info: local HA resource acquisition completed (standby).
heartbeat[10279]: 2010/08/06_01:40:36 info: No local resources [/usr/share/heartbeat/ResourceManager listkeys failover] to acquire.
heartbeat[2682]: 2010/08/06_01:40:36 info: Standby resource acquisition done [foreign].
harc[10304]: 2010/08/06_01:40:36 info: Running /etc/ha.d/rc.d/status status
mach_down[10320]: 2010/08/06_01:40:36 info: Taking over resource group 192.168.1.103/24
ResourceManager[10346]: 2010/08/06_01:40:36 info: Acquiring resource group: app1 192.168.1.103/24
IPaddr[10373]: 2010/08/06_01:40:36 INFO: Resource is stopped
ResourceManager[10346]: 2010/08/06_01:40:36 info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.103/24 start
IPaddr[10467]: 2010/08/06_01:40:37 INFO: Using calculated nic for 192.168.1.103: eth0
IPaddr[10467]: 2010/08/06_01:40:37 INFO: Using calculated netmask for 192.168.1.103: 255.255.255.0
IPaddr[10467]: 2010/08/06_01:40:37 INFO: eval ifconfig eth0:0 192.168.1.103 netmask 255.255.255.0 broadcast 192.168.1.255
IPaddr[10441]: 2010/08/06_01:40:37 INFO: Success
mach_down[10320]: 2010/08/06_01:40:37 info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[10320]: 2010/08/06_01:40:37 info: mach_down takeover complete for node app1.
heartbeat[2682]: 2010/08/06_01:40:37 info: mach_down takeover complete.
heartbeat[2682]: 2010/08/06_01:40:53 WARN: node app1: is dead
heartbeat[2682]: 2010/08/06_01:40:53 info: Dead node app1 gave up resources.
ipfail[2747]: 2010/08/06_01:40:53 info: Status update: Node app1 now has status dead
heartbeat[2682]: 2010/08/06_01:40:54 info: Link app1:eth1 dead.
ipfail[2747]: 2010/08/06_01:40:54 info: NS: We are dead. :<
ipfail[2747]: 2010/08/06_01:40:54 info: Link Status update: Link app1/eth1 now has status dead
ipfail[2747]: 2010/08/06_01:40:55 info: We are dead. :<
ipfail[2747]: 2010/08/06_01:40:55 info: Asking other side for ping node count.
heartbeat[2682]: 2010/08/06_01:41:05 info: Heartbeat restart on node app1
heartbeat[2682]: 2010/08/06_01:41:05 info: Link app1:eth1 up.
heartbeat[2682]: 2010/08/06_01:41:05 info: Status update for node app1: status init
ipfail[2747]: 2010/08/06_01:41:05 info: Link Status update: Link app1/eth1 now has status up
heartbeat[2682]: 2010/08/06_01:41:05 info: Status update for node app1: status up
ipfail[2747]: 2010/08/06_01:41:05 info: Status update: Node app1 now has status init
ipfail[2747]: 2010/08/06_01:41:05 info: Status update: Node app1 now has status up
harc[10580]: 2010/08/06_01:41:05 info: Running /etc/ha.d/rc.d/status status
harc[10596]: 2010/08/06_01:41:05 info: Running /etc/ha.d/rc.d/status status
heartbeat[2682]: 2010/08/06_01:41:06 info: Status update for node app1: status active
ipfail[2747]: 2010/08/06_01:41:06 info: Status update: Node app1 now has status active
harc[10612]: 2010/08/06_01:41:06 info: Running /etc/ha.d/rc.d/status status
heartbeat[2682]: 2010/08/06_01:41:06 info: remote resource transition completed.
heartbeat[2682]: 2010/08/06_01:41:06 info: failover wants to go standby [foreign]
ipfail[2747]: 2010/08/06_01:41:06 info: Asking other side for ping node count.
heartbeat[2682]: 2010/08/06_01:41:07 info: standby: app1 can take our foreign resources
heartbeat[10628]: 2010/08/06_01:41:07 info: give up foreign HA resources (standby).
ResourceManager[10641]: 2010/08/06_01:41:07 info: Releasing resource group: app1 192.168.1.103/24
ResourceManager[10641]: 2010/08/06_01:41:07 info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.103/24 stop
IPaddr[10707]: 2010/08/06_01:41:07 INFO: ifconfig eth0:0 down
IPaddr[10681]: 2010/08/06_01:41:07 INFO: Success
heartbeat[10628]: 2010/08/06_01:41:07 info: foreign HA resource release completed (standby).
heartbeat[2682]: 2010/08/06_01:41:07 info: Local standby process completed [foreign].
heartbeat[2682]: 2010/08/06_01:41:08 WARN: 1 lost packet(s) for [app1] [11:13]
heartbeat[2682]: 2010/08/06_01:41:08 info: remote resource transition completed.
heartbeat[2682]: 2010/08/06_01:41:08 info: No pkts missing from app1!
heartbeat[2682]: 2010/08/06_01:41:08 info: Other node completed standby takeover of foreign resources.
ipfail[2747]: 2010/08/06_01:41:13 info: No giveup timer to abort.
heartbeat[2682]: 2010/08/06_01:41:18 info: app1 wants to go standby [foreign]
heartbeat[2682]: 2010/08/06_01:41:18 info: standby: acquire [foreign] resources from app1
heartbeat[10737]: 2010/08/06_01:41:18 info: acquire local HA resources (standby).
heartbeat[10737]: 2010/08/06_01:41:18 info: local HA resource acquisition completed (standby).
heartbeat[2682]: 2010/08/06_01:41:18 info: Standby resource acquisition done [foreign].
heartbeat[2682]: 2010/08/06_01:41:19 info: remote resource transition completed.
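
The entries that say most about who actually holds 192.168.1.103 are arguably the ones from the IPaddr resource script, since they show when each node adds or removes the alias. As a rough aid for comparing the first start with the restart, the sketch below (Python, with hypothetical names throughout) pulls those entries out of ha-log text from both nodes and merges them into one timeline. The regular expression is inferred only from the log lines quoted in this email, and the meaning attached to each message (for instance "Running OK" taken to mean the address was already configured, so nothing was done) is an assumption, not something taken from the heartbeat documentation.

```python
import re
from datetime import datetime
from typing import Iterable, List, NamedTuple

# Pattern inferred from lines such as:
#   IPaddr[10217]: 2010/08/06_01:36:24 INFO: ifconfig eth0:0 down
LINE_RE = re.compile(
    r"^(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<ts>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>\w+):\s+(?P<msg>.*)$"
)


class Event(NamedTuple):
    when: datetime
    node: str
    action: str


def classify(msg: str) -> str:
    """Map an IPaddr message to a short label (interpretations are assumptions)."""
    if msg.startswith("eval ifconfig"):
        return "alias added"
    if msg.startswith("ifconfig") and msg.endswith("down"):
        return "alias removed"
    if msg == "Running OK":
        return "already present, nothing done"
    if msg == "Resource is stopped":
        return "not present"
    return ""  # e.g. "Success" or "Using calculated nic ...": not interesting here


def ipaddr_events(node: str, lines: Iterable[str]) -> List[Event]:
    """Return the IPaddr events found in one node's ha-log lines."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group("proc") != "IPaddr":
            continue
        action = classify(match.group("msg"))
        if action:
            when = datetime.strptime(match.group("ts"), "%Y/%m/%d_%H:%M:%S")
            events.append(Event(when, node, action))
    return events


def timeline(master_lines: Iterable[str], slave_lines: Iterable[str]) -> List[Event]:
    """Merge both nodes' IPaddr events, ordered by logged timestamp."""
    return sorted(ipaddr_events("app1", master_lines) + ipaddr_events("failover", slave_lines))
```

Fed the first-start excerpts above, this gives "already present, nothing done" on app1 followed by "alias removed" on failover, whereas the restart excerpts show app1 going through "not present" and "alias added". Ordering across nodes is only approximate: the two clocks appear to differ by several seconds in these logs (the master logs its shutdown at 01:31:41, while the slave logs the shutdown notice at 01:31:58).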
