Re: [Linux-HA] heartbeat startup causes shared IP to stop responding

Jim Fri, 06 Aug 2010 02:25:45 -0700

Forgive me if this is a lengthy email; this is my first HA issue and I've
included some logs at the end.


For the sake of privacy, I've used dummy IPs here. My master is
192.168.1.101, my slave is 192.168.1.102, and the shared IP is 192.168.1.103.

My servers are at Rackspace and the configurations were done according to
this guide:

http://cloudservers.rackspacecloud.com/index.php/IP_Failover_-_Setup_and_Installing_Heartbeat

As of now I can replicate the following behavior (with heartbeat configured
to start with the server automatically):

1.) I shut down the slave node completely (server name failover at
192.168.1.102)
2.) I then reboot the master node (server name app1 at 192.168.1.101)
3.) After the reboot is finished the master begins responding to requests on
the shared IP
4.) I then start up the slave
5.) The master continues to respond to requests on shared IP
6.) I reboot the master
7.) The slave begins immediately responding to requests on the shared IP
while master is rebooting (failover seems to work)
8.) The master finishes rebooting
9.) Requests / pings to the shared IP begin to time out, neither server
responds
10.) I reboot the master again
11.) The slave begins responding to requests on the shared IP again while
master is rebooting
12.) When the master finishes rebooting, requests / pings to the shared IP
begin to time out again
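
To put numbers on the outage described in the steps above, a small probe run from a third machine can ping the shared IP at a fixed interval and report the windows in which it did not answer. This is only a sketch: the `ping -c 1 -W 1` flags assume the Linux iputils `ping`, and `probe`, `outage_windows`, `watch`, and the default interval are illustrative names and values, not part of the setup described here.

```python
import subprocess
import time


def probe(host, timeout_s=1):
    """Return True if one ICMP echo to host is answered.

    Assumes the Linux iputils ping (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (start, end) outage windows.

    A window opens at the first failed sample and closes at the next
    successful one. A window still open at the end gets end=None.
    """
    windows = []
    start = None
    for timestamp, ok in samples:
        if not ok and start is None:
            start = timestamp
        elif ok and start is not None:
            windows.append((start, timestamp))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


def watch(host, duration_s=60, interval_s=1.0):
    """Probe host for roughly duration_s seconds and return outage windows."""
    samples = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        samples.append((time.time(), probe(host)))
        time.sleep(interval_s)
    return outage_windows(samples)
```

Calling `watch("192.168.1.103", duration_s=300)` while rebooting the master would give the start and end of each unresponsive period, which can then be lined up against the timestamps in /var/log/ha-log on both nodes.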

I can repeat steps 10-12 over and over. Each time the master comes back
online the shared IP stops responding completely. However, this happens only
if heartbeat is set to start up with the server using "chkconfig heartbeat
on". If I have "chkconfig heartbeat off", then when the master finishes
rebooting, the slave continues responding to requests on the shared IP. It
is not until I start heartbeat that the shared IP stops responding. However,
a RESTART of heartbeat (either after starting it manually if chkconfig is
off, or just restarting it after it has started with the server
automatically if chkconfig is on) causes the master to begin responding to
requests on the shared IP. So, the first time heartbeat starts, either with
the server or if I do it manually, it causes the shared IP to stop
responding. A RESTART of heartbeat after the initial start causes the master
to begin responding to the IP again.

Rebooting or shutting down the slave when the IP stops responding does not
cause the master to start responding immediately. Instead, the shared IP
continues to time out. If I wait a few minutes after shutting down the
slave, the master eventually begins responding on the shared IP again.
However, it does not begin to respond immediately the way the slave does
when the master goes offline. Once the master begins responding again after
I've shut down the slave, I can then restart the slave and the master
continues to respond to requests until I reboot it, at which point the slave
begins to respond. However, as soon as the master finishes rebooting, the
shared IP becomes unresponsive again the first time heartbeat starts.
Restarting heartbeat causes the IP to begin to respond again on the master.

Put differently, shutting down the slave while the IP is not responding does
not cause the master to start responding again right away. Requests continue
to time out until the slave is restarted and starts responding again, or until
I wait long enough for the master to begin responding on its own while the
slave is shut down. Why is there such a delay before the master begins
responding again when the slave goes offline? It seems the only way to give
the master control again after a failover to the slave is to shut down the
slave completely, reboot the master and... wait (or restart heartbeat).
Rebooting the master after shutting down the slave completely still does not
cause the master to start responding to requests upon reboot. I still have
to just wait until it begins to respond again.

So, it seems that the issue is that when the master starts heartbeat the
first time something happens that makes the shared IP stop responding.
Stopping heartbeat on the master causes the slave to take over, and then
restarting heartbeat on the master gives control back to the master.

I also wanted to include what the logs show when I perform a heartbeat
restart on the master after a reboot.

If I set "chkconfig heartbeat off" and reboot the master, the following is
added to /var/log/ha-log on master as the server reboots:

heartbeat[3259]: 2010/08/06_01:31:41 info: Heartbeat shutdown in progress.
(3259)
heartbeat[3640]: 2010/08/06_01:31:41 info: Giving up all HA resources.
ResourceManager[3653]:  2010/08/06_01:31:41 info: Releasing resource group:
app1 192.168.1.103/24
ResourceManager[3653]:  2010/08/06_01:31:41 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.1.103/24 stop
IPaddr[3719]:   2010/08/06_01:31:41 INFO: ifconfig eth0:0 down
IPaddr[3693]:   2010/08/06_01:31:41 INFO:  Success
heartbeat[3640]: 2010/08/06_01:31:41 info: All HA resources relinquished.
heartbeat[3259]: 2010/08/06_01:31:42 WARN: 1 lost packet(s) for [failover]
[7412:7414]
heartbeat[3259]: 2010/08/06_01:31:42 info: No pkts missing from failover!
heartbeat[3259]: 2010/08/06_01:31:42 info: killing
/usr/lib64/heartbeat/ipfail process group 3282 with signal 15
heartbeat[3259]: 2010/08/06_01:31:43 info: killing HBFIFO process 3261 with
signal 15
heartbeat[3259]: 2010/08/06_01:31:43 info: killing HBWRITE process 3262 with
signal 15
heartbeat[3259]: 2010/08/06_01:31:43 info: killing HBREAD process 3263 with
signal 15
heartbeat[3259]: 2010/08/06_01:31:43 info: Core process 3261 exited. 3
remaining
heartbeat[3259]: 2010/08/06_01:31:43 info: Core process 3262 exited. 2
remaining
heartbeat[3259]: 2010/08/06_01:31:43 info: Core process 3263 exited. 1
remaining
heartbeat[3259]: 2010/08/06_01:31:43 info: app1 Heartbeat shutdown complete.

This is added to /var/log/ha-log on slave:

heartbeat[2682]: 2010/08/06_01:31:58 info: Received shutdown notice from
'app1'.
heartbeat[2682]: 2010/08/06_01:31:58 info: Resources being acquired from
app1.
heartbeat[9783]: 2010/08/06_01:31:58 info: acquire local HA resources
(standby).
heartbeat[9783]: 2010/08/06_01:31:58 info: local HA resource acquisition
completed (standby).
heartbeat[2682]: 2010/08/06_01:31:58 info: Standby resource acquisition done
[foreign].
heartbeat[9784]: 2010/08/06_01:31:58 info: No local resources
[/usr/share/heartbeat/ResourceManager listkeys failover] to acquire.
harc[9809]:     2010/08/06_01:31:58 info: Running /etc/ha.d/rc.d/status
status
mach_down[9825]:        2010/08/06_01:31:58 info: Taking over resource group
192.168.1.103/24
ResourceManager[9851]:  2010/08/06_01:31:58 info: Acquiring resource group:
app1 192.168.1.103/24
IPaddr[9878]:   2010/08/06_01:31:58 INFO:  Resource is stopped
ResourceManager[9851]:  2010/08/06_01:31:58 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.1.103/24 start
IPaddr[9972]:   2010/08/06_01:31:58 INFO: Using calculated nic for
192.168.1.103: eth0
IPaddr[9972]:   2010/08/06_01:31:58 INFO: Using calculated netmask for
192.168.1.103: 255.255.255.0
IPaddr[9972]:   2010/08/06_01:31:59 INFO: eval ifconfig eth0:0 192.168.1.103
netmask 255.255.255.0 broadcast 192.168.1.255
IPaddr[9946]:   2010/08/06_01:31:59 INFO:  Success
mach_down[9825]:        2010/08/06_01:31:59 info:
/usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[9825]:        2010/08/06_01:31:59 info: mach_down takeover
complete for node app1.
heartbeat[2682]: 2010/08/06_01:31:59 info: mach_down takeover complete.
heartbeat[2682]: 2010/08/06_01:32:15 WARN: node app1: is dead
heartbeat[2682]: 2010/08/06_01:32:15 info: Dead node app1 gave up resources.
ipfail[2747]: 2010/08/06_01:32:15 info: Status update: Node app1 now has
status dead
heartbeat[2682]: 2010/08/06_01:32:15 info: Link app1:eth1 dead.
ipfail[2747]: 2010/08/06_01:32:15 info: NS: We are dead. :<
ipfail[2747]: 2010/08/06_01:32:16 info: Link Status update: Link app1/eth1
now has status dead
ipfail[2747]: 2010/08/06_01:32:17 info: We are dead. :<
ipfail[2747]: 2010/08/06_01:32:17 info: Asking other side for ping node
count.

When I try the first start of heartbeat on master after reboot (chkconfig
for heartbeat off), the following is shown:

Starting High-Availability services:
2010/08/06_01:36:05 INFO:  Running OK
2010/08/06_01:36:05 CRITICAL: Resource 192.168.1.103/24 is active, and
should not be!
2010/08/06_01:36:05 CRITICAL: Non-idle resources can affect data integrity!
2010/08/06_01:36:05 info: If you don't know what this means, then get help!
2010/08/06_01:36:05 info: Read the docs and/or source to
/usr/share/heartbeat/ResourceManager for more details.
CRITICAL: Resource 192.168.1.103/24 is active, and should not be!
CRITICAL: Non-idle resources can affect data integrity!
info: If you don't know what this means, then get help!
info: Read the docs and/or the source to
/usr/share/heartbeat/ResourceManager for more details.
2010/08/06_01:36:05 CRITICAL: Non-idle resources will affect resource
takeback!
2010/08/06_01:36:05 CRITICAL: Non-idle resources may affect data integrity!
                                                           [  OK  ]
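
The CRITICAL lines above report that 192.168.1.103 was found active on the master when heartbeat started. One way to confirm that independently, before starting heartbeat, is to look for the address in the kernel's address list. The following is a minimal sketch, assuming the iproute2 `ip` command is available on the master; `devices_holding` and `shared_ip_present` are illustrative names.

```python
import subprocess


def devices_holding(address, ip_addr_output):
    """Return the devices on which `address` is configured.

    `ip_addr_output` is the text printed by `ip -o -4 addr show`,
    which puts one address per line in the order:
    index, device, "inet", ADDRESS/PREFIX, ...
    """
    devices = []
    for line in ip_addr_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "inet":
            # Drop the /PREFIX part before comparing.
            if fields[3].split("/")[0] == address:
                devices.append(fields[1])
    return devices


def shared_ip_present(address):
    """Run `ip -o -4 addr show` locally and report devices holding address."""
    output = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return devices_holding(address, output)
```

Running `shared_ip_present("192.168.1.103")` on the master right after a reboot, with heartbeat still stopped, would show whether something other than heartbeat has already brought the address up. An empty list would mean the address is not configured at that point.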

And the following is added to /var/log/ha-log on master:

heartbeat[2879]: 2010/08/06_01:36:05 WARN: Logging daemon is disabled
--enabling logging daemon is recommended
heartbeat[2879]: 2010/08/06_01:36:05 info: Version 2 support: false
heartbeat[2879]: 2010/08/06_01:36:05 WARN: Logging daemon is disabled
--enabling logging daemon is recommended
heartbeat[2879]: 2010/08/06_01:36:05 info: **************************
heartbeat[2879]: 2010/08/06_01:36:05 info: Configuration validated. Starting
heartbeat 2.1.3
heartbeat[2880]: 2010/08/06_01:36:05 info: heartbeat: version 2.1.3
heartbeat[2880]: 2010/08/06_01:36:05 info: Heartbeat generation: 1280995153
heartbeat[2880]: 2010/08/06_01:36:05 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth1
heartbeat[2880]: 2010/08/06_01:36:05 info: glib: ucast: bound send socket to
device: eth1
heartbeat[2880]: 2010/08/06_01:36:05 info: glib: ucast: bound receive socket
to device: eth1
heartbeat[2880]: 2010/08/06_01:36:05 info: glib: ucast: started on port 694
interface eth1 to 10.179.80.55
heartbeat[2880]: 2010/08/06_01:36:05 info: G_main_add_TriggerHandler: Added
signal manual handler
heartbeat[2880]: 2010/08/06_01:36:05 info: G_main_add_TriggerHandler: Added
signal manual handler
heartbeat[2880]: 2010/08/06_01:36:05 info: G_main_add_SignalHandler: Added
signal handler for signal 17
heartbeat[2880]: 2010/08/06_01:36:05 info: Local status now set to: 'up'
heartbeat[2880]: 2010/08/06_01:36:06 info: Link failover:eth1 up.
heartbeat[2880]: 2010/08/06_01:36:06 info: Comm_now_up(): updating status to
active
heartbeat[2880]: 2010/08/06_01:36:06 info: Local status now set to: 'active'
heartbeat[2880]: 2010/08/06_01:36:06 info: Starting child client
"/usr/lib64/heartbeat/ipfail" (498,496)
heartbeat[2888]: 2010/08/06_01:36:06 info: Starting
"/usr/lib64/heartbeat/ipfail" as uid 498  gid 496 (pid 2888)
heartbeat[2880]: 2010/08/06_01:36:07 info: Status update for node failover:
status active
heartbeat[2880]: 2010/08/06_01:36:07 info: remote resource transition
completed.
heartbeat[2880]: 2010/08/06_01:36:07 info: remote resource transition
completed.
heartbeat[2880]: 2010/08/06_01:36:07 info: Local Resource acquisition
completed. (none)
harc[2891]: 2010/08/06_01:36:07 info: Running /etc/ha.d/rc.d/status status
heartbeat[2880]: 2010/08/06_01:36:07 info: failover wants to go standby
[foreign]
heartbeat[2880]: 2010/08/06_01:36:08 info: standby: acquire [foreign]
resources from failover
heartbeat[2907]: 2010/08/06_01:36:08 info: acquire local HA resources
(standby).
ResourceManager[2920]: 2010/08/06_01:36:08 info: Acquiring resource group:
app1 192.168.1.103/24
IPaddr[2947]: 2010/08/06_01:36:09 INFO:  Running OK
heartbeat[2907]: 2010/08/06_01:36:09 info: local HA resource acquisition
completed (standby).
heartbeat[2880]: 2010/08/06_01:36:09 info: Standby resource acquisition done
[foreign].
heartbeat[2880]: 2010/08/06_01:36:09 info: Initial resource acquisition
complete (auto_failback)
heartbeat[2880]: 2010/08/06_01:36:09 info: remote resource transition
completed.
ipfail[2888]: 2010/08/06_01:36:10 info: Status update: Node failover now has
status active
ipfail[2888]: 2010/08/06_01:36:14 info: Ping node count is balanced.
ipfail[2888]: 2010/08/06_01:36:14 info: Giving up foreign resources
(auto_failback).
ipfail[2888]: 2010/08/06_01:36:14 info: Delayed giveup in 4 seconds.
ipfail[2888]: 2010/08/06_01:36:18 info: giveup() called (timeout worked)
heartbeat[2880]: 2010/08/06_01:36:19 info: app1 wants to go standby
[foreign]
heartbeat[2880]: 2010/08/06_01:36:19 info: standby: failover can take our
foreign resources
heartbeat[2992]: 2010/08/06_01:36:19 info: give up foreign HA resources
(standby).
heartbeat[2992]: 2010/08/06_01:36:19 info: foreign HA resource release
completed (standby).
heartbeat[2880]: 2010/08/06_01:36:19 info: Local standby process completed
[foreign].
heartbeat[2880]: 2010/08/06_01:36:20 WARN: 1 lost packet(s) for [failover]
[7566:7568]
heartbeat[2880]: 2010/08/06_01:36:20 info: remote resource transition
completed.
heartbeat[2880]: 2010/08/06_01:36:20 info: No pkts missing from failover!
heartbeat[2880]: 2010/08/06_01:36:20 info: Other node completed standby
takeover of foreign resources.

The following is added to /var/log/ha-log on slave:

heartbeat[2682]: 2010/08/06_01:36:22 info: Heartbeat restart on node app1
heartbeat[2682]: 2010/08/06_01:36:22 info: Link app1:eth1 up.
heartbeat[2682]: 2010/08/06_01:36:22 info: Status update for node app1:
status init
ipfail[2747]: 2010/08/06_01:36:22 info: Link Status update: Link app1/eth1
now has status up
heartbeat[2682]: 2010/08/06_01:36:22 info: Status update for node app1:
status up
ipfail[2747]: 2010/08/06_01:36:22 info: Status update: Node app1 now has
status init
ipfail[2747]: 2010/08/06_01:36:22 info: Status update: Node app1 now has
status up
harc[10090]: 2010/08/06_01:36:22 info: Running /etc/ha.d/rc.d/status status
harc[10106]: 2010/08/06_01:36:22 info: Running /etc/ha.d/rc.d/status status
heartbeat[2682]: 2010/08/06_01:36:23 info: Status update for node app1:
status active
ipfail[2747]: 2010/08/06_01:36:23 info: Status update: Node app1 now has
status active
harc[10122]: 2010/08/06_01:36:23 info: Running /etc/ha.d/rc.d/status status
heartbeat[2682]: 2010/08/06_01:36:23 info: remote resource transition
completed.
heartbeat[2682]: 2010/08/06_01:36:23 info: failover wants to go standby
[foreign]
heartbeat[2682]: 2010/08/06_01:36:24 info: standby: app1 can take our
foreign resources
heartbeat[10138]: 2010/08/06_01:36:24 info: give up foreign HA resources
(standby).
ipfail[2747]: 2010/08/06_01:36:24 info: Asking other side for ping node
count.
ResourceManager[10151]: 2010/08/06_01:36:24 info: Releasing resource group:
app1 192.168.1.103/24
ResourceManager[10151]: 2010/08/06_01:36:24 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.1.103/24 stop
IPaddr[10217]: 2010/08/06_01:36:24 INFO: ifconfig eth0:0 down
IPaddr[10191]: 2010/08/06_01:36:24 INFO:  Success
heartbeat[10138]: 2010/08/06_01:36:24 info: foreign HA resource release
completed (standby).
heartbeat[2682]: 2010/08/06_01:36:24 info: Local standby process completed
[foreign].
heartbeat[2682]: 2010/08/06_01:36:25 WARN: 1 lost packet(s) for [app1]
[11:13]
heartbeat[2682]: 2010/08/06_01:36:25 info: remote resource transition
completed.
heartbeat[2682]: 2010/08/06_01:36:25 info: No pkts missing from app1!
heartbeat[2682]: 2010/08/06_01:36:25 info: Other node completed standby
takeover of foreign resources.
ipfail[2747]: 2010/08/06_01:36:31 info: No giveup timer to abort.
heartbeat[2682]: 2010/08/06_01:36:36 info: app1 wants to go standby
[foreign]
heartbeat[2682]: 2010/08/06_01:36:36 info: standby: acquire [foreign]
resources from app1
heartbeat[10247]: 2010/08/06_01:36:36 info: acquire local HA resources
(standby).
heartbeat[10247]: 2010/08/06_01:36:36 info: local HA resource acquisition
completed (standby).
heartbeat[2682]: 2010/08/06_01:36:36 info: Standby resource acquisition done
[foreign].
heartbeat[2682]: 2010/08/06_01:36:37 info: remote resource transition
completed.

When RESTARTING heartbeat on the master, the master takes over and begins
responding to requests. The following is added to the /var/log/ha-log on the
master:

heartbeat[2880]: 2010/08/06_01:36:20 info: Other node completed standby
takeover of foreign resources.
heartbeat[2880]: 2010/08/06_01:40:19 info: Heartbeat shutdown in progress.
(2880)
heartbeat[3052]: 2010/08/06_01:40:19 info: Giving up all HA resources.
ResourceManager[3065]: 2010/08/06_01:40:19 info: Releasing resource group:
app1 192.168.1.103/24
ResourceManager[3065]: 2010/08/06_01:40:19 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.1.103/24 stop
IPaddr[3131]: 2010/08/06_01:40:19 INFO: ifconfig eth0:1 down
IPaddr[3105]: 2010/08/06_01:40:19 INFO:  Success
heartbeat[3052]: 2010/08/06_01:40:19 info: All HA resources relinquished.
heartbeat[2880]: 2010/08/06_01:40:20 WARN: 1 lost packet(s) for [failover]
[7688:7690]
heartbeat[2880]: 2010/08/06_01:40:20 info: No pkts missing from failover!
heartbeat[2880]: 2010/08/06_01:40:20 info: killing
/usr/lib64/heartbeat/ipfail process group 2888 with signal 15
heartbeat[2880]: 2010/08/06_01:40:22 info: killing HBFIFO process 2883 with
signal 15
heartbeat[2880]: 2010/08/06_01:40:22 info: killing HBWRITE process 2884 with
signal 15
heartbeat[2880]: 2010/08/06_01:40:22 info: killing HBREAD process 2885 with
signal 15
heartbeat[2880]: 2010/08/06_01:40:22 info: Core process 2884 exited. 3
remaining
heartbeat[2880]: 2010/08/06_01:40:22 info: Core process 2883 exited. 2
remaining
heartbeat[2880]: 2010/08/06_01:40:22 info: Core process 2885 exited. 1
remaining
heartbeat[2880]: 2010/08/06_01:40:22 info: app1 Heartbeat shutdown complete.
heartbeat[3250]: 2010/08/06_01:40:48 WARN: Logging daemon is disabled
--enabling logging daemon is recommended
heartbeat[3250]: 2010/08/06_01:40:48 info: Version 2 support: false
heartbeat[3250]: 2010/08/06_01:40:48 WARN: Logging daemon is disabled
--enabling logging daemon is recommended
heartbeat[3250]: 2010/08/06_01:40:48 info: **************************
heartbeat[3250]: 2010/08/06_01:40:48 info: Configuration validated. Starting
heartbeat 2.1.3
heartbeat[3251]: 2010/08/06_01:40:48 info: heartbeat: version 2.1.3
heartbeat[3251]: 2010/08/06_01:40:48 info: Heartbeat generation: 1280995154
heartbeat[3251]: 2010/08/06_01:40:48 info: glib: ucast: write socket
priority set to IPTOS_LOWDELAY on eth1
heartbeat[3251]: 2010/08/06_01:40:48 info: glib: ucast: bound send socket to
device: eth1
heartbeat[3251]: 2010/08/06_01:40:48 info: glib: ucast: bound receive socket
to device: eth1
heartbeat[3251]: 2010/08/06_01:40:48 info: glib: ucast: started on port 694
interface eth1 to 10.179.80.55
heartbeat[3251]: 2010/08/06_01:40:48 info: G_main_add_TriggerHandler: Added
signal manual handler
heartbeat[3251]: 2010/08/06_01:40:48 info: G_main_add_TriggerHandler: Added
signal manual handler
heartbeat[3251]: 2010/08/06_01:40:48 info: G_main_add_SignalHandler: Added
signal handler for signal 17
heartbeat[3251]: 2010/08/06_01:40:48 info: Local status now set to: 'up'
heartbeat[3251]: 2010/08/06_01:40:49 info: Link failover:eth1 up.
heartbeat[3251]: 2010/08/06_01:40:49 info: Comm_now_up(): updating status to
active
heartbeat[3251]: 2010/08/06_01:40:49 info: Local status now set to: 'active'
heartbeat[3251]: 2010/08/06_01:40:49 info: Starting child client
"/usr/lib64/heartbeat/ipfail" (498,496)
heartbeat[3258]: 2010/08/06_01:40:49 info: Starting
"/usr/lib64/heartbeat/ipfail" as uid 498  gid 496 (pid 3258)
heartbeat[3251]: 2010/08/06_01:40:50 info: remote resource transition
completed.
heartbeat[3251]: 2010/08/06_01:40:50 info: remote resource transition
completed.
heartbeat[3251]: 2010/08/06_01:40:50 info: Local Resource acquisition
completed. (none)
heartbeat[3251]: 2010/08/06_01:40:50 info: Status update for node failover:
status active
heartbeat[3251]: 2010/08/06_01:40:50 info: failover wants to go standby
[foreign]
harc[3261]: 2010/08/06_01:40:50 info: Running /etc/ha.d/rc.d/status status
heartbeat[3251]: 2010/08/06_01:40:51 info: standby: acquire [foreign]
resources from failover
heartbeat[3277]: 2010/08/06_01:40:51 info: acquire local HA resources
(standby).
ResourceManager[3290]: 2010/08/06_01:40:51 info: Acquiring resource group:
app1 192.168.1.103/24
IPaddr[3317]: 2010/08/06_01:40:51 INFO:  Resource is stopped
ResourceManager[3290]: 2010/08/06_01:40:51 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.1.103/24 start
IPaddr[3411]: 2010/08/06_01:40:52 INFO: Using calculated nic for
192.168.1.103: eth0
IPaddr[3411]: 2010/08/06_01:40:52 INFO: Using calculated netmask for
192.168.1.103: 255.255.255.0
IPaddr[3411]: 2010/08/06_01:40:52 INFO: eval ifconfig eth0:0 192.168.1.103
netmask 255.255.255.0 broadcast 192.168.1.255
IPaddr[3385]: 2010/08/06_01:40:52 INFO:  Success
heartbeat[3277]: 2010/08/06_01:40:52 info: local HA resource acquisition
completed (standby).
heartbeat[3251]: 2010/08/06_01:40:52 info: Standby resource acquisition done
[foreign].
heartbeat[3251]: 2010/08/06_01:40:52 info: Initial resource acquisition
complete (auto_failback)
heartbeat[3251]: 2010/08/06_01:40:52 info: remote resource transition
completed.
ipfail[3258]: 2010/08/06_01:40:53 info: Status update: Node failover now has
status active
ipfail[3258]: 2010/08/06_01:40:56 info: Ping node count is balanced.
ipfail[3258]: 2010/08/06_01:40:56 info: Giving up foreign resources
(auto_failback).
ipfail[3258]: 2010/08/06_01:40:56 info: Delayed giveup in 4 seconds.
ipfail[3258]: 2010/08/06_01:41:00 info: giveup() called (timeout worked)
heartbeat[3251]: 2010/08/06_01:41:01 info: app1 wants to go standby
[foreign]
heartbeat[3251]: 2010/08/06_01:41:01 info: standby: failover can take our
foreign resources
heartbeat[3512]: 2010/08/06_01:41:01 info: give up foreign HA resources
(standby).
heartbeat[3512]: 2010/08/06_01:41:01 info: foreign HA resource release
completed (standby).
heartbeat[3251]: 2010/08/06_01:41:01 info: Local standby process completed
[foreign].
heartbeat[3251]: 2010/08/06_01:41:02 WARN: 1 lost packet(s) for [failover]
[7725:7727]
heartbeat[3251]: 2010/08/06_01:41:02 info: remote resource transition
completed.
heartbeat[3251]: 2010/08/06_01:41:02 info: No pkts missing from failover!
heartbeat[3251]: 2010/08/06_01:41:02 info: Other node completed standby
takeover of foreign resources.
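
The two master start-ups quoted above are easier to compare when only the IPaddr entries are kept. The sketch below is one way to do that, assuming each log entry sits on a single line as in the original /var/log/ha-log (the mail client has wrapped the long lines in this message). `parse_line` and `ipaddr_events` are illustrative names, and the sample lines are re-joined from the logs above.

```python
import re

# Expected shape: name[pid]: YYYY/MM/DD_HH:MM:SS level: message
LINE_RE = re.compile(
    r"^(?P<source>[\w.-]+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<stamp>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>\w+):\s*(?P<message>.*)$"
)


def parse_line(line):
    """Split one ha-log line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["pid"] = int(entry["pid"])
    return entry


def ipaddr_events(lines):
    """Return (timestamp, message) pairs for entries logged by IPaddr."""
    events = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry["source"] == "IPaddr":
            events.append((entry["stamp"], entry["message"]))
    return events


# Lines taken from the master's first start after reboot.
first_start = [
    "ResourceManager[2920]: 2010/08/06_01:36:08 info: Acquiring resource group: app1 192.168.1.103/24",
    "IPaddr[2947]: 2010/08/06_01:36:09 INFO:  Running OK",
]

# Lines taken from the master's restart of heartbeat.
restart = [
    "IPaddr[3317]: 2010/08/06_01:40:51 INFO:  Resource is stopped",
    "IPaddr[3411]: 2010/08/06_01:40:52 INFO: eval ifconfig eth0:0 192.168.1.103 netmask 255.255.255.0 broadcast 192.168.1.255",
    "IPaddr[3385]: 2010/08/06_01:40:52 INFO:  Success",
]

for label, lines in (("first start", first_start), ("restart", restart)):
    for stamp, message in ipaddr_events(lines):
        print(label, stamp, message)
```

With these sample lines, the first start yields a single IPaddr entry ("Running OK"), while the restart yields "Resource is stopped" followed by the `ifconfig eth0:0` command and "Success". The same function can be run over the full log files on both nodes.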

The /var/log/ha-log on the slave gets:

heartbeat[2682]: 2010/08/06_01:40:36 info: Received shutdown notice from
'app1'.
heartbeat[2682]: 2010/08/06_01:40:36 info: Resources being acquired from
app1.
heartbeat[10278]: 2010/08/06_01:40:36 info: acquire local HA resources
(standby).
heartbeat[10278]: 2010/08/06_01:40:36 info: local HA resource acquisition
completed (standby).
heartbeat[10279]: 2010/08/06_01:40:36 info: No local resources
[/usr/share/heartbeat/ResourceManager listkeys failover] to acquire.
heartbeat[2682]: 2010/08/06_01:40:36 info: Standby resource acquisition done
[foreign].
harc[10304]: 2010/08/06_01:40:36 info: Running /etc/ha.d/rc.d/status status
mach_down[10320]: 2010/08/06_01:40:36 info: Taking over resource group
192.168.1.103/24
ResourceManager[10346]: 2010/08/06_01:40:36 info: Acquiring resource group:
app1 192.168.1.103/24
IPaddr[10373]: 2010/08/06_01:40:36 INFO:  Resource is stopped
ResourceManager[10346]: 2010/08/06_01:40:36 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.1.103/24 start
IPaddr[10467]: 2010/08/06_01:40:37 INFO: Using calculated nic for
192.168.1.103: eth0
IPaddr[10467]: 2010/08/06_01:40:37 INFO: Using calculated netmask for
192.168.1.103: 255.255.255.0
IPaddr[10467]: 2010/08/06_01:40:37 INFO: eval ifconfig eth0:0 192.168.1.103
netmask 255.255.255.0 broadcast 192.168.1.255
IPaddr[10441]: 2010/08/06_01:40:37 INFO:  Success
mach_down[10320]: 2010/08/06_01:40:37 info: /usr/share/heartbeat/mach_down:
nice_failback: foreign resources acquired
mach_down[10320]: 2010/08/06_01:40:37 info: mach_down takeover complete for
node app1.
heartbeat[2682]: 2010/08/06_01:40:37 info: mach_down takeover complete.
heartbeat[2682]: 2010/08/06_01:40:53 WARN: node app1: is dead
heartbeat[2682]: 2010/08/06_01:40:53 info: Dead node app1 gave up resources.
ipfail[2747]: 2010/08/06_01:40:53 info: Status update: Node app1 now has
status dead
heartbeat[2682]: 2010/08/06_01:40:54 info: Link app1:eth1 dead.
ipfail[2747]: 2010/08/06_01:40:54 info: NS: We are dead. :<
ipfail[2747]: 2010/08/06_01:40:54 info: Link Status update: Link app1/eth1
now has status dead
ipfail[2747]: 2010/08/06_01:40:55 info: We are dead. :<
ipfail[2747]: 2010/08/06_01:40:55 info: Asking other side for ping node
count.
heartbeat[2682]: 2010/08/06_01:41:05 info: Heartbeat restart on node app1
heartbeat[2682]: 2010/08/06_01:41:05 info: Link app1:eth1 up.
heartbeat[2682]: 2010/08/06_01:41:05 info: Status update for node app1:
status init
ipfail[2747]: 2010/08/06_01:41:05 info: Link Status update: Link app1/eth1
now has status up
heartbeat[2682]: 2010/08/06_01:41:05 info: Status update for node app1:
status up
ipfail[2747]: 2010/08/06_01:41:05 info: Status update: Node app1 now has
status init
ipfail[2747]: 2010/08/06_01:41:05 info: Status update: Node app1 now has
status up
harc[10580]: 2010/08/06_01:41:05 info: Running /etc/ha.d/rc.d/status status
harc[10596]: 2010/08/06_01:41:05 info: Running /etc/ha.d/rc.d/status status
heartbeat[2682]: 2010/08/06_01:41:06 info: Status update for node app1:
status active
ipfail[2747]: 2010/08/06_01:41:06 info: Status update: Node app1 now has
status active
harc[10612]: 2010/08/06_01:41:06 info: Running /etc/ha.d/rc.d/status status
heartbeat[2682]: 2010/08/06_01:41:06 info: remote resource transition
completed.
heartbeat[2682]: 2010/08/06_01:41:06 info: failover wants to go standby
[foreign]
ipfail[2747]: 2010/08/06_01:41:06 info: Asking other side for ping node
count.
heartbeat[2682]: 2010/08/06_01:41:07 info: standby: app1 can take our
foreign resources
heartbeat[10628]: 2010/08/06_01:41:07 info: give up foreign HA resources
(standby).
ResourceManager[10641]: 2010/08/06_01:41:07 info: Releasing resource group:
app1 192.168.1.103/24
ResourceManager[10641]: 2010/08/06_01:41:07 info: Running
/etc/ha.d/resource.d/IPaddr 192.168.1.103/24 stop
IPaddr[10707]: 2010/08/06_01:41:07 INFO: ifconfig eth0:0 down
IPaddr[10681]: 2010/08/06_01:41:07 INFO:  Success
heartbeat[10628]: 2010/08/06_01:41:07 info: foreign HA resource release
completed (standby).
heartbeat[2682]: 2010/08/06_01:41:07 info: Local standby process completed
[foreign].
heartbeat[2682]: 2010/08/06_01:41:08 WARN: 1 lost packet(s) for [app1]
[11:13]
heartbeat[2682]: 2010/08/06_01:41:08 info: remote resource transition
completed.
heartbeat[2682]: 2010/08/06_01:41:08 info: No pkts missing from app1!
heartbeat[2682]: 2010/08/06_01:41:08 info: Other node completed standby
takeover of foreign resources.
ipfail[2747]: 2010/08/06_01:41:13 info: No giveup timer to abort.
heartbeat[2682]: 2010/08/06_01:41:18 info: app1 wants to go standby
[foreign]
heartbeat[2682]: 2010/08/06_01:41:18 info: standby: acquire [foreign]
resources from app1
heartbeat[10737]: 2010/08/06_01:41:18 info: acquire local HA resources
(standby).
heartbeat[10737]: 2010/08/06_01:41:18 info: local HA resource acquisition
completed (standby).
heartbeat[2682]: 2010/08/06_01:41:18 info: Standby resource acquisition done
[foreign].
heartbeat[2682]: 2010/08/06_01:41:19 info: remote resource transition
completed.
