[Linux-HA] secondary does not replace primary if it's booted while primary is off.

Scott Edwards Tue, 16 Dec 2008 03:51:48 -0800

I've tested a few different scenarios, and there's one that's got me
perplexed.  I started by reading http://www.linuxjournal.com/article/5862.
I've also used the boilerplate docs in
/usr/share/doc/heartbeat-2/GettingStarted.html, and searched the mailing
lists through Google, in an attempt to understand how to make HA behave the
way I would hope and expect.


I'm testing this on two fairly similar pieces of hardware (no
virtualization etc.).  Each system has at least a 2 GHz CPU and 512 MB of
RAM.  As you'll see below, I'm using a null modem cable too.  I'm testing
this to prove the setup before it's deployed.  For the time being I'm only
running a lean web server, just to have some service to probe and test.

Expected: (at least by me)

1. When only the primary is booted, it takes the resources just fine.
2. When both systems are running and primary is active, secondary can be
shut down cleanly (init 0) or abruptly (sysrq u, s, o: unmount, sync, off),
and primary stays active.
3. When secondary is active, primary takes over as soon as it can.

Unexpected: (again, my perspective)

When the primary is off, and the secondary is booted, it will not take
resources.

1. primary: init 0
2. secondary: init 6

After these steps, I want the secondary to assume the active role (even if
it takes 20 seconds or so).  My continuous ping against master.example.com
has gone unanswered for fifteen minutes and counting, so I don't think the
secondary will ever become active.
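
To put a number on the takeover time instead of watching a continuous ping,
a small retry loop could be used.  This is only a sketch: wait_for_up is a
made-up helper name, not something shipped with heartbeat, and the ping
options in the example assume the Linux iputils ping.

```sh
#!/bin/sh
# wait_for_up is a hypothetical helper, not part of heartbeat.
# Usage: wait_for_up MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds and reports how many attempts were needed.
wait_for_up() {
    max=$1; delay=$2; shift 2
    n=1
    while [ "$n" -le "$max" ]; do
        # Any command that exits 0 counts as "the service is up".
        if "$@" >/dev/null 2>&1; then
            echo "up after $n attempt(s)"
            return 0
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    echo "still down after $max attempt(s)"
    return 1
}

# Example: probe the service address once per second, for up to 60 tries.
# wait_for_up 60 1 ping -c 1 -W 1 10.141.2.7
```

Started right after "secondary: init 6", the attempt count gives a rough
takeover time in seconds.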

Here are the related config files mentioned in the FAQ (and a few others).

The systems are running Debian Etch.

secondary:~# dpkg -l heartb\*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Installed/Config-files/Unpacked/Failed-config/Half-installed
|/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
||/ Name                        Version                     Description
+++-===========================-===========================-======================================================================
un  heartbeat                   <none>                      (no description available)
ii  heartbeat-2                 2.0.7-2                     Subsystem for High-Availability Linux

secondary:~# cat /etc/ha.d/ha.cf
serial          /dev/ttyS1
watchdog        /dev/watchdog
debugfile       /var/log/ha-debug
logfile         /var/log/ha-log
logfacility     local0
keepalive       2
deadtime        10
udpport         694
bcast           eth0
node            primary
node            secondary
ping            10.141.0.1
auto_failback   on

secondary:~# cat /etc/ha.d/haresources
primary 10.141.2.7 nginx
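
Since haresources names nginx as a resource, heartbeat drives it through
/etc/init.d/nginx start and stop (this is visible in the ha-log further
down).  To see what exit codes that script hands back when run by hand,
something like the following could be used.  It is a sketch only, and
check_rc is a made-up helper name.

```sh
#!/bin/sh
# check_rc is a hypothetical helper for manual testing.
# Usage: check_rc SCRIPT ACTION...
# Runs SCRIPT once per ACTION and prints the exit status of each call.
check_rc() {
    script=$1; shift
    for action in "$@"; do
        rc=0
        # Capture the status without aborting if the caller uses "set -e".
        "$script" "$action" >/dev/null 2>&1 || rc=$?
        echo "$action rc=$rc"
    done
}

# Example (as root, with heartbeat stopped):
# check_rc /etc/init.d/nginx start start stop
```

Calling start twice in the example is deliberate: it shows whether the
script returns non-zero when the service is already running.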

secondary:~# cat /etc/hosts | sed s/not-important/example/g
127.0.0.1       localhost
10.141.0.1      router.example.com        router
10.141.2.7      master.example.com      master
10.141.2.8      primary.example.com     primary
10.141.2.9      secondary.example.com   secondary

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

secondary:~# uname -n
secondary
secondary:~# hostname
secondary
secondary:~# hostname -f | sed s/not-important/example/
secondary.example.com
secondary:~# cat /etc/resolv.conf
search example.com
nameserver 10.141.0.1

secondary:~# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:E0:18:BE:0E:51
          inet addr:10.141.2.9  Bcast:10.141.7.255  Mask:255.255.248.0
          inet6 addr: fe80::2e0:18ff:febe:e51/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3623 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2784 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:663897 (648.3 KiB)  TX bytes:827959 (808.5 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:168 (168.0 b)  TX bytes:168 (168.0 b)

On boot, the secondary takes the active role for just a few seconds, then
drops it again:

From 10.141.10.1 icmp_seq=25263 Destination Host Unreachable
64 bytes from 10.141.2.7: icmp_seq=25264 ttl=63 time=539 ms
64 bytes from 10.141.2.7: icmp_seq=25265 ttl=63 time=0.741 ms
64 bytes from 10.141.2.7: icmp_seq=25266 ttl=63 time=0.764 ms
From 10.141.10.1 icmp_seq=25274 Destination Host Unreachable

For this post, I shut down heartbeat, rotated the logs, and rebooted (to
keep the log short).  If the older logs are really needed, I can attach
them in a follow-up reply.

secondary:~# cat /var/log/ha-log
heartbeat[2203]: 2008/12/16_02:25:51 WARN: Logging daemon is disabled --enabling logging daemon is recommended
heartbeat[2203]: 2008/12/16_02:25:51 info: **************************
heartbeat[2203]: 2008/12/16_02:25:51 info: Configuration validated. Starting heartbeat 2.0.7
heartbeat[2204]: 2008/12/16_02:25:51 info: heartbeat: version 2.0.7
heartbeat[2204]: 2008/12/16_02:25:58 info: Heartbeat generation: 8
heartbeat[2204]: 2008/12/16_02:25:58 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[2204]: 2008/12/16_02:25:58 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[2204]: 2008/12/16_02:25:58 info: Removing /var/run/heartbeat/rsctmp failed, recreating.
heartbeat[2204]: 2008/12/16_02:25:58 info: glib: Starting serial heartbeat on tty /dev/ttyS1 (19200 baud)
heartbeat[2204]: 2008/12/16_02:25:58 info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth0
heartbeat[2204]: 2008/12/16_02:25:58 info: glib: UDP Broadcast heartbeat closed on port 694 interface eth0 - Status: 1
heartbeat[2204]: 2008/12/16_02:25:58 info: glib: ping heartbeat started.
heartbeat[2204]: 2008/12/16_02:25:58 notice: Using watchdog device: /dev/watchdog
heartbeat[2204]: 2008/12/16_02:25:58 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[2204]: 2008/12/16_02:25:58 info: Local status now set to: 'up'
heartbeat[2204]: 2008/12/16_02:25:59 info: Link 10.141.0.1:10.141.0.1 up.
heartbeat[2204]: 2008/12/16_02:25:59 info: Status update for node 10.141.0.1: status ping
heartbeat[2204]: 2008/12/16_02:25:59 info: Link secondary:eth0 up.
heartbeat[2204]: 2008/12/16_02:26:18 WARN: node primary: is dead
heartbeat[2204]: 2008/12/16_02:26:18 info: Comm_now_up(): updating status to active
heartbeat[2204]: 2008/12/16_02:26:18 info: Local status now set to: 'active'
heartbeat[2204]: 2008/12/16_02:26:18 WARN: No STONITH device configured.
heartbeat[2204]: 2008/12/16_02:26:18 WARN: Shared disks are not protected.
heartbeat[2204]: 2008/12/16_02:26:18 info: Resources being acquired from primary.
harc[2272]:     2008/12/16_02:26:18 info: Running /etc/ha.d/rc.d/status status
heartbeat[2273]: 2008/12/16_02:26:18 info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys secondary] to acquire.
mach_down[2292]:        2008/12/16_02:26:19 info: Taking over resource group 10.141.2.7
ResourceManager[2312]:  2008/12/16_02:26:19 info: Acquiring resource group: primary 10.141.2.7 nginx
IPaddr[2336]:   2008/12/16_02:26:19 INFO: IPaddr Resource is stopped
ResourceManager[2312]:  2008/12/16_02:26:19 info: Running /etc/ha.d/resource.d/IPaddr 10.141.2.7 start
IPaddr[2513]:   2008/12/16_02:26:19 INFO: eval /sbin/ifconfig eth0:0 10.141.2.7 netmask 255.255.248.0 broadcast 10.141.7.255
IPaddr[2513]:   2008/12/16_02:26:19 INFO: Sending Gratuitous Arp for 10.141.2.7 on eth0:0 [eth0]
IPaddr[2513]:   2008/12/16_02:26:19 INFO: /usr/lib/heartbeat/send_arp -i 500 -r 10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-10.141.2.7 eth0 10.141.2.7 auto 10.141.2.7 ffffffffffff
IPaddr[2443]:   2008/12/16_02:26:19 INFO: IPaddr Success
ResourceManager[2312]:  2008/12/16_02:26:19 info: Running /etc/init.d/nginx start
ResourceManager[2312]:  2008/12/16_02:26:19 ERROR: Return code 1 from /etc/init.d/nginx
ResourceManager[2312]:  2008/12/16_02:26:19 CRIT: Giving up resources due to failure of nginx
ResourceManager[2312]:  2008/12/16_02:26:19 info: Releasing resource group: primary 10.141.2.7 nginx
ResourceManager[2312]:  2008/12/16_02:26:19 info: Running /etc/init.d/nginx stop
ResourceManager[2312]:  2008/12/16_02:26:19 info: Running /etc/ha.d/resource.d/IPaddr 10.141.2.7 stop
IPaddr[2750]:   2008/12/16_02:26:19 INFO: /sbin/route -n del -host 10.141.2.7
IPaddr[2750]:   2008/12/16_02:26:19 INFO: /sbin/ifconfig eth0:0 10.141.2.7 down
IPaddr[2750]:   2008/12/16_02:26:19 INFO: IP Address 10.141.2.7 released
IPaddr[2680]:   2008/12/16_02:26:19 INFO: IPaddr Success
mach_down[2292]:        2008/12/16_02:26:19 info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[2292]:        2008/12/16_02:26:19 info: mach_down takeover complete for node primary.
heartbeat[2204]: 2008/12/16_02:26:19 info: mach_down takeover complete.
heartbeat[2204]: 2008/12/16_02:26:19 info: Initial resource acquisition complete (mach_down)
heartbeat[2261]: 2008/12/16_02:26:20 WARN: glib: TTY write timeout on [/dev/ttyS1] (no connection or bad cable? [see documentation])
heartbeat[2261]: 2008/12/16_02:26:20 info: glib: See http://linux-ha.org/FAQ#TTYtimeout for details
heartbeat[2204]: 2008/12/16_02:26:29 info: Local Resource acquisition completed. (none)
heartbeat[2204]: 2008/12/16_02:26:29 info: local resource transition completed.
hb_standby[2812]:       2008/12/16_02:26:49 Going standby [foreign].
heartbeat[2204]: 2008/12/16_02:26:50 info: secondary wants to go standby [foreign]
heartbeat[2204]: 2008/12/16_02:27:00 WARN: No reply to standby request. Standby request cancelled.
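
To narrow the log above down to the entries heartbeat itself flags as
failures, a filter along these lines could help.  It is a minimal sketch
that assumes the ha-log line format shown above (severity keyword right
after the timestamp); ha_problems is a made-up helper name.

```sh
#!/bin/sh
# ha_problems is a hypothetical helper; it assumes the ha-log format above,
# where the severity (WARN:, ERROR:, CRIT:) follows the timestamp.
# Usage: ha_problems [LOGFILE]   (defaults to /var/log/ha-log)
ha_problems() {
    # Keep only ERROR and CRIT entries; WARN and info lines are dropped.
    grep -E ' (ERROR|CRIT): ' "${1:-/var/log/ha-log}"
}

# Example:
# ha_problems /var/log/ha-log
```

Against the log in this post, that leaves just the two ResourceManager
lines about /etc/init.d/nginx.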




Thanks for your time and additional assistance.


Scott
