I've tested a few different scenarios, and one of them has me perplexed.
I started by reading http://www.linuxjournal.com/article/5862, then worked
through the boilerplate docs in
/usr/share/doc/heartbeat-2/GettingStarted.html and searched the mailing
lists via Google, trying to understand how to get HA to run the way I hope
and expect it to.
I'm testing this on two fairly similar pieces of hardware (no
virtualization etc.). System specs are at least P 2GHz, 512MB RAM. As
you'll see below, I'm using a null modem cable too. I'm testing this to
prove the setup before it's deployed. For the time being I'm only running a
lean web server, just to have some service to probe and test against.
Expected (at least by me):
1. When only the primary is booted, it takes the resources just fine.
2. When both systems are running and primary is active, secondary can be
shut down cleanly (init 0) or abruptly (sysrq u, s, o: unmount, sync,
off), and primary stays active.
3. When secondary is active, primary takes over as soon as it can.
Unexpected (again, from my perspective):
When the primary is off and the secondary is booted, the secondary will
not take the resources.
1. primary: init 0
2. secondary: init 6
After these steps, I want the secondary (even if it takes 20 seconds or
so) to come up and assume the active role. My continuous ping is at fifteen
minutes and counting, and I don't think secondary will ever become active
(master.example.com).
Here are the related config files mentioned in the FAQ (and a few others).
The systems are running Debian Etch.
secondary:~# dpkg -l heartb\*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Installed/Config-files/Unpacked/Failed-config/Half-installed
|/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
||/ Name         Version  Description
+++-============-========-=====================================
un  heartbeat    <none>   (no description available)
ii  heartbeat-2  2.0.7-2  Subsystem for High-Availability Linux
secondary:~# cat /etc/ha.d/ha.cf
serial /dev/ttyS1
watchdog /dev/watchdog
debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 10
udpport 694
bcast eth0
node primary
node secondary
ping 10.141.0.1
auto_failback on
secondary:~# cat /etc/ha.d/haresources
primary 10.141.2.7 nginx
secondary:~# cat /etc/hosts | sed s/not-important/example/g
127.0.0.1 localhost
10.141.0.1 router.example.com router
10.141.2.7 master.example.com master
10.141.2.8 primary.example.com primary
10.141.2.9 secondary.example.com secondary
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
secondary:~# uname -n
secondary
secondary:~# hostname
secondary
secondary:~# hostname -f | sed s/not-important/example/
secondary.example.com
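As I understand it, heartbeat matches the node lines in ha.cf against the
output of uname -n, which is why I included both above. Here is a minimal
sketch of that check in shell. node_listed is a helper name I made up for
this post, not something that ships with heartbeat:

```shell
# node_listed FILE NAME
# Succeeds (exit 0) if FILE contains a "node" directive that names NAME,
# fails (exit 1) otherwise. Hypothetical helper, not part of heartbeat.
node_listed() {
    awk -v n="$2" '
        $1 == "node" {
            # compare every name given on the node line
            for (i = 2; i <= NF; i++)
                if ($i == n) found = 1
        }
        END { exit !found }
    ' "$1"
}
```

On either machine it would be called as:

    node_listed /etc/ha.d/ha.cf "$(uname -n)" && echo listed || echo "NOT listed"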
secondary:~# cat /etc/resolv.conf
search example.com
nameserver 10.141.0.1
secondary:~# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:E0:18:BE:0E:51
          inet addr:10.141.2.9  Bcast:10.141.7.255  Mask:255.255.248.0
          inet6 addr: fe80::2e0:18ff:febe:e51/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3623 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2784 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:663897 (648.3 KiB)  TX bytes:827959 (808.5 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:168 (168.0 b)  TX bytes:168 (168.0 b)
On boot, the secondary takes the active role for just a few seconds and
then drops it again:
From 10.141.10.1 icmp_seq=25263 Destination Host Unreachable
64 bytes from 10.141.2.7: icmp_seq=25264 ttl=63 time=539 ms
64 bytes from 10.141.2.7: icmp_seq=25265 ttl=63 time=0.741 ms
64 bytes from 10.141.2.7: icmp_seq=25266 ttl=63 time=0.764 ms
From 10.141.10.1 icmp_seq=25274 Destination Host Unreachable
For this post, I shut down heartbeat, rotated the logs, and rebooted (to
cut down on extra log noise). If the older logs are really needed, I can
attach them in a follow-up reply.
secondary:~# cat /var/log/ha-log
heartbeat[2203]: 2008/12/16_02:25:51 WARN: Logging daemon is disabled
--enabling logging daemon is recommended
heartbeat[2203]: 2008/12/16_02:25:51 info: **************************
heartbeat[2203]: 2008/12/16_02:25:51 info: Configuration validated. Starting
heartbeat 2.0.7
heartbeat[2204]: 2008/12/16_02:25:51 info: heartbeat: version 2.0.7
heartbeat[2204]: 2008/12/16_02:25:58 info: Heartbeat generation: 8
heartbeat[2204]: 2008/12/16_02:25:58 info: G_main_add_TriggerHandler: Added
signal manual handler
heartbeat[2204]: 2008/12/16_02:25:58 info: G_main_add_TriggerHandler: Added
signal manual handler
heartbeat[2204]: 2008/12/16_02:25:58 info: Removing
/var/run/heartbeat/rsctmp failed, recreating.
heartbeat[2204]: 2008/12/16_02:25:58 info: glib: Starting serial heartbeat
on tty /dev/ttyS1 (19200 baud)
heartbeat[2204]: 2008/12/16_02:25:58 info: glib: UDP Broadcast heartbeat
started on port 694 (694) interface eth0
heartbeat[2204]: 2008/12/16_02:25:58 info: glib: UDP Broadcast heartbeat
closed on port 694 interface eth0 - Status: 1
heartbeat[2204]: 2008/12/16_02:25:58 info: glib: ping heartbeat started.
heartbeat[2204]: 2008/12/16_02:25:58 notice: Using watchdog device:
/dev/watchdog
heartbeat[2204]: 2008/12/16_02:25:58 info: G_main_add_SignalHandler: Added
signal handler for signal 17
heartbeat[2204]: 2008/12/16_02:25:58 info: Local status now set to: 'up'
heartbeat[2204]: 2008/12/16_02:25:59 info: Link 10.141.0.1:10.141.0.1 up.
heartbeat[2204]: 2008/12/16_02:25:59 info: Status update for node 10.141.0.1:
status ping
heartbeat[2204]: 2008/12/16_02:25:59 info: Link secondary:eth0 up.
heartbeat[2204]: 2008/12/16_02:26:18 WARN: node primary: is dead
heartbeat[2204]: 2008/12/16_02:26:18 info: Comm_now_up(): updating status to
active
heartbeat[2204]: 2008/12/16_02:26:18 info: Local status now set to: 'active'
heartbeat[2204]: 2008/12/16_02:26:18 WARN: No STONITH device configured.
heartbeat[2204]: 2008/12/16_02:26:18 WARN: Shared disks are not protected.
heartbeat[2204]: 2008/12/16_02:26:18 info: Resources being acquired from
primary.
harc[2272]: 2008/12/16_02:26:18 info: Running /etc/ha.d/rc.d/status
status
heartbeat[2273]: 2008/12/16_02:26:18 info: No local resources
[/usr/lib/heartbeat/ResourceManager listkeys secondary] to acquire.
mach_down[2292]: 2008/12/16_02:26:19 info: Taking over resource group
10.141.2.7
ResourceManager[2312]: 2008/12/16_02:26:19 info: Acquiring resource group:
primary 10.141.2.7 nginx
IPaddr[2336]: 2008/12/16_02:26:19 INFO: IPaddr Resource is stopped
ResourceManager[2312]: 2008/12/16_02:26:19 info: Running
/etc/ha.d/resource.d/IPaddr 10.141.2.7 start
IPaddr[2513]: 2008/12/16_02:26:19 INFO: eval /sbin/ifconfig eth0:0
10.141.2.7 netmask 255.255.248.0 broadcast 10.141.7.255
IPaddr[2513]: 2008/12/16_02:26:19 INFO: Sending Gratuitous Arp for
10.141.2.7 on eth0:0 [eth0]
IPaddr[2513]: 2008/12/16_02:26:19 INFO: /usr/lib/heartbeat/send_arp -i 500
-r 10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-10.141.2.7 eth0
10.141.2.7 auto 10.141.2.7 ffffffffffff
IPaddr[2443]: 2008/12/16_02:26:19 INFO: IPaddr Success
ResourceManager[2312]: 2008/12/16_02:26:19 info: Running /etc/init.d/nginx
start
ResourceManager[2312]: 2008/12/16_02:26:19 ERROR: Return code 1 from
/etc/init.d/nginx
ResourceManager[2312]: 2008/12/16_02:26:19 CRIT: Giving up resources due to
failure of nginx
ResourceManager[2312]: 2008/12/16_02:26:19 info: Releasing resource group:
primary 10.141.2.7 nginx
ResourceManager[2312]: 2008/12/16_02:26:19 info: Running /etc/init.d/nginx
stop
ResourceManager[2312]: 2008/12/16_02:26:19 info: Running
/etc/ha.d/resource.d/IPaddr 10.141.2.7 stop
IPaddr[2750]: 2008/12/16_02:26:19 INFO: /sbin/route -n del -host
10.141.2.7
IPaddr[2750]: 2008/12/16_02:26:19 INFO: /sbin/ifconfig eth0:0 10.141.2.7
down
IPaddr[2750]: 2008/12/16_02:26:19 INFO: IP Address 10.141.2.7 released
IPaddr[2680]: 2008/12/16_02:26:19 INFO: IPaddr Success
mach_down[2292]: 2008/12/16_02:26:19 info:
/usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
mach_down[2292]: 2008/12/16_02:26:19 info: mach_down takeover
complete for node primary.
heartbeat[2204]: 2008/12/16_02:26:19 info: mach_down takeover complete.
heartbeat[2204]: 2008/12/16_02:26:19 info: Initial resource acquisition
complete (mach_down)
heartbeat[2261]: 2008/12/16_02:26:20 WARN: glib: TTY write timeout on
[/dev/ttyS1] (no connection or bad cable? [see documentation])
heartbeat[2261]: 2008/12/16_02:26:20 info: glib: See
http://linux-ha.org/FAQ#TTYtimeout for details
heartbeat[2204]: 2008/12/16_02:26:29 info: Local Resource acquisition
completed. (none)
heartbeat[2204]: 2008/12/16_02:26:29 info: local resource transition
completed.
hb_standby[2812]: 2008/12/16_02:26:49 Going standby [foreign].
heartbeat[2204]: 2008/12/16_02:26:50 info: secondary wants to go standby
[foreign]
heartbeat[2204]: 2008/12/16_02:27:00 WARN: No reply to standby request.
Standby request cancelled.
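To cut through the noise, the log can be filtered down to its ERROR and
CRIT entries. This is a small sketch that assumes the line format shown
above (name[pid]: date LEVEL: message). ha_problems is a name I made up
for illustration:

```shell
# ha_problems [FILE...]
# Print only the ERROR and CRIT entries of a heartbeat log.
# Reads the named files, or standard input when none are given.
# Hypothetical helper, not part of heartbeat.
ha_problems() {
    grep -E ' (ERROR|CRIT): ' "$@"
}
```

Run as "ha_problems /var/log/ha-log" against the log above, it leaves only
the two ResourceManager entries about nginx (return code 1 from
/etc/init.d/nginx, then giving up resources), which is right where the
secondary lets go of 10.141.2.7.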
Thanks for your time and additional assistance.
Scott