Re: [Linux-HA] secondary does not replace primary if it's booted while primary is off.

Dejan Muhamedagic Tue, 16 Dec 2008 05:34:34 -0800

Hi,

On Tue, Dec 16, 2008 at 02:43:19AM -0700, Scott Edwards wrote:
> I've tested a few different scenarios, and there's one that's got me
> perplexed.  I first started reading http://www.linuxjournal.com/article/5862.
> I've used the boilerplate docs in
> /usr/share/doc/heartbeat-2/GettingStarted.html, and Google to search the
> mailing lists in an attempt to understand how to operate HA the way I would
> hope and expect it to run.
> 
> I'm testing this on two fairly similar pieces of hardware (no
> virtualization etc.). System specs are at least P2Ghz 512MBram.  As you'll
> see below, I'm using a null modem cable too.  I'm testing this to prove the
> setup before it's deployed.  For the time being I'm only using a lean web
> server just to have some service to probe and test on.
> 
> Expected: (at least by me)
> 
> 1. When only the primary is booted, it takes the resources just fine.
> 2. When both systems are running, and primary is active, secondary can be
> shut down (init 0) or abruptly shut down (sysrq u, s, o: unmount, sync,
> off), and primary stays active.
> 3. When secondary is active, primary takes over as soon as it can.
> 
> Unexpected: (again, my perspective)
> 
> When the primary is off, and the secondary is booted, it will not take
> resources.
> 
> 1. primary: init 0
> 2. secondary: init 6
> 
> After these steps, I want the secondary (even after 20 seconds or so) to
> jump up and assume the active role.  My continuous ping shows fifteen
> minutes and counting. I don't think secondary will become active
> (master.example.com).
> 
> Here are the related config files mentioned in the FAQ. (and others)
> 
> The systems are running Debian Etch.
> 
> secondary:~# dpkg -l heartb\*
> Desired=Unknown/Install/Remove/Purge/Hold
> | Status=Not/Installed/Config-files/Unpacked/Failed-config/Half-installed
> |/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
> ||/ Name                        Version                     Description
> +++-===========================-===========================-======================================================================
> un  heartbeat                   <none>                      (no description available)
> ii  heartbeat-2                 2.0.7-2                     Subsystem for High-Availability Linux
> 
> secondary:~# cat /etc/ha.d/ha.cf
> serial          /dev/ttyS1
> watchdog        /dev/watchdog
> debugfile       /var/log/ha-debug
> logfile         /var/log/ha-log
> logfacility     local0
> keepalive       2
> deadtime        10
> udpport         694
> bcast           eth0
> node            primary
> node            secondary
> ping            10.141.0.1
> auto_failback   on
> 
> secondary:~# cat /etc/ha.d/haresources
> primary 10.141.2.7 nginx
> 
> secondary:~# cat /etc/hosts | sed s/not-important/example/g
> 127.0.0.1       localhost
> 10.141.0.1      router.example.com        router
> 10.141.2.7      master.example.com      master
> 10.141.2.8      primary.example.com     primary
> 10.141.2.9      secondary.example.com   secondary
> 
> # The following lines are desirable for IPv6 capable hosts
> ::1     ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
> ff02::3 ip6-allhosts
> 
> secondary:~# uname -n
> secondary
> secondary:~# hostname
> secondary
> secondary:~# hostname -f | sed s/not-important/example/
> secondary.example.com
> secondary:~# cat /etc/resolv.conf
> search example.com
> nameserver 10.141.0.1
> 
> secondary:~# ifconfig -a
> eth0      Link encap:Ethernet  HWaddr 00:E0:18:BE:0E:51
>           inet addr:10.141.2.9  Bcast:10.141.7.255  Mask:255.255.248.0
>           inet6 addr: fe80::2e0:18ff:febe:e51/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:3623 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:2784 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:663897 (648.3 KiB)  TX bytes:827959 (808.5 KiB)
> 
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:2 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:168 (168.0 b)  TX bytes:168 (168.0 b)
> 
> On boot, it will take the active role for just a few seconds:
> 
> From 10.141.10.1 icmp_seq=25263 Destination Host Unreachable
> 64 bytes from 10.141.2.7: icmp_seq=25264 ttl=63 time=539 ms
> 64 bytes from 10.141.2.7: icmp_seq=25265 ttl=63 time=0.741 ms
> 64 bytes from 10.141.2.7: icmp_seq=25266 ttl=63 time=0.764 ms
> From 10.141.10.1 icmp_seq=25274 Destination Host Unreachable
> 
> For this post, I shut down heartbeat, rotated the logs, and rebooted (to
> reduce extra logs).  If they're really needed, I can attach them in a
> follow-up reply.
> 
> secondary:~# cat /var/log/ha-log
> heartbeat[2203]: 2008/12/16_02:25:51 WARN: Logging daemon is disabled
> --enabling logging daemon is recommended
> heartbeat[2203]: 2008/12/16_02:25:51 info: **************************
> heartbeat[2203]: 2008/12/16_02:25:51 info: Configuration validated. Starting
> heartbeat 2.0.7
> heartbeat[2204]: 2008/12/16_02:25:51 info: heartbeat: version 2.0.7
> heartbeat[2204]: 2008/12/16_02:25:58 info: Heartbeat generation: 8
> heartbeat[2204]: 2008/12/16_02:25:58 info: G_main_add_TriggerHandler: Added
> signal manual handler
> heartbeat[2204]: 2008/12/16_02:25:58 info: G_main_add_TriggerHandler: Added
> signal manual handler
> heartbeat[2204]: 2008/12/16_02:25:58 info: Removing
> /var/run/heartbeat/rsctmp failed, recreating.
> heartbeat[2204]: 2008/12/16_02:25:58 info: glib: Starting serial heartbeat
> on tty /dev/ttyS1 (19200 baud)
> heartbeat[2204]: 2008/12/16_02:25:58 info: glib: UDP Broadcast heartbeat
> started on port 694 (694) interface eth0
> heartbeat[2204]: 2008/12/16_02:25:58 info: glib: UDP Broadcast heartbeat
> closed on port 694 interface eth0 - Status: 1
> heartbeat[2204]: 2008/12/16_02:25:58 info: glib: ping heartbeat started.
> heartbeat[2204]: 2008/12/16_02:25:58 notice: Using watchdog device:
> /dev/watchdog
> heartbeat[2204]: 2008/12/16_02:25:58 info: G_main_add_SignalHandler: Added
> signal handler for signal 17
> heartbeat[2204]: 2008/12/16_02:25:58 info: Local status now set to: 'up'
> heartbeat[2204]: 2008/12/16_02:25:59 info: Link 10.141.0.1:10.141.0.1 up.
> heartbeat[2204]: 2008/12/16_02:25:59 info: Status update for node 10.141.0.1:
> status ping
> heartbeat[2204]: 2008/12/16_02:25:59 info: Link secondary:eth0 up.
> heartbeat[2204]: 2008/12/16_02:26:18 WARN: node primary: is dead
> heartbeat[2204]: 2008/12/16_02:26:18 info: Comm_now_up(): updating status to
> active
> heartbeat[2204]: 2008/12/16_02:26:18 info: Local status now set to: 'active'
> heartbeat[2204]: 2008/12/16_02:26:18 WARN: No STONITH device configured.
> heartbeat[2204]: 2008/12/16_02:26:18 WARN: Shared disks are not protected.
> heartbeat[2204]: 2008/12/16_02:26:18 info: Resources being acquired from
> primary.
> harc[2272]:     2008/12/16_02:26:18 info: Running /etc/ha.d/rc.d/status
> status
> heartbeat[2273]: 2008/12/16_02:26:18 info: No local resources
> [/usr/lib/heartbeat/ResourceManager listkeys secondary] to acquire.
> mach_down[2292]:        2008/12/16_02:26:19 info: Taking over resource group
> 10.141.2.7
> ResourceManager[2312]:  2008/12/16_02:26:19 info: Acquiring resource group:
> primary 10.141.2.7 nginx
> IPaddr[2336]:   2008/12/16_02:26:19 INFO: IPaddr Resource is stopped
> ResourceManager[2312]:  2008/12/16_02:26:19 info: Running
> /etc/ha.d/resource.d/IPaddr 10.141.2.7 start
> IPaddr[2513]:   2008/12/16_02:26:19 INFO: eval /sbin/ifconfig eth0:0
> 10.141.2.7 netmask 255.255.248.0 broadcast 10.141.7.255
> IPaddr[2513]:   2008/12/16_02:26:19 INFO: Sending Gratuitous Arp for
> 10.141.2.7 on eth0:0 [eth0]
> IPaddr[2513]:   2008/12/16_02:26:19 INFO: /usr/lib/heartbeat/send_arp -i 500
> -r 10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-10.141.2.7 eth0
> 10.141.2.7 auto 10.141.2.7 ffffffffffff
> IPaddr[2443]:   2008/12/16_02:26:19 INFO: IPaddr Success
> ResourceManager[2312]:  2008/12/16_02:26:19 info: Running /etc/init.d/nginx
> start
> ResourceManager[2312]:  2008/12/16_02:26:19 ERROR: Return code 1 from
> /etc/init.d/nginx
> ResourceManager[2312]:  2008/12/16_02:26:19 CRIT: Giving up resources due to
> failure of nginx


Here's one problem that you need to fix: /etc/init.d/nginx returned 1
on start, so heartbeat gave up the resources. All resource agents
should behave correctly before you start building the cluster. Please
see http://www.linux-ha.org/ResourceAgent
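
One way to check an init script by hand, before heartbeat gets involved,
is to run it through the sequence of actions a cluster manager would use
and look at the exit codes. The sketch below assumes the usual LSB init
script conventions (start and stop exit 0 even when the service is
already started or stopped; status exits 0 when running and 3 when
stopped). check_step and check_init_script are hypothetical helper
names, not heartbeat tools.

```shell
#!/bin/sh
# Sketch only: exercise an init script and compare exit codes with the
# LSB conventions assumed above. Helper names are made up for this example.

# Run "<script> <action>" and compare its exit code with the expected one.
check_step() {
    script=$1; action=$2; expected=$3
    rc=0
    "$script" "$action" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq "$expected" ]; then
        echo "ok:   $action returned $rc"
    else
        echo "FAIL: $action returned $rc, expected $expected"
        return 1
    fi
}

# Run the whole sequence; returns non-zero if any step misbehaves.
check_init_script() {
    script=$1
    failed=0
    check_step "$script" start  0 || failed=1   # start must succeed
    check_step "$script" status 0 || failed=1   # running: status is 0
    check_step "$script" start  0 || failed=1   # start while running must still be 0
    check_step "$script" stop   0 || failed=1   # stop must succeed
    check_step "$script" status 3 || failed=1   # stopped: status is 3
    check_step "$script" stop   0 || failed=1   # stop while stopped must still be 0
    return $failed
}

# Example (as root, on a test machine; this really starts and stops the service):
#   check_init_script /etc/init.d/nginx
```

A script that fails the "start while running" step would produce exactly
the kind of "Return code 1" seen in the log if something else (for
example the normal boot-time rc scripts) had already started nginx. That
cause is only a guess here; the log does not confirm it.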

Thanks,

Dejan

> ResourceManager[2312]:  2008/12/16_02:26:19 info: Releasing resource group:
> primary 10.141.2.7 nginx
> ResourceManager[2312]:  2008/12/16_02:26:19 info: Running /etc/init.d/nginx
> stop
> ResourceManager[2312]:  2008/12/16_02:26:19 info: Running
> /etc/ha.d/resource.d/IPaddr 10.141.2.7 stop
> IPaddr[2750]:   2008/12/16_02:26:19 INFO: /sbin/route -n del -host
> 10.141.2.7
> IPaddr[2750]:   2008/12/16_02:26:19 INFO: /sbin/ifconfig eth0:0 10.141.2.7
> down
> IPaddr[2750]:   2008/12/16_02:26:19 INFO: IP Address 10.141.2.7 released
> IPaddr[2680]:   2008/12/16_02:26:19 INFO: IPaddr Success
> mach_down[2292]:        2008/12/16_02:26:19 info:
> /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
> mach_down[2292]:        2008/12/16_02:26:19 info: mach_down takeover
> complete for node primary.
> heartbeat[2204]: 2008/12/16_02:26:19 info: mach_down takeover complete.
> heartbeat[2204]: 2008/12/16_02:26:19 info: Initial resource acquisition
> complete (mach_down)
> heartbeat[2261]: 2008/12/16_02:26:20 WARN: glib: TTY write timeout on
> [/dev/ttyS1] (no connection or bad cable? [see documentation])
> heartbeat[2261]: 2008/12/16_02:26:20 info: glib: See
> http://linux-ha.org/FAQ#TTYtimeout for details
> heartbeat[2204]: 2008/12/16_02:26:29 info: Local Resource acquisition
> completed. (none)
> heartbeat[2204]: 2008/12/16_02:26:29 info: local resource transition
> completed.
> hb_standby[2812]:       2008/12/16_02:26:49 Going standby [foreign].
> heartbeat[2204]: 2008/12/16_02:26:50 info: secondary wants to go standby
> [foreign]
> heartbeat[2204]: 2008/12/16_02:27:00 WARN: No reply to standby request.
> Standby request cancelled.
> 
> 
> 
> 
> Thanks for your time and additional assistance.
> 
> 
> Scott
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
