Hi,

On Tue, Dec 16, 2008 at 02:43:19AM -0700, Scott Edwards wrote:
> I've tested a few different scenarios, and there's one that's got me
> perplexed. I first started reading http://www.linuxjournal.com/article/5862.
> I've used the boilerplate docs in
> /usr/share/doc/heartbeat-2/GettingStarted.html, and Google to search the
> mailing lists in an attempt to understand how to operate HA the way I would
> hope and expect it to run.
>
> I'm testing this on two fairly similar pieces of hardware (no
> virtualization etc.). System specs are at least P2Ghz 512MBram. As you'll
> see below, I'm using a null modem cable too. I'm testing this to prove the
> setup before it's deployed. For the time being I'm only using a lean web
> server just to have some service to probe and test on.
>
> Expected: (at least by me)
>
> 1. When only the primary is booted, it takes the resources just fine.
> 2. When both systems are running, and primary is active, secondary can be
> shut down (init 0) or abruptly shut down (sysrq u, s, o: unmount, sync, off),
> and primary stays active.
> 3. When secondary is active, primary takes over as soon as it can.
>
> Unexpected: (again, my perspective)
>
> When the primary is off, and the secondary is booted, it will not take
> resources.
>
> 1. primary: init 0
> 2. secondary: init 6
>
> After these steps, I want the secondary (even after 20 seconds or so) to
> jump up and assume the active role. My continuous ping shows fifteen
> minutes and counting. I don't think secondary will become active (
> master.example.com).
>
> Here are the related config files mentioned in the FAQ. (and others)
>
> The systems are running Debian Etch.
>
> secondary:~# dpkg -l heartb\*
> Desired=Unknown/Install/Remove/Purge/Hold
> | Status=Not/Installed/Config-files/Unpacked/Failed-config/Half-installed
> |/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err:
> uppercase=bad)
> ||/ Name Version Description
> +++-===========================-===========================-======================================================================
> un heartbeat <none> (no description
> available)
> ii heartbeat-2 2.0.7-2 Subsystem for
> High-Availability Linux
>
> secondary:~# cat /etc/ha.d/ha.cf
> serial /dev/ttyS1
> watchdog /dev/watchdog
> debugfile /var/log/ha-debug
> logfile /var/log/ha-log
> logfacility local0
> keepalive 2
> deadtime 10
> udpport 694
> bcast eth0
> node primary
> node secondary
> ping 10.141.0.1
> auto_failback on
>
> secondary:~# cat /etc/ha.d/haresources
> primary 10.141.2.7 nginx
>
> secondary:~# cat /etc/hosts | sed s/not-important/example/g
> 127.0.0.1 localhost
> 10.141.0.1 router.example.com router
> 10.141.2.7 master.example.com master
> 10.141.2.8 primary.example.com primary
> 10.141.2.9 secondary.example.com secondary
>
> # The following lines are desirable for IPv6 capable hosts
> ::1 ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
> ff02::3 ip6-allhosts
>
> secondary:~# uname -n
> secondary
> secondary:~# hostname
> secondary
> secondary:~# hostname -f | sed s/not-important/example/
> secondary.example.com
> secondary:~# cat /etc/resolv.conf
> search example.com
> nameserver 10.141.0.1
>
> secondary:~# ifconfig -a
> eth0 Link encap:Ethernet HWaddr 00:E0:18:BE:0E:51
>      inet addr:10.141.2.9 Bcast:10.141.7.255 Mask:255.255.248.0
>      inet6 addr: fe80::2e0:18ff:febe:e51/64 Scope:Link
>      UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>      RX packets:3623 errors:0 dropped:0 overruns:0 frame:0
>      TX packets:2784 errors:0 dropped:0 overruns:0 carrier:0
>      collisions:0 txqueuelen:1000
>      RX bytes:663897 (648.3 KiB) TX bytes:827959 (808.5 KiB)
>
> lo   Link encap:Local Loopback
>      inet addr:127.0.0.1 Mask:255.0.0.0
>      inet6 addr: ::1/128 Scope:Host
>      UP LOOPBACK RUNNING MTU:16436 Metric:1
>      RX packets:2 errors:0 dropped:0 overruns:0 frame:0
>      TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
>      collisions:0 txqueuelen:0
>      RX bytes:168 (168.0 b) TX bytes:168 (168.0 b)
>
> On boot, it will take the active role for just a few seconds,
>
> From 10.141.10.1 icmp_seq=25263 Destination Host Unreachable
> 64 bytes from 10.141.2.7: icmp_seq=25264 ttl=63 time=539 ms
> 64 bytes from 10.141.2.7: icmp_seq=25265 ttl=63 time=0.741 ms
> 64 bytes from 10.141.2.7: icmp_seq=25266 ttl=63 time=0.764 ms
> From 10.141.10.1 icmp_seq=25274 Destination Host Unreachable
>
> For this post, I shut down heartbeat, rotated the logs, and rebooted (to
> reduce extra logs). If they're really needed, I can attach them in a
> follow-up reply.
>
> secondary:~# cat /var/log/ha-log
> heartbeat[2203]: 2008/12/16_02:25:51 WARN: Logging daemon is disabled
> --enabling logging daemon is recommended
> heartbeat[2203]: 2008/12/16_02:25:51 info: **************************
> heartbeat[2203]: 2008/12/16_02:25:51 info: Configuration validated. Starting
> heartbeat 2.0.7
> heartbeat[2204]: 2008/12/16_02:25:51 info: heartbeat: version 2.0.7
> heartbeat[2204]: 2008/12/16_02:25:58 info: Heartbeat generation: 8
> heartbeat[2204]: 2008/12/16_02:25:58 info: G_main_add_TriggerHandler: Added
> signal manual handler
> heartbeat[2204]: 2008/12/16_02:25:58 info: G_main_add_TriggerHandler: Added
> signal manual handler
> heartbeat[2204]: 2008/12/16_02:25:58 info: Removing
> /var/run/heartbeat/rsctmp failed, recreating.
> heartbeat[2204]: 2008/12/16_02:25:58 info: glib: Starting serial heartbeat
> on tty /dev/ttyS1 (19200 baud)
> heartbeat[2204]: 2008/12/16_02:25:58 info: glib: UDP Broadcast heartbeat
> started on port 694 (694) interface eth0
> heartbeat[2204]: 2008/12/16_02:25:58 info: glib: UDP Broadcast heartbeat
> closed on port 694 interface eth0 - Status: 1
> heartbeat[2204]: 2008/12/16_02:25:58 info: glib: ping heartbeat started.
> heartbeat[2204]: 2008/12/16_02:25:58 notice: Using watchdog device:
> /dev/watchdog
> heartbeat[2204]: 2008/12/16_02:25:58 info: G_main_add_SignalHandler: Added
> signal handler for signal 17
> heartbeat[2204]: 2008/12/16_02:25:58 info: Local status now set to: 'up'
> heartbeat[2204]: 2008/12/16_02:25:59 info: Link 10.141.0.1:10.141.0.1 up.
> heartbeat[2204]: 2008/12/16_02:25:59 info: Status update for node 10.141.0.1:
> status ping
> heartbeat[2204]: 2008/12/16_02:25:59 info: Link secondary:eth0 up.
> heartbeat[2204]: 2008/12/16_02:26:18 WARN: node primary: is dead
> heartbeat[2204]: 2008/12/16_02:26:18 info: Comm_now_up(): updating status to
> active
> heartbeat[2204]: 2008/12/16_02:26:18 info: Local status now set to: 'active'
> heartbeat[2204]: 2008/12/16_02:26:18 WARN: No STONITH device configured.
> heartbeat[2204]: 2008/12/16_02:26:18 WARN: Shared disks are not protected.
> heartbeat[2204]: 2008/12/16_02:26:18 info: Resources being acquired from
> primary.
> harc[2272]: 2008/12/16_02:26:18 info: Running /etc/ha.d/rc.d/status
> status
> heartbeat[2273]: 2008/12/16_02:26:18 info: No local resources
> [/usr/lib/heartbeat/ResourceManager listkeys secondary] to acquire.
> mach_down[2292]: 2008/12/16_02:26:19 info: Taking over resource group
> 10.141.2.7
> ResourceManager[2312]: 2008/12/16_02:26:19 info: Acquiring resource group:
> primary 10.141.2.7 nginx
> IPaddr[2336]: 2008/12/16_02:26:19 INFO: IPaddr Resource is stopped
> ResourceManager[2312]: 2008/12/16_02:26:19 info: Running
> /etc/ha.d/resource.d/IPaddr 10.141.2.7 start
> IPaddr[2513]: 2008/12/16_02:26:19 INFO: eval /sbin/ifconfig eth0:0
> 10.141.2.7 netmask 255.255.248.0 broadcast 10.141.7.255
> IPaddr[2513]: 2008/12/16_02:26:19 INFO: Sending Gratuitous Arp for
> 10.141.2.7 on eth0:0 [eth0]
> IPaddr[2513]: 2008/12/16_02:26:19 INFO: /usr/lib/heartbeat/send_arp -i 500
> -r 10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-10.141.2.7 eth0
> 10.141.2.7 auto 10.141.2.7 ffffffffffff
> IPaddr[2443]: 2008/12/16_02:26:19 INFO: IPaddr Success
> ResourceManager[2312]: 2008/12/16_02:26:19 info: Running /etc/init.d/nginx
> start
> ResourceManager[2312]: 2008/12/16_02:26:19 ERROR: Return code 1 from
> /etc/init.d/nginx
> ResourceManager[2312]: 2008/12/16_02:26:19 CRIT: Giving up resources due to
> failure of nginx

Here's one problem that you need to fix: "/etc/init.d/nginx start"
returned 1, so heartbeat gave up the resource group. All resource
agents should behave correctly before you start building the cluster.
Please see http://www.linux-ha.org/ResourceAgent

Thanks,

Dejan

> ResourceManager[2312]: 2008/12/16_02:26:19 info: Releasing resource group:
> primary 10.141.2.7 nginx
> ResourceManager[2312]: 2008/12/16_02:26:19 info: Running /etc/init.d/nginx
> stop
> ResourceManager[2312]: 2008/12/16_02:26:19 info: Running
> /etc/ha.d/resource.d/IPaddr 10.141.2.7 stop
> IPaddr[2750]: 2008/12/16_02:26:19 INFO: /sbin/route -n del -host
> 10.141.2.7
> IPaddr[2750]: 2008/12/16_02:26:19 INFO: /sbin/ifconfig eth0:0 10.141.2.7
> down
> IPaddr[2750]: 2008/12/16_02:26:19 INFO: IP Address 10.141.2.7 released
> IPaddr[2680]: 2008/12/16_02:26:19 INFO: IPaddr Success
> mach_down[2292]: 2008/12/16_02:26:19 info:
> /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
> mach_down[2292]: 2008/12/16_02:26:19 info: mach_down takeover
> complete for node primary.
> heartbeat[2204]: 2008/12/16_02:26:19 info: mach_down takeover complete.
> heartbeat[2204]: 2008/12/16_02:26:19 info: Initial resource acquisition
> complete (mach_down)
> heartbeat[2261]: 2008/12/16_02:26:20 WARN: glib: TTY write timeout on
> [/dev/ttyS1] (no connection or bad cable? [see documentation])
> heartbeat[2261]: 2008/12/16_02:26:20 info: glib: See
> http://linux-ha.org/FAQ#TTYtimeout for details
> heartbeat[2204]: 2008/12/16_02:26:29 info: Local Resource acquisition
> completed. (none)
> heartbeat[2204]: 2008/12/16_02:26:29 info: local resource transition
> completed.
> hb_standby[2812]: 2008/12/16_02:26:49 Going standby [foreign].
> heartbeat[2204]: 2008/12/16_02:26:50 info: secondary wants to go standby
> [foreign]
> heartbeat[2204]: 2008/12/16_02:27:00 WARN: No reply to standby request.
> Standby request cancelled.
>
>
> Thanks for your time and additional assistance.
>
> Scott

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
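
P.S. As a rough illustration of what a well-behaved agent might look
like from the outside: the log shows "/etc/init.d/nginx start" returning
1. The Python sketch below drives a script through
start/start/status/stop/stop/status and reports every exit code that
differs from what the LSB init-script conventions expect. EXPECTED,
script_runner and check_agent are hypothetical helper names, not part of
heartbeat, and the expected codes are an assumption based on LSB; the
ResourceAgent page remains the reference for what heartbeat itself
requires.

```python
import subprocess
from typing import Callable, List, Tuple

# (action, expected exit code) pairs. These follow the LSB init-script
# conventions and are an assumption, not taken from heartbeat's own code.
EXPECTED: List[Tuple[str, int]] = [
    ("start", 0),
    ("start", 0),    # starting an already running service should succeed
    ("status", 0),   # LSB: 0 means the service is running
    ("stop", 0),
    ("stop", 0),     # stopping an already stopped service should succeed
    ("status", 3),   # LSB: 3 means the service is not running
]


def script_runner(path: str) -> Callable[[str], int]:
    """Return a function that runs `path <action>` and yields its exit code."""
    def run(action: str) -> int:
        result = subprocess.run(
            [path, action],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode
    return run


def check_agent(run: Callable[[str], int]) -> List[Tuple[int, str, int, int]]:
    """Run the EXPECTED sequence; return (step, action, wanted, got) mismatches."""
    failures = []
    for step, (action, wanted) in enumerate(EXPECTED, start=1):
        got = run(action)
        if got != wanted:
            failures.append((step, action, wanted, got))
    return failures
```

On a test machine (as root, with heartbeat stopped) something like
check_agent(script_runner("/etc/init.d/nginx")) should return an empty
list before the script goes into haresources. Note that this really
starts and stops the service.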
