No ideas on this, anyone? ;(
--------------
Hello,

I am using DRBD 0.7 (master + slave config) + heartbeat + Debian etch. I used the same setup on sarge, without issue, for about a year and a half.

After my upgrade to etch, and a few minor scripting changes, I noticed that my boxes were not failing over correctly to the slave when the master was rebooted. Everything works fine if I just pull the plug, but during a controlled reboot of the master, the slave had problems.

On further investigation, I noticed that the slave was attempting a takeover twice. The first attempt came when the master box started the reboot process (in doing so, Debian scripts informed the slave of the reboot, and the takeover started). Then, when the reboot actually happened, heartbeat on the slave noticed the master box was gone, and started a second takeover attempt. Some logs of interest are attached...

Now, I know that all scripts must be written so that multiple takeover attempts will not cause problems, and I've complied with that. My slave box now takes over fine, even if a double takeover attempt happens on it.

However, something odd happens when using the version of heartbeat in etch (1.2.5). I've seen the network interface that heartbeat uses to communicate between the boxes (over a crossover cable) drop packets all over the place after the second takeover. This has happened repeatedly, and I have not yet been able to reproduce the behaviour with the 1.2.3 version from Debian oldstable. Also with the etch version of heartbeat, I've seen it skip a step or two on a release. The logs show it missing a step (although not the logs attached...).

Anyhow, any ideas from my logs as to why the second takeover happens? As well, any ideas at all about heartbeat borking the interface? It doesn't really make sense that heartbeat could cause the problem, but there it is. Pings are dropped all over the place, and heartbeat can no longer communicate effectively on that interface...
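To illustrate what "safe against multiple takeover attempts" means above, here is a minimal Python sketch of the check-before-act pattern. It is not one of the actual resource scripts (those are not shown in this post); `start_if_needed` and `FakeMount` are hypothetical names, and the mount point is simulated in memory rather than touching a real filesystem.

```python
from typing import Callable


def start_if_needed(name: str,
                    is_active: Callable[[], bool],
                    activate: Callable[[], None]) -> bool:
    """Activate a resource unless it is already active.

    Returns True if activate() was called, False if the call was a no-op.
    """
    if is_active():
        return False
    activate()
    return True


class FakeMount:
    """In-memory stand-in for a mount point (hypothetical, illustration only)."""

    def __init__(self) -> None:
        self.mounted = False
        self.mount_calls = 0

    def is_mounted(self) -> bool:
        return self.mounted

    def mount(self) -> None:
        # A naive second mount would fail, which is what the guard prevents.
        if self.mounted:
            raise RuntimeError("already mounted")
        self.mounted = True
        self.mount_calls += 1


if __name__ == "__main__":
    fs = FakeMount()
    for attempt in (1, 2):
        did_work = start_if_needed("nfsraid", fs.is_mounted, fs.mount)
        outcome = "started" if did_work else "already active, skipped"
        print(f"takeover attempt {attempt}: {outcome}")
```

With a guard like this, the second takeover attempt becomes a no-op for that resource instead of an error.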
heartbeat: 2007/05/05_18:55:14 info: Received shutdown notice from 'masterbox.domain'.
heartbeat: 2007/05/05_18:55:14 info: Resources being acquired from masterbox.domain.
heartbeat: 2007/05/05_18:55:14 info: acquire all HA resources (standby).
heartbeat: 2007/05/05_18:55:14 info: Acquiring resource group: slavebox.domain drbddisk::r0 Filesystem::/dev/drbd0::/mnt/nfsraid::ext3::noatime killnfsd sleep::2 nfs-common nfs-kernel-server mysql sleep::6 IPaddr::x.x.x.45/24/eth1 IPaddr::y.y.y.45/24/eth1
heartbeat: 2007/05/05_18:55:14 info: Running /etc/ha.d/resource.d/drbddisk r0 start
heartbeat: 2007/05/05_18:55:14 info: Local Resource acquisition completed.
heartbeat: 2007/05/05_18:55:14 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/nfsraid ext3 noatime start
heartbeat: 2007/05/05_18:55:24 info: Running /etc/ha.d/resource.d/killnfsd start
heartbeat: 2007/05/05_18:55:25 WARN: node masterbox.domain: is dead
heartbeat: 2007/05/05_18:55:25 info: Dead node masterbox.domain gave up resources.
heartbeat: 2007/05/05_18:55:25 info: Link masterbox.domain:eth0 dead.

heartbeat: 2007/05/05_18:55:35 info: Running /etc/ha.d/resource.d/sleep 2 start
heartbeat: 2007/05/05_18:55:37 info: Running /etc/ha.d/resource.d/nfs-common start
heartbeat: 2007/05/05_18:55:37 info: Running /etc/ha.d/resource.d/nfs-kernel-server start
heartbeat: 2007/05/05_18:55:38 info: Running /etc/ha.d/resource.d/mysql start
heartbeat: 2007/05/05_18:55:45 info: Running /etc/ha.d/resource.d/sleep 6 start
heartbeat: 2007/05/05_18:55:52 info: Running /etc/ha.d/resource.d/IPaddr x.x.x.45/24/eth1 start
heartbeat: 2007/05/05_18:55:52 info: /sbin/ifconfig eth1:0 x.x.x.45 netmask 255.255.255.0 broadcast x.x.x.255
heartbeat: 2007/05/05_18:55:52 info: Sending Gratuitous Arp for x.x.x.45 on eth1:0 [eth1]
heartbeat: 2007/05/05_18:55:52 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-x.x.x.45 eth1 x.x.x.45 auto x.x.x.45 ffffffffffff
heartbeat: 2007/05/05_18:55:52 info: Running /etc/ha.d/resource.d/IPaddr y.y.y.45/24/eth1 start
heartbeat: 2007/05/05_18:55:52 info: /sbin/ifconfig eth1:2 y.y.y.45 netmask 255.255.255.0 broadcast y.y.y.255
heartbeat: 2007/05/05_18:55:52 info: Sending Gratuitous Arp for y.y.y.45 on eth1:2 [eth1]
heartbeat: 2007/05/05_18:55:52 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-y.y.y.45 eth1 y.y.y.45 auto y.y.y.45 ffffffffffff
heartbeat: 2007/05/05_18:55:52 info: all HA resource acquisition completed (standby).
heartbeat: 2007/05/05_18:55:52 ERROR: Ignored standby message 'done' from slavebox.domain in state 0
heartbeat: 2007/05/05_18:55:52 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2007/05/05_18:55:52 info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
heartbeat: 2007/05/05_18:55:52 info: mach_down takeover complete.
heartbeat: 2007/05/05_18:55:52 info: mach_down takeover complete for node masterbox.domain.

heartbeat: 2007/05/05_18:55:52 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
heartbeat: 2007/05/05_18:55:52 received ip-request-resp drbddisk::r0 OK yes
heartbeat: 2007/05/05_18:55:52 info: Acquiring resource group: slavebox.domain drbddisk::r0 Filesystem::/dev/drbd0::/mnt/nfsraid::ext3::noatime killnfsd sleep::2 nfs-common nfs-kernel-server mysql sleep::6 IPaddr::x.x.x.45/24/eth1 IPaddr::y.y.y.45/24/eth1
heartbeat: 2007/05/05_18:56:01 info: Running /etc/ha.d/resource.d/killnfsd start
heartbeat: 2007/05/05_18:56:12 info: Running /etc/ha.d/resource.d/sleep 2 start
heartbeat: 2007/05/05_18:56:14 info: Running /etc/ha.d/resource.d/nfs-common start
heartbeat: 2007/05/05_18:56:14 info: Running /etc/ha.d/resource.d/nfs-kernel-server start
heartbeat: 2007/05/05_18:56:14 info: Running /etc/ha.d/resource.d/mysql start
heartbeat: 2007/05/05_18:56:21 info: Running /etc/ha.d/resource.d/sleep 6 start
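In case it helps anyone looking for the same pattern in their own logs, the double takeover can be picked out by counting the "Acquiring resource group" entries. The following is a rough Python sketch, assuming the log file has one entry per line in the `heartbeat: YYYY/MM/DD_HH:MM:SS ...` format seen above. `takeover_times` and `SAMPLE_LOG` are names made up for this example, and the sample is just a handful of lines copied from the excerpt.

```python
import re
from datetime import datetime

# Matches entries such as "heartbeat: 2007/05/05_18:55:14 info: ...".
ENTRY_RE = re.compile(r"^heartbeat: (\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) (.*)$")


def takeover_times(lines):
    """Return the timestamp of every 'Acquiring resource group' entry."""
    times = []
    for line in lines:
        match = ENTRY_RE.match(line.strip())
        if match is None:
            continue  # skip blank or unrecognised lines
        stamp, message = match.groups()
        if "Acquiring resource group:" in message:
            times.append(datetime.strptime(stamp, "%Y/%m/%d_%H:%M:%S"))
    return times


# A few entries copied from the log excerpt above.
SAMPLE_LOG = """\
heartbeat: 2007/05/05_18:55:14 info: Received shutdown notice from 'masterbox.domain'.
heartbeat: 2007/05/05_18:55:14 info: Acquiring resource group: slavebox.domain drbddisk::r0 Filesystem::/dev/drbd0::/mnt/nfsraid::ext3::noatime killnfsd sleep::2 nfs-common nfs-kernel-server mysql sleep::6 IPaddr::x.x.x.45/24/eth1 IPaddr::y.y.y.45/24/eth1
heartbeat: 2007/05/05_18:55:25 WARN: node masterbox.domain: is dead
heartbeat: 2007/05/05_18:55:52 info: mach_down takeover complete.
heartbeat: 2007/05/05_18:55:52 received ip-request-resp drbddisk::r0 OK yes
heartbeat: 2007/05/05_18:55:52 info: Acquiring resource group: slavebox.domain drbddisk::r0 Filesystem::/dev/drbd0::/mnt/nfsraid::ext3::noatime killnfsd sleep::2 nfs-common nfs-kernel-server mysql sleep::6 IPaddr::x.x.x.45/24/eth1 IPaddr::y.y.y.45/24/eth1
"""

if __name__ == "__main__":
    times = takeover_times(SAMPLE_LOG.splitlines())
    print(f"takeover attempts found: {len(times)}")
    # Report the gap between consecutive attempts.
    for earlier, later in zip(times, times[1:]):
        gap = (later - earlier).total_seconds()
        print(f"{earlier.time()} -> {later.time()}: {gap:.0f} s apart")
```

More than one timestamp within a single failover window indicates that the resource group was acquired more than once.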
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
