No ideas on this, anyone? ;(


--------------

Hello,

I am using DRBD 0.7 (master + slave config) with heartbeat on Debian etch.
I had been running the same setup on sarge, without issue, for about a year
and a half.

Anyhow, after my upgrade to etch, and a few minor scripting changes, I
noticed that my boxes were not failing over correctly to the slave when
the master was rebooted. Everything works fine if I just pull the plug,
but during a controlled reboot of the master, the slave had problems.

On further investigation, I noticed that the slave was attempting a
takeover twice.  The first attempt began when the master box started its
reboot process (in doing so, Debian scripts informed the slave of the
reboot, and the takeover started).  Then, once the master actually went
down, heartbeat on the slave noticed it was gone and started a second
takeover attempt.

Some logs of interest are attached...

Now, I know that all scripts must be written so that multiple takeover
attempts will not cause problems, and I've complied with that. 
My slave box now takes over fine, even if a double takeover
attempt happens on it.
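
To make the repeated-takeover requirement concrete, here is a minimal sketch
in Python of a start action that is safe to run twice.  It is an illustration
only, not one of the scripts from this setup: the names is_mounted and start
are hypothetical, and the check simply looks for the mount point in
/proc/mounts-style text.

```python
import subprocess


def is_mounted(mount_point, mounts_text):
    """Return True if mount_point is a mount target in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fstype, options, dump, pass
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


def start(device, mount_point, fstype, mounts_text, run=subprocess.call):
    """Mount the filesystem only if needed.

    A second start while the filesystem is already mounted does nothing and
    reports success (0), so a repeated takeover attempt is harmless.
    """
    if is_mounted(mount_point, mounts_text):
        return 0
    return run(["mount", "-t", fstype, device, mount_point])
```

In real use, mounts_text would be the contents of /proc/mounts read just
before the call; the run parameter exists only so the command can be stubbed
out when trying the sketch.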

Something odd happens when using the version of heartbeat in etch (1.2.5).
I've seen my network interface (the one heartbeat uses to communicate
between the boxes, over a crossover cable) drop packets all over the place
after the second takeover.  It happened repeatedly, and I have not yet
been able to reproduce this behaviour with Debian's oldstable 1.2.3.

Also, using the version of heartbeat in etch, I've seen it skip a step or
two on a release.  The logs show it missing a step (although not in the
logs attached...).

Anyhow, any ideas from my logs as to why the second takeover happens?  Also,
any ideas at all about heartbeat borking the interface?  It doesn't really
make sense that heartbeat could cause the problem, but there it is.  Pings
are dropped all over the place, and heartbeat can no longer communicate
effectively on that interface...
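
One way to put numbers on the packet loss, assuming the Linux /proc/net/dev
file is available on both boxes, is to snapshot the per-interface error and
drop counters before and after a takeover and compare them.  The Python
sketch below does that; parse_net_dev and counter_delta are hypothetical
names chosen for illustration and are not part of heartbeat.

```python
def parse_net_dev(text):
    """Map interface name -> error/drop counters from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # not a complete counter line
        values = [int(field) for field in fields[:16]]
        # Receive columns occupy indexes 0-7, transmit columns 8-15.
        counters[name.strip()] = {
            "rx_errs": values[2],
            "rx_drop": values[3],
            "tx_errs": values[10],
            "tx_drop": values[11],
        }
    return counters


def counter_delta(before, after, interface):
    """Return how much each counter grew between two snapshots."""
    old = parse_net_dev(before)[interface]
    new = parse_net_dev(after)[interface]
    return {key: new[key] - old[key] for key in old}
```

Each snapshot is just the contents of /proc/net/dev read at that moment.
Counters that grow only on the box that went through the double takeover
would at least narrow down where the packets are being lost.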

heartbeat: 2007/05/05_18:55:14 info: Received shutdown notice from 'masterbox.domain'.
heartbeat: 2007/05/05_18:55:14 info: Resources being acquired from masterbox.domain.
heartbeat: 2007/05/05_18:55:14 info: acquire all HA resources (standby).
heartbeat: 2007/05/05_18:55:14 info: Acquiring resource group: slavebox.domain drbddisk::r0 Filesystem::/dev/drbd0::/mnt/nfsraid::ext3::noatime killnfsd sleep::2 nfs-common nfs-kernel-server mysql sleep::6 IPaddr::x.x.x.45/24/eth1 IPaddr::y.y.y.45/24/eth1
heartbeat: 2007/05/05_18:55:14 info: Running /etc/ha.d/resource.d/drbddisk r0 start
heartbeat: 2007/05/05_18:55:14 info: Local Resource acquisition completed.
heartbeat: 2007/05/05_18:55:14 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/nfsraid ext3 noatime start
heartbeat: 2007/05/05_18:55:24 info: Running /etc/ha.d/resource.d/killnfsd  start
heartbeat: 2007/05/05_18:55:25 WARN: node masterbox.domain: is dead
heartbeat: 2007/05/05_18:55:25 info: Dead node masterbox.domain gave up resources.
heartbeat: 2007/05/05_18:55:25 info: Link masterbox.domain:eth0 dead.
heartbeat: 2007/05/05_18:55:35 info: Running /etc/ha.d/resource.d/sleep 2 start
heartbeat: 2007/05/05_18:55:37 info: Running /etc/ha.d/resource.d/nfs-common  start
heartbeat: 2007/05/05_18:55:37 info: Running /etc/ha.d/resource.d/nfs-kernel-server  start
heartbeat: 2007/05/05_18:55:38 info: Running /etc/ha.d/resource.d/mysql  start
heartbeat: 2007/05/05_18:55:45 info: Running /etc/ha.d/resource.d/sleep 6 start
heartbeat: 2007/05/05_18:55:52 info: Running /etc/ha.d/resource.d/IPaddr x.x.x.45/24/eth1 start
heartbeat: 2007/05/05_18:55:52 info: /sbin/ifconfig eth1:0 x.x.x.45 netmask 255.255.255.0    broadcast x.x.x.255
heartbeat: 2007/05/05_18:55:52 info: Sending Gratuitous Arp for x.x.x.45 on eth1:0 [eth1]
heartbeat: 2007/05/05_18:55:52 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-x.x.x.45 eth1 x.x.x.45 auto x.x.x.45 ffffffffffff
heartbeat: 2007/05/05_18:55:52 info: Running /etc/ha.d/resource.d/IPaddr y.y.y.45/24/eth1 start
heartbeat: 2007/05/05_18:55:52 info: /sbin/ifconfig eth1:2 y.y.y.45 netmask 255.255.255.0   broadcast y.y.y.255
heartbeat: 2007/05/05_18:55:52 info: Sending Gratuitous Arp for y.y.y.45 on eth1:2 [eth1]
heartbeat: 2007/05/05_18:55:52 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-y.y.y.45 eth1 y.y.y.45 auto y.y.y.45 ffffffffffff
heartbeat: 2007/05/05_18:55:52 info: all HA resource acquisition completed (standby).
heartbeat: 2007/05/05_18:55:52 ERROR: Ignored standby message 'done' from slavebox.domain in state 0
heartbeat: 2007/05/05_18:55:52 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2007/05/05_18:55:52 info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
heartbeat: 2007/05/05_18:55:52 info: mach_down takeover complete.
heartbeat: 2007/05/05_18:55:52 info: mach_down takeover complete for node masterbox.domain.
heartbeat: 2007/05/05_18:55:52 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
heartbeat: 2007/05/05_18:55:52 received ip-request-resp drbddisk::r0 OK yes
heartbeat: 2007/05/05_18:55:52 info: Acquiring resource group: slavebox.domain drbddisk::r0 Filesystem::/dev/drbd0::/mnt/nfsraid::ext3::noatime killnfsd sleep::2 nfs-common nfs-kernel-server mysql sleep::6 IPaddr::x.x.x.45/24/eth1 IPaddr::y.y.y.45/24/eth1
heartbeat: 2007/05/05_18:56:01 info: Running /etc/ha.d/resource.d/killnfsd  start
heartbeat: 2007/05/05_18:56:12 info: Running /etc/ha.d/resource.d/sleep 2 start
heartbeat: 2007/05/05_18:56:14 info: Running /etc/ha.d/resource.d/nfs-common  start
heartbeat: 2007/05/05_18:56:14 info: Running /etc/ha.d/resource.d/nfs-kernel-server  start
heartbeat: 2007/05/05_18:56:14 info: Running /etc/ha.d/resource.d/mysql  start
heartbeat: 2007/05/05_18:56:21 info: Running /etc/ha.d/resource.d/sleep 6 start
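
The double acquisition in the log above can be picked out mechanically.  A
rough Python sketch follows, assuming the log format shown here (a
"heartbeat:" prefix followed by a YYYY/MM/DD_HH:MM:SS timestamp).  The names
parse_entries and acquisition_times are hypothetical helpers, and any line
without that prefix is treated as a wrapped continuation of the previous
entry.

```python
import re
from datetime import datetime

# One log entry: "heartbeat: <timestamp> <message>"
ENTRY = re.compile(r"^heartbeat: (\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) (.*)$")


def parse_entries(log_text):
    """Return a list of (timestamp, message) tuples, folding wrapped lines."""
    entries = []
    for line in log_text.splitlines():
        match = ENTRY.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y/%m/%d_%H:%M:%S")
            entries.append((stamp, match.group(2).strip()))
        elif entries and line.strip():
            # No prefix: append the text to the entry that precedes it.
            stamp, message = entries[-1]
            entries[-1] = (stamp, message + " " + line.strip())
    return entries


def acquisition_times(log_text):
    """Return the timestamps at which a resource group acquisition began."""
    return [
        stamp
        for stamp, message in parse_entries(log_text)
        if "Acquiring resource group:" in message
    ]
```

More than one timestamp in the result for a single failover means the
resource group was acquired more than once.  In the log above the two
acquisitions start at 18:55:14 and 18:55:52.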

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
