On Tuesday 24 February 2009 16:11:25 Rick Ennis wrote:
> I posted the "unintentional failover" message a week or so ago and no one
> had any ideas. I think I've figured it out and thought I'd post back in
> case it helps anyone else. Or so you guys can tell me if I'm nuts and my
> conclusions are all wrong. It turns out the gory details do matter in this
> case so here's a quick rundown of what we had...
>
> 2 node cluster running drbd and heartbeat as an NFS server. I ignored drbd
> because it appeared to work flawlessly.
>
> While our primary was humming along serving NFS our secondary suffered a
> SCSI error that effectively froze the kernel for about 4 minutes. When the
> secondary woke back up, checking the clock, it realized it hadn't heard
> from the primary in that amount of time.

If box A is unresponsive for 4 minutes and box B _hasn't_ already fenced it, then you need to re-tune your timeouts so that it does.

Self-checking scripts are great, but if the kernel never schedules them (for whatever unlikely reason) then it's far better for the healthy node to make the fencing decision, and for you to use a hardware fencing device (iLO/DRAC/LOM/power etc).

Mark.

-- 
Mark Watts BSc RHCE MBCS
Senior Systems Engineer
QinetiQ Applied Technologies
GPG Key: http://www.linux-corner.info/mwatts.gpg
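
To illustrate the timeout point above, here is a minimal Python sketch of the decision the healthy node should be making. It is not Heartbeat's code or API: the class `PeerMonitor`, the field `deadtime_s`, and the 30-second value are all hypothetical, chosen only to show that a 4-minute freeze should trip any sensibly tuned timeout long before the frozen node wakes up.

```python
from dataclasses import dataclass


@dataclass
class PeerMonitor:
    """Tracks heartbeats from a peer, as seen by the healthy node.

    All names here are illustrative assumptions, not Heartbeat settings.
    """
    deadtime_s: float          # silence (seconds) after which the peer is declared dead
    last_heartbeat_s: float    # timestamp of the last heartbeat received

    def record_heartbeat(self, now_s: float) -> None:
        # Called whenever a heartbeat packet arrives from the peer.
        self.last_heartbeat_s = now_s

    def should_fence(self, now_s: float) -> bool:
        # The healthy node decides: fence once the silence reaches deadtime.
        return (now_s - self.last_heartbeat_s) >= self.deadtime_s


# Hypothetical numbers: 30 s deadtime, peer last heard from at t=0.
monitor = PeerMonitor(deadtime_s=30.0, last_heartbeat_s=0.0)

# After 10 s of silence the peer is still considered alive.
still_alive = not monitor.should_fence(now_s=10.0)

# After a 4-minute (240 s) freeze the peer should already have been fenced.
fenced_after_freeze = monitor.should_fence(now_s=240.0)
```

The point of the sketch is only who makes the call: the check runs on the node whose kernel is still scheduling processes, so it does not depend on the frozen node noticing its own problem.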
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
