Hi,

On Tue, Feb 24, 2009 at 04:29:14PM +0000, Mark Watts wrote:
>
> On Tuesday 24 February 2009 16:11:25 Rick Ennis wrote:
> > I posted the "unintentional failover" message a week or so ago and no one
> > had any ideas. I think I've figured it out and thought I'd post back in
> > case it helps anyone else. Or so you guys can tell me if I'm nuts and my
> > conclusions are all wrong. It turns out the gory details do matter in this
> > case so here's a quick rundown of what we had...
> >
> > 2 node cluster running drbd and heartbeat as an NFS server. I ignored drbd
> > because it appeared to work flawlessly.
> >
> > While our primary was humming along serving NFS our secondary suffered a
> > SCSI error that effectively froze the kernel for about 4 minutes. When the
> > secondary woke back up, checking the clock, it realized it hadn't heard
> > from the primary in that amount of time.
>
> If the box A is unresponsive for 4 minutes, and the box B _hasn't_ already
> fenced it, then you need to re-tune your timeouts so that happens.

Yes. Fencing is the only way. I suppose that there was no
fencing/stonith in place at the time of the incident.

> Self-checking scripts are great, but if the kernel never schedules them (for
> whatever unlikely reason) then it's far better for the healthy node to make
> the fencing decision, and for you to use a hardware fencing device
> (iLO/DRAC/LOM/power etc).

Which is why self-checking scripts, or anything else that relies on
the state of the host, are not a good solution.

Thanks,

Dejan

> Mark.
>
> --
> Mark Watts BSc RHCE MBCS
> Senior Systems Engineer
> QinetiQ Applied Technologies
> GPG Key: http://www.linux-corner.info/mwatts.gpg
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
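
For anyone who wants the timeline from this thread spelled out, here is a
rough Python sketch of it. This is a toy model, not Heartbeat code: the
30-second DEADTIME, the timestamps, and all function names are assumptions
picked for illustration. It only shows why the healthy node is the one in a
position to act, and why a check running on the frozen node cannot help.

```python
# Toy model of the incident discussed above. Not Heartbeat code:
# DEADTIME and all timestamps are assumed values for illustration.

DEADTIME = 30  # seconds of silence before a peer is declared dead (assumed)


def peer_looks_dead(now, last_heard, deadtime=DEADTIME):
    """Return True if the peer has been silent for longer than deadtime."""
    return now - last_heard > deadtime


def healthy_node_view(freeze_start, freeze_end, step=10):
    """Times at which the healthy node would consider the frozen peer dead.

    The healthy node keeps running, so it evaluates the check every `step`
    seconds; its last heartbeat from the peer arrived at freeze_start.
    """
    return [t for t in range(freeze_start, freeze_end + 1, step)
            if peer_looks_dead(t, last_heard=freeze_start)]


def frozen_node_view(freeze_start, freeze_end):
    """What the frozen node concludes at the moment it wakes up.

    While frozen it ran no checks at all (the kernel scheduled nothing), so
    its last recorded heartbeat from the peer is also from freeze_start.
    """
    return peer_looks_dead(freeze_end, last_heard=freeze_start)


if __name__ == "__main__":
    start, end = 100, 340  # a freeze of about 4 minutes
    detections = healthy_node_view(start, end)
    print("healthy node first sees peer as dead at t =", detections[0])
    print("frozen node thinks peer is dead on wake-up:",
          frozen_node_view(start, end))
```

In this model the healthy node notices the silence at t = 140 and has about
200 seconds in which it could fence the frozen peer. If nothing fences it,
the peer wakes up at t = 340, sees 240 seconds of silence on its own clock,
and wrongly concludes that the other side is the dead one.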
