On Tuesday 24 February 2009 16:11:25 Rick Ennis wrote:
> I posted the "unintentional failover" message a week or so ago and no one
> had any ideas.  I think I've figured it out and thought I'd post back in
> case it helps anyone else.  Or so you guys can tell me if I'm nuts and my
> conclusions are all wrong.  It turns out the gory details do matter in this
> case so here's a quick rundown of what we had...
>
> 2 node cluster running drbd and heartbeat as an NFS server.  I ignored drbd
> because it appeared to work flawlessly.
>
> While our primary was humming along serving NFS our secondary suffered a
> SCSI error that effectively froze the kernel for about 4 minutes.  When the
> secondary woke back up, checking the clock, it realized it hadn't heard
> from the primary in that amount of time.  

If box A is unresponsive for 4 minutes and box B _hasn't_ already fenced
it, then you need to re-tune your timeouts so that it does.

Self-checking scripts are great, but if the kernel never schedules them (for
whatever unlikely reason) then it's far better for the healthy node to make
the fencing decision, and for you to use a hardware fencing device
(iLO/DRAC/LOM/power, etc.).
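
To illustrate the timing argument, here is a minimal Python sketch. It is
not Heartbeat code: the names PeerMonitor, deadtime and simulate are made up
for illustration, and the 30 s and 300 s thresholds are arbitrary. It models
the healthy node counting seconds of silence from its peer and fencing once
the silence exceeds the threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeerMonitor:
    """The healthy node's view of its peer (hypothetical model)."""
    deadtime: float          # seconds of silence before the peer is fenced
    last_heard: float = 0.0  # time the last heartbeat arrived
    fenced: bool = False

    def heartbeat(self, now: float) -> None:
        self.last_heard = now

    def check(self, now: float) -> bool:
        """Fence the peer once it has been silent for longer than deadtime."""
        if not self.fenced and now - self.last_heard > self.deadtime:
            # A real cluster would fire the hardware fencing device here.
            self.fenced = True
        return self.fenced


def simulate(deadtime: float, freeze_start: int, freeze_len: int,
             end: int) -> Optional[int]:
    """Step one second at a time; return when the peer is fenced, or None."""
    monitor = PeerMonitor(deadtime=deadtime)
    for t in range(end):
        frozen = freeze_start <= t < freeze_start + freeze_len
        if not frozen:
            monitor.heartbeat(t)  # a frozen kernel sends nothing
        if monitor.check(t):
            return t
    return None


if __name__ == "__main__":
    # Peer freezes for 4 minutes (240 s) starting at t=100.
    print(simulate(deadtime=30, freeze_start=100, freeze_len=240, end=600))
    print(simulate(deadtime=300, freeze_start=100, freeze_len=240, end=600))
```

With the shorter threshold the frozen node is fenced well before it wakes
up. With a threshold longer than the freeze it is never fenced, so it comes
back and acts on its own stale view of the cluster.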

Mark.

-- 
Mark Watts BSc RHCE MBCS
Senior Systems Engineer
QinetiQ Applied Technologies
GPG Key: http://www.linux-corner.info/mwatts.gpg


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
