Hi,

On Tue, Feb 24, 2009 at 04:29:14PM +0000, Mark Watts wrote:
> 
> On Tuesday 24 February 2009 16:11:25 Rick Ennis wrote:
> > I posted the "unintentional failover" message a week or so ago and no one
> > had any ideas.  I think I've figured it out and thought I'd post back in
> > case it helps anyone else.  Or so you guys can tell me if I'm nuts and my
> > conclusions are all wrong.  It turns out the gory details do matter in this
> > case so here's a quick rundown of what we had...
> >
> > 2 node cluster running drbd and heartbeat as an NFS server.  I ignored drbd
> > because it appeared to work flawlessly.
> >
> > While our primary was humming along serving NFS our secondary suffered a
> > SCSI error that effectively froze the kernel for about 4 minutes.  When the
> > secondary woke back up, checking the clock, it realized it hadn't heard
> > from the primary in that amount of time.  
> 
> If box A is unresponsive for 4 minutes and box B _hasn't_ already 
> fenced it, then you need to re-tune your timeouts so that it does.

Yes. Fencing is the only way. I suppose that there was no
fencing/stonith in place at the time of the incident.

> Self-checking scripts are great, but if the kernel never schedules them (for 
> whatever unlikely reason) then it's far better for the healthy node to make 
> the fencing decision, and for you to use a hardware fencing device
> (iLO/DRAC/LOM/power etc).

Which is why self-checking scripts, or anything else that relies on
the state of the host, are not a good solution.
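
To illustrate the point about timeouts, here is a toy model in Python.
It is only a sketch, not Heartbeat's actual logic: the names
PeerMonitor and simulate, the 2 second keepalive and the deadtime
values are all made up for the example. The healthy node watches for
heartbeats and makes the fencing decision itself once the peer has
been silent for longer than deadtime, so nothing depends on the frozen
node running any code.

```python
from dataclasses import dataclass


@dataclass
class PeerMonitor:
    """Healthy node's view of its peer (hypothetical, for illustration)."""
    deadtime: float           # seconds of silence before the peer is declared dead
    last_heard: float = 0.0   # time the last heartbeat was received
    fenced: bool = False

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat received from the peer at time `now`."""
        self.last_heard = now

    def tick(self, now: float) -> bool:
        """Return True once the peer has been silent for longer than deadtime."""
        if not self.fenced and now - self.last_heard > self.deadtime:
            self.fenced = True  # a real cluster would call its STONITH device here
        return self.fenced


def simulate(freeze_start, freeze_len, deadtime, keepalive=2.0):
    """Return the time at which the frozen peer gets fenced, or None if never."""
    monitor = PeerMonitor(deadtime=deadtime)
    end = freeze_start + freeze_len
    t = 0.0
    while t <= end:
        frozen = freeze_start <= t < end
        if not frozen:
            monitor.heartbeat(t)  # a frozen peer sends nothing
        if monitor.tick(t):
            return t
        t += keepalive
    return None
```

With a 240 second freeze starting at t=10, simulate(10.0, 240.0, 30.0)
fences the peer well before it wakes up, while
simulate(10.0, 240.0, 300.0) returns None: the timeout is longer than
the freeze, so the node comes back unfenced and acts on its stale view
of the cluster.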

Thanks,

Dejan


> 
> Mark.
> 
> -- 
> Mark Watts BSc RHCE MBCS
> Senior Systems Engineer
> QinetiQ Applied Technologies
> GPG Key: http://www.linux-corner.info/mwatts.gpg



> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
