I posted the "unintentional failover" message a week or so ago and no one had 
any ideas.  I think I've figured it out and thought I'd post back in case it helps anyone 
else.  Or so you guys can tell me if I'm nuts and my conclusions are all wrong.  It turns 
out the gory details do matter in this case so here's a quick rundown of what we had...

Two-node cluster running drbd and heartbeat as an NFS server.  I'll mostly ignore drbd
because it appeared to work flawlessly.

While our primary was humming along serving NFS, our secondary suffered a SCSI
error that effectively froze the kernel for about 4 minutes.  When the
secondary woke back up and checked the clock, it realized it hadn't heard from
the primary in all that time.  So it wanted to take over.  This all
happened very quickly as networking between the two boxes came back online, and
suddenly you had a split-brain moment.  The primary restarted heartbeat
(correctly) at that point and, upon coming back, saw the secondary as still in
control.  So it stayed out of the mix.  That would have been a successful
failover (for no good reason) except that the secondary never really took over.
 It was hosed because, as with most SCSI errors, the kernel had remounted the
filesystem read-only.  So heartbeat on the secondary couldn't complete the takeover.

To zoom back out to the big picture, this is actually a pretty interesting case.
 The secondary, due to this freak hardware condition, can't take over.
But because the heartbeat procs are still running, it is able to convey that it's
available, ready, and, in the split-brain moment we experienced,
active.  So what I think I've learned is that heartbeat is amazing at making a
system take over when another box dies.  But when the other box is in a weird
state, unable to perform but still running heartbeat procs, you enter the
unknown.  In my case it turned out to be outright dangerous, because this fluke
condition (which has happened to me twice now) causes the secondary, once it
unfreezes from the SCSI error and finds itself read-only, to effectively convince
the primary that it should stop serving.  The primary lets go of the IP, the
secondary can't take it over, and suddenly you have a service outage.

We tried doctoring the resource scripts on the secondary to add extra checks when it
attempts to go active.  We thought a simple "can you write to the disk" check
would let the secondary bomb out of its attempted takeover.  But it turns
out that running heartbeat procs on top of a read-only filesystem is such a deranged
combination that all bets are off.  Heartbeat doesn't even successfully run the check
scripts in that state.  It can't log (/var/log).  It can't create temp
files (/tmp).  So it gives up.
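(For reference, the check itself was nothing fancy.  Ours lived in a shell resource script; this is an equivalent Python sketch, with function and path names of my own choosing, that detects a read-only mount by actually attempting a write:)

```python
import os
import tempfile

def disk_is_writable(path="/"):
    """Try to create and remove a scratch file under `path`.
    On a filesystem remounted read-only this raises OSError (EROFS),
    so we report False instead of pretending the disk is fine."""
    try:
        fd, name = tempfile.mkstemp(dir=path)
        os.close(fd)
        os.unlink(name)
        return True
    except OSError:
        return False
```

The irony, of course, is that heartbeat never got far enough to run a check like this.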

In the end our solution is something analogous to STONITH, except instead of shooting
The Other Node, it's shoot thyself.  The check-script testing showed that we couldn't
rely on heartbeat executing anything in that specific hosed state.  So we needed a
process that was already running.  In true band-aid fashion we wrote an idiot-simple perl
daemon that sleeps most of the time, wakes up every 15 seconds, and checks whether the
root filesystem is read-only.  If it is, the script kills all the heartbeat procs.  At
that point heartbeat on the other box finds itself in the familiar situation of "my
partner has died" and knows exactly what to do.  This covers the case of a hardware
failure where the box doesn't completely disappear (it can still answer
heartbeat pings).  To the best of my knowledge, heartbeat was probably never intended to
cover this corner case.
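For anyone who wants to build the same band-aid: the logic fits in a handful of lines.  Ours was perl; here's an illustrative Python sketch.  The killall invocation and the 15-second interval are just what we happened to use, not gospel:

```python
"""Sketch of the "shoot thyself" watchdog: if / goes read-only,
kill heartbeat so the peer sees a plain dead partner and fails
over cleanly."""
import subprocess
import time

def root_is_readonly(mounts_text):
    """True if the / entry in /proc/mounts-style text carries the 'ro' option."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device mountpoint fstype options dump pass
        if len(fields) >= 4 and fields[1] == "/":
            return "ro" in fields[3].split(",")
    return False

def watch(interval=15):
    """Sleep most of the time; wake up, check the root fs, shoot thyself."""
    while True:
        with open("/proc/mounts") as f:
            if root_is_readonly(f.read()):
                subprocess.call(["killall", "-9", "heartbeat"])
                return
        time.sleep(interval)

# To run it for real: watch()
```

The one non-negotiable design point is that this runs as an already-resident daemon: as the check-script fiasco showed, you can't count on forking anything new once the filesystem has gone read-only.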

--
Rick Ennis
Sr. Manager, Technical Operations
Public Interactive
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems