I posted the "unintentional failover" message a week or so ago and no one had any ideas. I think I've figured it out and thought I'd post back in case it helps anyone else. Or so you guys can tell me if I'm nuts and my conclusions are all wrong. It turns out the gory details do matter in this case, so here's a quick rundown of what we had...
A 2-node cluster running drbd and heartbeat as an NFS server. I'll ignore drbd here because it appeared to work flawlessly throughout.

While our primary was humming along serving NFS, the secondary suffered a SCSI error that effectively froze its kernel for about 4 minutes. When the secondary woke back up and checked the clock, it realized it hadn't heard from the primary in all that time, so it wanted to take over. This all happened very quickly once networking between the two boxes came back online, and suddenly you had a split-brain moment. The primary (correctly) restarted heartbeat at that point and, upon coming back, saw the secondary as still in control, so it stayed out of the mix. That would have been a successful failover (for no good reason) except that the secondary never really took over. It was hosed because, as with most SCSI errors, the kernel had remounted the filesystem read-only. So heartbeat on the secondary couldn't complete the takeover.

To zoom back out to the big picture, this is actually a pretty interesting case. The secondary, due to this freak hardware condition, can't take over. But because its heartbeat procs are still running, it is able to convey that it's available, ready, and, in the split-brain moment we experienced, active. So what I think I've learned is that heartbeat is amazing at making a system take over when another box dies. But when the other box is in a weird state, unable to perform but still running heartbeat procs, you enter the unknown. In my case it turned out to be dangerous: this fluke condition (which has now happened to me twice) causes the secondary, once it unfreezes from the SCSI error and finds itself read-only, to effectively convince the primary that it should stop serving. The primary lets go of the IP, the secondary can't take it over, and suddenly you have a service outage.

We tried doctoring the resource scripts on the secondary to add additional checks when it attempts to go active.
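For context, the kind of check we bolted onto the resource scripts was roughly the following. This is an illustrative sketch, not our actual script; the probe path and directory are made up:

```shell
#!/bin/sh
# Sketch of a "can you write to the disk" pre-takeover check.
# The directory checked and the probe filename are illustrative.
can_write() {
    probe="$1/.ha_write_probe.$$"
    # Create and remove a probe file; if either step fails, the
    # filesystem is read-only (or too broken to trust with a takeover).
    touch "$probe" 2>/dev/null && rm -f "$probe" 2>/dev/null
}

# Example use in a resource script: bail out before going active.
dir="${1:-/tmp}"
if can_write "$dir"; then
    echo "$dir is writable, ok to proceed"
else
    echo "$dir is not writable, refusing takeover" >&2
    exit 1
fi
```

As I explain below, this turned out not to be enough, because heartbeat in the hosed state can't even run check scripts like this one.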
We thought a simple "can you write to the disk" check would let the secondary bomb out in its attempt to take over. But it turns out that running heartbeat procs on top of a read-only filesystem is such a deranged combination that all bets are off: heartbeat doesn't even successfully run the check scripts in that state. Who knows why. It can't log (/var/log), it can't create temp files (/tmp), so it gives up.

In the end our solution is something analogous to STONITH, except instead of Shoot The Other Node it becomes shoot thyself. The check-script testing illustrated that we couldn't rely on heartbeat executing anything in that specific hosed state, so we needed a process that was already running. In true band-aid fashion we wrote an idiot-simple perl daemon that sleeps most of the time, wakes up every 15 seconds, and checks whether the root filesystem is read-only. If it is, the script kills all the heartbeat procs. At that point heartbeat on the other box finds itself in the familiar situation of "my partner has died" and knows exactly what to do.

This covers the case of a hardware failure where the box doesn't completely disappear (it can still answer heartbeat pings). To the best of my knowledge, heartbeat was probably never intended to cover this corner case.

-- 
Rick Ennis
Sr. Manager, Technical Operations
Public Interactive
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
