Nate Reed wrote:
Here's a strange situation that has me worried about whether Heartbeat is doing the correct thing to failover DRBD and related resources. I know nothing strange ever happens in the real world, but...

As a test to see if I could break the cluster, I maliciously yanked out the RAID disks on the active node. As a result, there was a kernel panic on both machines. To recover, I had to reboot them both (and reinstall the OS on the first machine). I was not able to bring up the second machine as the primary: "drbdadm primary all" failed, and since I believe "drbddisk" does the same thing, Heartbeat wouldn't start up either. All the data on the drbd partition was lost.

Have you seen anything like this happen before?
We are a little concerned about drbd/heartbeat handling this situation
properly.  Ideally, the second node would be able to take over the drbd
resource, make itself primary and be able to access the data as usual.

Is it possible that the devices were inconsistent and in the process of
synchronizing when Heartbeat tried to do "drbdadm primary all" on the
second machine?  Would this cause the panic situation?  A similar
situation is described on drbd.org:

"Note that even though you can do so, it is a bad idea to have the
SyncTarget be Primary, it will panic on a network failure, because it
then has no more access to good data ..."

Did Heartbeat cause the panic?  It seems that the best thing Heartbeat
could do in this situation would be to shut down completely.  Comments?

Eventually, we're going to try to reproduce this error, but it takes a
lot of effort to set up this test and then rebuild the machines, so we
will not be able to provide much more information at this time.

When using Heartbeat with DRBD, we typically recommend that you have resource stickiness enabled for DRBD resources, which is roughly the same as auto_failback off.

How did you have it configured?

Heartbeat doesn't panic the machine. It will, however, reboot it ungracefully if it has a resource that refuses to stop.

You probably want to enable ko_count in DRBD as well, and check if you told it to panic in this situation...
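
As a rough way to check both of those settings, here is a minimal Python sketch that scans drbd.conf-style text. It assumes the options are spelled "ko-count" (net section) and "on-io-error" (disk section), as I believe they are in DRBD 0.7-era configs, and that a ko-count of 0 means the feature is off. The sample config, the resource name r0, and the helper names are made up for illustration; this is not a real drbd.conf parser and it ignores which section an option sits in.

```python
import re

# Hypothetical sample, not a complete or recommended drbd.conf.
SAMPLE_CONF = """
resource r0 {
  protocol C;
  disk {
    on-io-error panic;   # what to do when the backing disk fails
  }
  net {
    # ko-count 4;        # commented out, so effectively unset
  }
}
"""


def strip_comments(text):
    """Drop everything after a '#' on each line."""
    return "\n".join(line.split("#", 1)[0] for line in text.splitlines())


def find_option(text, name):
    """Return the first value given for `name`, or None if it is absent."""
    pattern = r"(?<![\w-])" + re.escape(name) + r"\s+([^;\s]+)\s*;"
    match = re.search(pattern, strip_comments(text))
    return match.group(1) if match else None


def review(text):
    """Return warnings for the two settings discussed above."""
    warnings = []
    if find_option(text, "on-io-error") == "panic":
        warnings.append("on-io-error is set to panic")
    # Assumption: an absent or zero ko-count means the feature is disabled.
    if find_option(text, "ko-count") in (None, "0"):
        warnings.append("ko-count is not enabled")
    return warnings


if __name__ == "__main__":
    for warning in review(SAMPLE_CONF):
        print("WARNING:", warning)
```

Run against the sample, this flags both settings. Pointing review() at the contents of your own drbd.conf would show whether the panic you saw was something DRBD had been told to do.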


--
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
