Here's a strange situation that has me worried about whether Heartbeat is 
doing the correct thing to fail over DRBD and related resources.  I know 
nothing strange ever happens in the real world, but...

As a test to see if I could break the cluster, I maliciously yanked out the 
RAID disks on the active node. As a result, there was a kernel panic on both 
machines.  To recover, I had to reboot them both (and reinstall the OS on the 
first machine).  I was not able to bring up the second machine as the primary: 
"drbdadm primary all" did not work, and since I believe "drbddisk" does the 
same thing, Heartbeat wouldn't start up either.  All the data on the DRBD 
partition was lost.

Have you seen anything like this happen before?  

We are a little concerned about drbd/heartbeat handling this situation
properly.  Ideally, the second node would be able to take over the drbd
resource, make itself primary and be able to access the data as usual. 
Is it possible that the devices were inconsistent and in the process of
synchronizing when Heartbeat tried to do "drbdadm primary all" on the
second machine?  Could that have caused the panic?  A similar
situation is described on drbd.org:

"Note that even though you can do so, it is a bad idea to have the
SyncTarget be Primary, it will panic on a network failure, because it
then has no more access to good data ..."

Did Heartbeat cause the panic?  It seems that the best thing Heartbeat
could do in this situation would be to shut down completely.  Comments?
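
To make the question more concrete, here is a rough sketch of the kind of
guard I have in mind before anything runs "drbdadm primary all".  It is not
what Heartbeat or drbddisk actually does.  The function name is made up, and
the sample status lines are only assumed to look like /proc/drbd output
(the field names differ between DRBD versions), so treat it as an
illustration of the idea and nothing more:

```python
import re

def safe_to_promote(status_line):
    """Hypothetical guard: decide from one /proc/drbd-style device line
    whether promoting this node to primary looks safe.

    Assumed fields: "cs:" is the connection state, "ld:" is the local disk
    state (older style), "ds:" is local/peer disk state (newer style).
    """
    fields = dict(re.findall(r"\b(cs|ld|ds):(\S+)", status_line))

    # No connection state at all: we do not understand the line, so refuse.
    if "cs" not in fields:
        return False

    # A node that is still receiving a resync must not become primary.
    if fields["cs"] == "SyncTarget":
        return False

    # Take the local disk state from "ld:" or from the first half of "ds:".
    local_disk = fields.get("ld") or fields.get("ds", "").split("/")[0]

    # Promote only when the local data is known to be good.
    return local_disk in ("Consistent", "UpToDate")
```

If a check like this returned False, the resource script would refuse to
promote and leave the node as secondary instead of forcing it to primary.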

Eventually, we're going to try to reproduce this error, but it takes a
lot of effort to set up this test and then rebuild the machines, so we
will not be able to provide much more information at this time.

Thanks,
Nate
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
