Nate Reed wrote:
Here's a strange situation that has me worried about whether Heartbeat is doing the correct thing to failover DRBD and related resources. I know nothing strange ever happens in the real world, but...

Ha. Ever try to push an RM8000 array to its SCSI bus limits?


As a test to see if I could break the cluster, I maliciously yanked out the RAID disks on the active node. As a result, there was a kernel panic on both machines.

Did both machines panic?
Or did one panic and the other halt with a message of "!DRBD! pri on 
incon-degr"?

To recover, I had to reboot them both (and reinstall the OS on the first machine). I was not able to bring up the second machine as the primary with "drbdadm primary all" (I believe "drbddisk" does the same thing, so Heartbeat wouldn't start up), and all the data on the drbd partition was lost.

Have you seen anything like this happen before?

If in your drbd.conf disk{} section you have
on-io-error panic;
I would expect the node which lost its disk to panic and the other to keep on going. [I count on this with a couple of mildly reliable Promise RM8000 arrays, and have seen it do its job.]
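
For reference, a minimal sketch of where that setting lives, assuming DRBD 0.7-style drbd.conf syntax. The resource name r0 is a placeholder of mine, not something taken from your setup:

```
resource r0 {
  disk {
    # Panic this node when its backing device reports an I/O error,
    # so the peer can take over instead of serving from a bad disk.
    on-io-error panic;
  }
}
```

The rest of the resource definition (device, disk, address and so on) is left out here on purpose.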



We are a little concerned about drbd/heartbeat handling this situation
properly.  Ideally, the second node would be able to take over the drbd
resource, make itself primary and be able to access the data as usual. Is it possible that the devices were inconsistent and in the process of
synchronizing, when Heartbeat tried to do "drbdadm primary all" on the
second machine?  Would this cause the panic situation?  A similar
situation is described on drbd.org:

I would not expect a panic, but I would expect drbd, on an inconsistent secondary, to not let you mount the device. Now what a current heartbeat would do, if it could not mount the device, I do not know.


"Note that even though you can do so, it is a bad idea to have the
SyncTarget be Primary, it will panic on a network failure, because it
then has no more access to good data ..."

Did Heartbeat cause the panic?  It seems that the best thing Heartbeat
could do in this situation would be to shut down completely.  Comments?


On the primary I would have expected a panic from drbd, but a panic on the secondary machine is not expected.

Eventually, we're going to try to reproduce this error, but it takes a
lot of effort to set up this test and then rebuild the machines, so we
will not be able to provide much more information at this time.

As Alan said, set the ko-count and on-disconnect values in the net{} section of drbd.conf, and include your conf in the next email so we know what you are set to.
BTW which protocol are you using?
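
As a rough sketch of what I mean, assuming DRBD 0.7-style syntax: the resource name r0 and both values below are illustrative placeholders, not recommendations for your hardware, so check them against the drbd.conf man page for your version.

```
resource r0 {
  net {
    # Illustrative value: how many timeouts a peer may miss on a write
    # before it is dropped from the connection.
    ko-count 4;
    # Illustrative value: what to do when the connection to the peer is lost.
    on-disconnect reconnect;
  }
}
```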
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
