[Linux-ha-dev] Re: [Linux-HA] kernel panic on disk failure

Todd Denniston Tue, 31 Jan 2006 06:34:41 -0800

Nate Reed wrote:

Here's a strange situation that has me worried about whether Heartbeat is doing the correct thing to fail over DRBD and related resources. I know nothing strange ever happens in the real world, but...


Ha. Ever try to push an RM8000 array to its SCSI bus limits?

As a test to see if I could break the cluster, I maliciously yanked out the RAID disks on the active node. As a result, there was a kernel panic on both machines.


Both machines panicked?
Or did one panic and the other halt with a message of "!DRBD! pri on
incon-degr"?

To recover, I had to reboot them both (and reinstall the OS on the first machine). I was not able to bring up the second machine as the primary ("drbdadm primary all", also "drbddisk" does the same thing I believe, so Heartbeat wouldn't start up), and all the data on the drbd partition was lost.
Have you seen anything like this happen before?


If in your drbd.conf disk{} section you have
on-io-error panic;

I would expect the node which lost its disk to panic and the other to keep on going. [I count on this with a couple of mildly reliable Promise RM8000 arrays, and have seen it do its job.]
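
For reference, a minimal sketch of where that setting lives in drbd.conf. The resource name "r0" is only a placeholder and the rest of the resource definition is left out; check the layout against the drbd.conf man page for your DRBD version.

```
resource r0 {            # placeholder resource name, not from this thread
  disk {
    on-io-error panic;   # node that loses its backing disk panics
  }
  # net{}, syncer{} and the per-host "on" sections omitted
}
```

With this set, only the node whose lower-level device reports the I/O error should go down; the peer is expected to stay up.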

We are a little concerned about drbd/heartbeat handling this situation
properly.  Ideally, the second node would be able to take over the drbd
resource, make itself primary and be able to access the data as usual.
Is it possible that the devices were inconsistent and in the process of
synchronizing, when Heartbeat tried to do "drbdadm primary all" on the
second machine?  Would this cause the panic situation?  A similar
situation is described on drbd.org:

I would not expect a panic, but I would expect drbd, on an inconsistent secondary, to not let you mount the device. Now what a current heartbeat would do, if it could not mount the device, I do not know.


"Note that even though you can do so, it is a bad idea to have the
SyncTarget be Primary, it will panic on a network failure, because it
then has no more access to good data ..."

Did Heartbeat cause the panic?  It seems that the best thing Heartbeat
could do in this situation would be to shutdown completely.  Comments?

On the primary I would have expected a panic from drbd, but the panic on the secondary machine is not expected.

Eventually, we're going to try to reproduce this error, but it takes a
lot of effort to setup this test and then rebuild the machines, so we
will not be able to provide much more information at this time.

As Alan said, set the ko-count and on-disconnect values in the net{} section of drbd.conf, and include your conf in the next email so we know what you are set to.
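
A rough sketch of what those two settings look like in the net{} section. The values shown are illustrative assumptions, not recommendations from this thread, and "r0" is again a placeholder resource name.

```
resource r0 {                 # placeholder resource name
  net {
    ko-count 4;               # assumed value: give up on a peer that keeps
                              # failing to complete a write within the timeout
    on-disconnect reconnect;  # assumed value: keep trying to reconnect
  }
  # disk{}, syncer{} and the per-host "on" sections omitted
}
```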

BTW which protocol are you using?
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
