I am running raid0 over raid1. Two striped pairs mirrored. Two internal
scsi drives and two external drives - each in their own case. 4 separate
scsi controllers (actually 2 dual channel controllers).

They were rewiring the lab where the raid system exists and decided that
they needed to move it. So being cabling guys, they carefully moved the
tower and two external scsi boxes. Unfortunately, they disconnected one
of the external scsi drives from the controller which brought the linux
box to a grinding halt with an error message from the controller card
(aic7xxx driver). The error message indicated that scsi0 had timed out
and it was trying harder to reconnect. 

I could not log in to shut down or reboot. I tried disconnecting the
power to the UPS, hoping that the UPS daemon would still work and shut
the box down, but no luck. I then tried switching the scsi0 drive off
and on to see if that would help - no luck.

So next, I tried the three finger salute and got nothing. So off went
the power. When I turned it back on, the raid rebuilt itself for about
45 minutes and then I could log on in a "crippled" mode while it
finished rebuilding and fixing the raid devices, which took about
another 30 minutes.

Voila, it was back to normal! Fantastic. Kudos to you raid developers!

Now for the questions: If a hard-drive fails like in this case,
shouldn't only half the mirror be affected? Couldn't the system still
function on the good half? Does this look like an aic7xxx driver
problem? Why would the aic7xxx driver's failure to communicate with one
drive knock out the complete system?

I would think that if a drive failed in a mirrored raid, the system
would realize this, shut down the bad half of the mirror, send an
error message to everyone (root?) and keep functioning on the good half
of the mirror.
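
To make the "realize this and send an error message" part concrete,
here is a rough sketch of the kind of check I have in mind. It assumes
the md status text uses member markers such as [UU] (all members up)
and [U_] (one member missing), as /proc/mdstat does. The function name
degraded_arrays and the sample text are made up for illustration; the
sample is not output from my machine.

```python
import re

# Hypothetical sample in the style of /proc/mdstat (illustration only).
SAMPLE_MDSTAT = """\
Personalities : [raid0] [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks [2/2] [UU]
md1 : active raid1 sdd1[1](F) sdc1[0]
      1048512 blocks [2/1] [U_]
unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status marker shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # A line such as "md1 : active raid1 ..." starts a new array entry.
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The member-status marker ([UU], [U_], ...) ends the "blocks" line.
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded


if __name__ == "__main__":
    for name in degraded_arrays(SAMPLE_MDSTAT):
        print(f"WARNING: {name} is running degraded")
```

On a live box the text would presumably come from reading /proc/mdstat
instead of the sample string, and the warning could be mailed to root
rather than printed.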

TIA
