It was noted on this list a few days ago that a SCSI controller
failure can blow up a RAID array if more than one of the array's
disks is connected to it. Well, it got me :-( :-( Here's the type of problem:
kernel: (scsi2:0:0:0) Parity error during Message-In phase.
kernel: scsi : aborting command due to timeout : pid 10859427, scsi2, channel 0, id 0, lun 0 0x08 18 3c e8 08 00
kernel: (scsi2:0:0:0) Aborting scb 71, flags 0x4
kernel: (scsi2:0:0:0) SCB is currently active. Waiting on completion.
kernel: scsi : aborting command due to timeout : pid 10859428, scsi2, channel 0, id 0, lun 0 0x08 1c 3c c0 08 00
etc. I got these errors a few times over a few days, and then at one
point the RAID code shouted "irrecoverable io error on ..." and the
machine was no longer useful. It didn't crash, but of course it wasn't
much use without the /home partition :-(
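Since the timeout messages came and went for days before the array
actually died, it may be worth tallying them per device to get some
warning. What follows is only a rough sketch: the pattern is modeled on
the lines quoted above, the function name count_timeouts is made up for
illustration, and other drivers or kernel versions may word the message
differently.

```python
import re
from collections import Counter

# Modeled on the timeout lines quoted above (an assumption: other drivers
# may format this message differently). The \s* lets the match survive a
# mail client wrapping the line before "lun".
TIMEOUT_RE = re.compile(
    r"aborting command due to timeout : pid \d+, "
    r"(scsi\d+), channel (\d+), id (\d+),\s*lun (\d+)"
)


def count_timeouts(log_text):
    """Count timeout aborts per (host, channel, id, lun) in a log excerpt."""
    counts = Counter()
    for match in TIMEOUT_RE.finditer(log_text):
        host, channel, target, lun = match.groups()
        counts[(host, int(channel), int(target), int(lun))] += 1
    return counts


sample = (
    "kernel: (scsi2:0:0:0) Parity error during Message-In phase.\n"
    "kernel: scsi : aborting command due to timeout : pid 10859427, "
    "scsi2, channel 0, id 0, lun 0 0x08 18 3c e8 08 00\n"
    "kernel: (scsi2:0:0:0) Aborting scb 71, flags 0x4\n"
    "kernel: scsi : aborting command due to timeout : pid 10859428, "
    "scsi2, channel 0, id 0, lun 0 0x08 1c 3c c0 08 00\n"
)

for device, hits in count_timeouts(sample).items():
    print(device, hits)
```

Feeding it the system log and looking at which scsiN host the hits pile
up on should at least show whether a single controller is involved.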
On reboot, the disks on that controller were out of sync with the rest
of the array. Since two of the array's disks were on that controller,
and a parity array can run with only one member missing, the kernel
refused to start the array...
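To see why two stale disks are fatal where one would not be, here is a
toy model of a single parity stripe. It is a sketch only, assuming
plain XOR parity as in RAID 4/5; it is not the kernel's code, and the
names xor_blocks and rebuild are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe):
    """Fill in the missing member (marked None) of a data+parity stripe.

    Raises ValueError when more than one member is missing, since a
    single parity block can only stand in for a single lost block.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError(
            "%d members missing; parity can restore only one" % len(missing)
        )
    restored = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        restored[missing[0]] = xor_blocks(survivors)
    return restored


# Three data blocks plus their parity block: a four-disk stripe.
data = [b"home", b"mail", b"spoo"]
stripe = data + [xor_blocks(data)]

# One disk gone: the XOR of the survivors gives the lost block back.
one_lost = [stripe[0], None, stripe[2], stripe[3]]
print(rebuild(one_lost) == stripe)

# Two disks gone (both on the same controller, say): nothing to be done.
two_lost = [None, None, stripe[2], stripe[3]]
try:
    rebuild(two_lost)
except ValueError as err:
    print("cannot start:", err)
```

With one member missing the XOR of the rest is exactly the lost block;
with two missing there is one equation and two unknowns, which is the
situation a controller carrying two members puts you in.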
The "fix" is to have an up-to-date raidtab file and do a
"mkraid --really-force /dev/md<number>"
This will NOT erase the filesystem; it'll just rebuild the RAID
superblocks and start parity reconstruction.
The downside is that e2fsck found a lot of problems afterwards, and
several files were lost. Several others were corrupted: I found pieces
of mail I sent years ago in the trashed inbox of another user... So
much for privacy :-)
I don't see an affordable solution to this. One can always use a
single controller, but with decent disks it'll be saturated. And
with a single disk per controller, you run out of slots.
I'm not sure the problem in this case is the controller itself. I just
put a terminator close to the controller and the problem hasn't come
back yet. The controller SHOULD have terminated the bus though...