To my knowledge, hot-swap drives are no different from normal drives, except that you
can pause the SCSI bus before pulling them out or putting them in.
The BusLogic driver might be doing that. The block of data might already have been
written by the RAID driver but be stuck in your SCSI
card's buffer, so it may be the SCSI card, not the kernel, that is trying to write the block.
The kernel should mark the drive bad if it can't write to it. I'm not sure how much
communication there is between the RAID driver and the SCSI
driver.
You would have to look at the docs for your SCSI card to see if you can change how it
handles resets, or look at the first few pages of the driver's C source.
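
As for the per-target reset limit you asked about, here is a rough sketch of what such a policy could look like. This is only an illustration in Python, not how the kernel, the RAID driver or the BusLogic driver actually behaves; the class name, the threshold and the time window are all made up. It also assumes each reset can be attributed to a single target, which is not a given, since a SCSI bus reset hits every device on the bus.

```python
from collections import defaultdict, deque


class ResetTracker:
    """Hypothetical policy: flag a target after too many resets in a time window."""

    def __init__(self, max_resets=3, window_seconds=600):
        self.max_resets = max_resets      # resets tolerated inside the window (assumed value)
        self.window = window_seconds      # length of the sliding window (assumed value)
        self.events = defaultdict(deque)  # target id -> timestamps of recent resets

    def record_reset(self, target, now):
        """Record a reset for `target` at time `now` (seconds).

        Returns True if the target has exceeded the tolerated number of
        resets within the window and should be treated as faulty.
        """
        recent = self.events[target]
        recent.append(now)
        # Drop resets that have fallen out of the sliding window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        return len(recent) > self.max_resets


tracker = ResetTracker(max_resets=3, window_seconds=600)
for t in (0, 100, 200, 300):
    faulty = tracker.record_reset(target=1, now=t)
print(faulty)  # True: the fourth reset falls inside the 600 s window
```

With a window like this, one reset a week would never trip the limit, while a burst of resets in a short time would.
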
One problem with software RAID is that most of us have both drives (in the case of
a mirror) on the same SCSI bus. This causes
problems when the card tries to contact a dead drive: a SCSI reset will
reset the whole SCSI bus and stop the working drive from
talking to the host. If the driver goes silly and sends many SCSI resets down the
bus, your machine will lock up, as the RAID driver won't
be able to talk to the working drive.
This happened to me while testing my mirror, though only a few times. There was
nothing I could do about it, as the only spare SCSI bus I
had was much slower than the one the drive was on :-(
A perfect setup would be to have each side of the mirror on a different SCSI bus/SCSI
card. That way, no matter what happened to either side,
the kernel would keep running, even if one of the two SCSI
cards died.
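
To check whether both halves of a mirror hang off the same host adapter, something like the following Python sketch could be run over the boot messages. It only understands the "Detected scsi disk ... at scsiN" line format from the boot log quoted below; the function names are mine, and treating sda and sdb as the two mirror members is an assumption based on that log.

```python
import re
from collections import defaultdict

# Matches boot lines of the form quoted in this thread, e.g.
# "Detected scsi disk sda at scsi0, channel 0, id 0, lun 0"
DETECT_RE = re.compile(
    r"Detected scsi disk (\w+) at (scsi\d+), channel (\d+), id (\d+), lun (\d+)"
)


def disks_by_host(log_lines):
    """Return {host: [disk, ...]} for every 'Detected scsi disk' line."""
    hosts = defaultdict(list)
    for line in log_lines:
        match = DETECT_RE.search(line)
        if match:
            disk, host = match.group(1), match.group(2)
            hosts[host].append(disk)
    return dict(hosts)


def shared_hosts(log_lines, mirror_disks):
    """Return the hosts that carry more than one member of the mirror."""
    shared = {}
    for host, disks in disks_by_host(log_lines).items():
        members = [d for d in disks if d in mirror_disks]
        if len(members) > 1:
            shared[host] = members
    return shared


boot_log = [
    "Jun 1 15:35:52 kernel: Detected scsi disk sda at scsi0, channel 0, id 0, lun 0",
    "Jun 1 15:35:52 kernel: Detected scsi disk sdb at scsi0, channel 0, id 1, lun 0",
]
# Assumption: sda and sdb are the two halves of the mirror.
print(shared_hosts(boot_log, {"sda", "sdb"}))  # {'scsi0': ['sda', 'sdb']}
```

An empty result would mean no single host carries more than one member of the mirror.
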
Sorry for not giving you a quick fix:-)
Thorsten Schwander wrote:
> Hi,
>
> we run a root raid-1 with two SCSI disks (details below) on a red-hat 5.2
> system with kernel 2.2.6 and raidpatches/tools from April. Previously we used
> kernels 2.2.4, 2.2.1, 2.1.131 which showed the same behavior.
>
> One of the disks died over a period of several months. Initially there was a
> SCSI reset about once per week in the kernel logs, the machine would stay up.
> During longer periods of uptime we saw several (5-10) scsi resets in the
> logs. Every other week the machine would completely freeze with messages
> about SCSI resets on console, nothing logged.
>
> We changed the SCSI controller, checked the cables, downgraded the buslogic
> firmware (there are reports of problems with resets under high I/O for the
> initial firmware version) etc.. The disk seemed to work fine and initially we
> didn't suspect it.
>
> Questions:
>
> The disks are hot-swappable, does this exacerbate the situation because the
> disk errors are somewhat hidden from the kernel?
>
> Is the buslogic driver too smart trying to reset itself instead of detecting
> a target as faulty?
>
> Shouldn't the kernel decide to mark a drive as bad after a certain number of
> resets for the same target? I append the logs from the final stages before
> the disk died. These show the extreme case of multiple resets in a relatively
> short time.
>
> Is the number of tolerated resets a configurable feature? If not, could this
> be implemented?
>
> Thanks
> T. Schwander
> (http://arXiv.org/)
>
> boot info:
> ======================================================================
>
> Jun 1 15:35:52 kernel: scsi0 : BusLogic BT-958
> Jun 1 15:35:52 kernel: scsi1 : BusLogic BT-958
> Jun 1 15:35:52 kernel: SCSI : 2 hosts.
> Jun 1 15:35:52 kernel: Vendor: IBM Model: DDRS-34560W Rev: S97B
> Jun 1 15:35:52 kernel: Type: Direct-Access ANSI SCSI revision: 02
> Jun 1 15:35:52 kernel: Detected scsi disk sda at scsi0, channel 0, id 0, lun 0
> Jun 1 15:35:52 kernel: Vendor: IBM Model: DDRS-34560W Rev: S97B
> Jun 1 15:35:52 kernel: Type: Direct-Access ANSI SCSI revision: 02
> Jun 1 15:35:52 kernel: Detected scsi disk sdb at scsi0, channel 0, id 1, lun 0
> Jun 1 15:35:52 kernel: scsi0: Target 0: Queue Depth 28, Wide Synchronous at 20.0 MB/sec, offset 15
> Jun 1 15:35:52 kernel: scsi0: Target 1: Queue Depth 28, Wide Synchronous at 20.0 MB/sec, offset 15
>
> kernel messages
> ======================================================================
<SNIP ERROR MESSAGES>