Re: scsi resets/feature request

David Robinson Mon, 7 Jun 1999 16:10:41 -0700
To my knowledge, hot-swap drives are no different from normal drives, except
that you can pause the SCSI bus before pulling them out or putting them in.

The buslogic driver might be doing that. The block of data might already have
been written by the RAID driver but be stuck in your SCSI card's buffer, so it
might be the SCSI card, not the kernel, that is trying to write the block.

The kernel should mark the drive bad if it can't write to it. I'm not sure how
much communication there is between the RAID driver and the SCSI driver.
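
To make the "mark it bad after a certain number of resets" idea from the quoted
message concrete, here is a minimal sketch of such a policy. It is only an
illustration of the bookkeeping, not how the kernel, the RAID driver, or the
BusLogic driver actually behave. The class name, the threshold of 5 resets and
the one-hour window are all assumptions made up for the example.

```python
from collections import defaultdict


class ResetTracker:
    """Hypothetical policy: count resets per (host, target) in a time window."""

    def __init__(self, max_resets=5, window_seconds=3600.0):
        # Both limits are illustrative assumptions, not real kernel tunables.
        self.max_resets = max_resets
        self.window_seconds = window_seconds
        self._events = defaultdict(list)  # (host, target) -> reset timestamps
        self.faulty = set()               # targets already declared bad

    def record_reset(self, host, target, timestamp):
        """Record one reset; return True once the target counts as faulty."""
        key = (host, target)
        cutoff = timestamp - self.window_seconds
        # Keep only the resets that are still inside the window.
        recent = [t for t in self._events[key] if t > cutoff]
        recent.append(timestamp)
        self._events[key] = recent
        if len(recent) >= self.max_resets:
            self.faulty.add(key)
        return key in self.faulty
```

With something like this, one reset a week would never trip the limit, while a
burst of resets for the same target in a short time would.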

You would have to look at the docs for your SCSI card to see if you can change
the reset behaviour, or look at the first few pages of the driver's C source.

One problem with software RAID is that most of us have the two drives (in the
case of a mirror) on the same SCSI bus. This causes problems when the card
tries to contact a dead drive: a SCSI reset resets the whole SCSI bus and stops
the working drive from talking to the host. If the driver goes silly and sends
many SCSI resets down the bus, your machine will lock up, because the RAID
driver won't be able to talk to the other, working drive.
This happened to me in a test of my mirror. It only happened a few times. There
was nothing I could do about it, as the only spare SCSI bus I had was much
slower than the one the drive was on :-(

A perfect setup would have each side of the mirror on a different SCSI bus and
SCSI card. That way, no matter what happened to either side, the kernel would
keep running, even if one of the two SCSI cards died.
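
As a rough way to check this, the sketch below reads "Detected scsi disk" lines
in the format shown in the quoted boot log and reports whether the mirror
members share a bus. The function names are made up for the example, and it
assumes the log format is exactly as quoted below.

```python
import re
from collections import defaultdict

# Matches boot-log lines such as:
#   Detected scsi disk sda at scsi0, channel 0, id 0, lun 0
DETECT_RE = re.compile(
    r"Detected scsi disk (?P<disk>sd[a-z]+) at scsi(?P<host>\d+), "
    r"channel (?P<channel>\d+), id (?P<id>\d+), lun (?P<lun>\d+)"
)


def disks_by_bus(log_text):
    """Group detected disks by (host, channel), i.e. by SCSI bus."""
    buses = defaultdict(list)
    for match in DETECT_RE.finditer(log_text):
        bus = (int(match["host"]), int(match["channel"]))
        buses[bus].append(match["disk"])
    return dict(buses)


def mirror_shares_bus(log_text, members):
    """Return True if two or more of the mirror members sit on the same bus."""
    wanted = set(members)
    return any(
        len(wanted.intersection(disks)) > 1
        for disks in disks_by_bus(log_text).values()
    )
```

For the boot info quoted below, sda and sdb are both on scsi0, channel 0, so
mirror_shares_bus(log, ["sda", "sdb"]) would be true: a reset on that bus stops
both halves of the mirror.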

Sorry for not giving you a quick fix:-)


Thorsten Schwander wrote:

> Hi,
>
> we run a root raid-1 with two SCSI disks (details below) on a red-hat 5.2
> system with kernel 2.2.6 and raidpatches/tools from April. Previously we used
> kernels 2.2.4, 2.2.1, 2.1.131 which showed the same behavior.
>
> One of the disks died over a period of several months. Initially there was a
> SCSI reset about once per week in the kernel logs, the machine would stay up.
> During longer periods of uptime we saw several (5-10) scsi resets in the
> logs.  Every other week the machine would completely freeze with messages
> about SCSI resets on console, nothing logged.
>
> We changed the SCSI controller, checked the cables, downgraded the buslogic
> firmware (there are reports of problems with resets under high I/O for the
> initial firmware version) etc.. The disk seemed to work fine and initially we
> didn't suspect it.
>
> Questions:
>
> The disks are hot-swappable, does this exacerbate the situation because the
> disk errors are somewhat hidden from the kernel?
>
> Is the buslogic driver too smart trying to reset itself instead of detecting
> a target as faulty?
>
> Shouldn't the kernel decide to mark a drive as bad after a certain number of
> resets for the same target? I append the logs from the final stages before
> the disk died. These show the extreme case of multiple resets in a relatively
> short time.
>
> Is the number of tolerated resets a configurable feature? If not, could this
> be implemented?
>
> Thanks
> T. Schwander
> (http://arXiv.org/)
>
> boot info:
> ======================================================================
>
> Jun  1 15:35:52 kernel: scsi0 : BusLogic BT-958
> Jun  1 15:35:52 kernel: scsi1 : BusLogic BT-958
> Jun  1 15:35:52 kernel: SCSI : 2 hosts.
> Jun  1 15:35:52 kernel:   Vendor: IBM       Model: DDRS-34560W       Rev: S97B
> Jun  1 15:35:52 kernel:   Type:   Direct-Access                      ANSI SCSI revision: 02
> Jun  1 15:35:52 kernel: Detected scsi disk sda at scsi0, channel 0, id 0, lun 0
> Jun  1 15:35:52 kernel:   Vendor: IBM       Model: DDRS-34560W       Rev: S97B
> Jun  1 15:35:52 kernel:   Type:   Direct-Access                      ANSI SCSI revision: 02
> Jun  1 15:35:52 kernel: Detected scsi disk sdb at scsi0, channel 0, id 1, lun 0
> Jun  1 15:35:52 kernel: scsi0: Target 0: Queue Depth 28, Wide Synchronous at 20.0 MB/sec, offset 15
> Jun  1 15:35:52 kernel: scsi0: Target 1: Queue Depth 28, Wide Synchronous at 20.0 MB/sec, offset 15
>
> kernel messages
> ======================================================================

<SNIP ERROR MESSAGES>