Re: Transient disk failures not retried?

Max TenEyck Woodbury Thu, 18 Jan 2001 06:39:03 -0800
Dan Jones wrote:
> ...
> 
> Well, let's see, you have a hard (unrecovered) read error followed by
> a SCSI parity error (see below) [snipped]. So, you seem to have both a
> marginal drive and a marginal SCSI bus. Perhaps this example is unrelated
> to your original request for additional error recovery for transient
> errors. In my experience, being able to continue from a single, recoverable
> SCSI error is reasonable, but recovery from two SCSI errors, of any kind,
> is not. As you indicated, you are trying to make do with the equipment
> you have, but that doesn't change the natural laws. Since you need the
> trays for cooling, one option left is to reduce the SCSI transfer rate
> and give it a try. The maximum rate is normally a setting in the
> adapter's SCSI BIOS.

My understanding is that the term 'unrecovered' means that no recovery
was attempted and that a second attempt is needed to initiate the
recovery process. This may simply be a misunderstanding of the exact
meaning of the error codes on my part, a bad brief description of the
error, or a bad choice of codes on the disk maker's part, but an examination
of the list of all possible error codes seems to indicate that this error
means that further options are available and that the driver should
choose an appropriate action, like retrying the operation.
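
To make concrete what I mean by "choose an appropriate action", here is a
minimal sketch of a bounded retry policy keyed on the SCSI sense key. It is
not the actual Linux driver code. The sense key values are the standard
ones, but the names choose_action and read_with_retry, the RETRYABLE set,
and the limit of 3 attempts are assumptions made up for illustration.

```python
from enum import Enum

# Standard SCSI sense key values (subset).
NO_SENSE = 0x0
RECOVERED_ERROR = 0x1
MEDIUM_ERROR = 0x3
HARDWARE_ERROR = 0x4
UNIT_ATTENTION = 0x6
ABORTED_COMMAND = 0xB


class Action(Enum):
    SUCCESS = "success"
    RETRY = "retry"
    FAIL = "fail"


# Assumption for illustration: which sense keys are treated as possibly
# transient. A real driver may well draw this line differently.
RETRYABLE = {MEDIUM_ERROR, UNIT_ATTENTION, ABORTED_COMMAND}


def choose_action(sense_key, attempts_made, max_attempts=3):
    """Decide what to do after one attempt returned the given sense key."""
    if sense_key in (NO_SENSE, RECOVERED_ERROR):
        return Action.SUCCESS
    if sense_key in RETRYABLE and attempts_made < max_attempts:
        return Action.RETRY
    return Action.FAIL


def read_with_retry(read_once, max_attempts=3):
    """Call read_once (a hypothetical callable that performs one read and
    returns its sense key) until it succeeds or the attempts run out.
    Returns the final action and the number of attempts made."""
    attempts = 0
    while True:
        sense_key = read_once()
        attempts += 1
        action = choose_action(sense_key, attempts, max_attempts)
        if action is not Action.RETRY:
            return action, attempts
```

With a policy like this, a single transient media error costs one extra
read, while a persistent error still fails the drive after a bounded number
of attempts.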

As for the parity errors, I've seen this on other (older) systems
where the drive electronics are having problems and the bus itself
is in good shape. They happened to the spare (4th) drive, not to the
drive with the media error. They happened AFTER the point where a
retry on the original drive should have taken place. I've been through
the error recovery code with a fine-tooth comb and found at least one
bug in it (reported in another thread some time ago), so I think I know
what I should see if a retry was attempted, and it simply is not there.
Being suspicious of my own understanding, I asked for someone else to
check and either a) point out where the retry was attempted or b) agree
that it wasn't done and needs to be done, or c) explain why it should
not be done.

Your focus on the bus (and the swap trays' impact on the bus) is
unreasonable given the error pattern: bus errors would be most frequent
during the day, when the bus sees a lot of activity, rather than at
4 a.m., when the whole system is idle. I won't say 'forget
the bus', but let's look at the question I asked rather than looking
for excuses not to answer it. (I'm aware that the trays introduce
a stub on the bus in addition to the normal stubs from the drives'
internal wiring and increase the ringing and transient noise on
the edges of signals, but the error rate would be much higher and
the errors would occur at different times if this were a significant
problem.)

As for the BIOS, this is the second SCSI adapter and the machine
is an AXP (a.k.a. an Alpha), so the BIOS is not active and there is
no opportunity to set the bus speed with it.

...

The cause and handling of the parity errors are a secondary (but
probably still important) problem. If they need to be covered (and
they probably should be), let's do it after the question about
error recovery for more mundane errors has been closed.