Dan Jones wrote:
> ...
>
> Well, let's see, you have a hard (unrecovered) read error followed by
> a SCSI parity error (see below) [snipped]. So, you seem to have both a
> marginal drive and a marginal SCSI bus. Perhaps this example is unrelated
> to your original request for additional error recovery for transient
> errors. In my experience, being able to continue from a single, recoverable
> SCSI error is reasonable, but recovery from two SCSI errors, of any kind,
> is not. As you indicated, you are trying to make do with the equipment
> you have, but that doesn't change the natural laws. Since you need the
> trays for cooling, one option left is to reduce the SCSI transfer rate
> and give it a try. The maximum rate is normally a setting in the
> adapter's SCSI BIOS.

My understanding is that the term 'unrecovered' means that no recovery was attempted and that a second attempt is needed to initiate the recovery process. This may simply be a misunderstanding of the exact meaning of the error codes on my part, a poor brief description of the error, or a bad choice of codes on the disk maker's part, but an examination of the list of all possible error codes seems to indicate that this error means further options are available and that the driver should choose an appropriate action, like retrying the operation.

As for the parity errors, I've seen this on other (older) systems where the drive electronics are having problems and the bus itself is in good shape. They happened to the spare (4th) drive, not to the drive with the media error. They happened AFTER the point where a retry on the original drive should have taken place.

I've been through the error recovery code with a fine-tooth comb and found at least one bug in it (reported in another thread some time ago), so I think I know what I should see if a retry was attempted, and it simply is not there.
Being suspicious of my own understanding, I asked for someone else to check and either a) point out where the retry was attempted, b) agree that it wasn't done and needs to be done, or c) explain why it should not be done.

Your focus on the bus (and the swap trays' impact on the bus) is unreasonable given the pattern of errors: a bus problem would produce the most errors during the day, when the bus sees a lot of activity, rather than at 4 a.m., when the whole system is idle. I won't say 'forget the bus', but let's look at the question I asked rather than looking for excuses not to answer it. (I'm aware that the trays introduce a stub on the bus in addition to the normal stubs from the drives' internal wiring and increase the ringing and transient noise on the edges of signals, but the error rate would be much higher, and the errors would occur at different times, if this were a significant problem.)

As for the BIOS, this is the second SCSI adapter and the machine is an AXP (a.k.a. an Alpha), so the BIOS is not active and there is no opportunity to set the bus speed with it.

...

The cause and handling of the parity errors are a secondary (but probably still important) problem. If they need to be covered (and they probably should be), let's do it after the question about error recovery for more mundane errors has been closed.
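
To make the question concrete, here is a rough sketch of the kind of retry decision being asked about. It is not taken from the Linux SCSI or md code: `Sense`, `next_action`, `read_with_retry` and `MAX_RETRIES` are all hypothetical names, and the limit of one retry is an assumption chosen to match the "single, recoverable error" position quoted above. The sense key and additional sense code values are the standard SCSI ones (MEDIUM ERROR with ASC 0x11 for an unrecovered read error, ASC 0x47 for a SCSI parity error); treating both as retryable, and expecting the parity error to arrive as ABORTED COMMAND, are also assumptions.

```python
from dataclasses import dataclass

# Standard SCSI sense key values.
RECOVERED_ERROR = 0x1
MEDIUM_ERROR = 0x3
ABORTED_COMMAND = 0xB

# Assumption: allow a single retry per command.
MAX_RETRIES = 1


@dataclass
class Sense:
    """Hypothetical container for the sense data of a failed command."""
    key: int
    asc: int
    ascq: int = 0


def next_action(sense, retries_done, max_retries=MAX_RETRIES):
    """Return 'ok', 'retry' or 'fail' for a completed command."""
    if sense is None or sense.key == RECOVERED_ERROR:
        return "ok"
    retryable = (
        # ASC 0x11: unrecovered read error.
        (sense.key == MEDIUM_ERROR and sense.asc == 0x11)
        # ASC 0x47: SCSI parity error (assumed to be reported as ABORTED COMMAND).
        or (sense.key == ABORTED_COMMAND and sense.asc == 0x47)
    )
    if retryable and retries_done < max_retries:
        return "retry"
    return "fail"


def read_with_retry(issue_read, max_retries=MAX_RETRIES):
    """Call issue_read() until it succeeds or the retry budget is spent.

    issue_read is a hypothetical callable that returns None on success
    or a Sense object on failure.
    """
    retries = 0
    while True:
        sense = issue_read()
        action = next_action(sense, retries, max_retries)
        if action != "retry":
            return action, retries
        retries += 1
```

With a policy like this, a read that fails once with an unrecovered read error and then succeeds would show exactly one retry in the trace, which is the evidence I am looking for and not finding.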
