I'm not too happy with the Linux RAID5 implementation. In my
opinion, a number of changes need to be made, but I'm not sure how to
make them, or how to get them accepted into the official distribution
if I did.

The changes I think should be made, in order of priority, are:

1) Read and write errors should be retried at least once before kicking
   the drive out of the array.

2) On more persistent read errors, the failed block (or whatever unit is
   represented by a buffer) should be reconstructed from the parity set,
   and the buffer marked dirty so good data is written back to the disk
   with the error.

3) Drives should not be kicked out of the array unless they are having
   really persistent problems. I have an idea of how to define 'really
   persistent', but it takes a bit of math to explain, so I'll only
   go into it if someone is interested.
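
To make (1) and (2) concrete, here is a rough user-space sketch in
Python. It is not the md driver's code. FlakyDisk, ReadError,
MAX_RETRIES and read_block are names made up for illustration, and the
rebuilt block is written straight back instead of being marked dirty in
a buffer cache. The only RAID5 fact it relies on is that any one block
in a stripe is the XOR of all the other blocks, parity included.

```python
from functools import reduce

MAX_RETRIES = 1  # item 1: retry at least once (the value is an assumption)


class ReadError(Exception):
    """Hypothetical exception standing in for a low-level read error."""


class FlakyDisk:
    """Toy in-memory disk; blocks in bad_blocks fail on read until rewritten."""

    def __init__(self, blocks, bad_blocks=()):
        self.blocks = list(blocks)
        self.bad_blocks = set(bad_blocks)

    def read(self, n):
        if n in self.bad_blocks:
            raise ReadError(n)
        return self.blocks[n]

    def write(self, n, data):
        self.blocks[n] = data
        # Assumption: rewriting a bad block lets the drive remap the sector.
        self.bad_blocks.discard(n)


def xor_blocks(blocks):
    """XOR equal-length byte strings together (RAID5 parity arithmetic)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_block(disks, idx, n):
    """Read block n of disks[idx]: retry first, then rebuild from the others."""
    for _ in range(1 + MAX_RETRIES):
        try:
            return disks[idx].read(n)
        except ReadError:
            continue
    # Persistent error: XOR of all other members (data + parity) is the block.
    rebuilt = xor_blocks([d.read(n) for i, d in enumerate(disks) if i != idx])
    disks[idx].write(n, rebuilt)  # put good data back on the disk with the error
    return rebuilt
```

With three members (two data, one parity), read_block([d0, d1, p], 0, n)
returns d0's block even when d0 cannot read it, and d0 holds good data
again afterwards. The drive is never kicked out on this path.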

Then there are two changes that might improve recovery performance:

4) If the drive being kicked out is not totally inoperable and there is
   a spare drive to replace it, try to copy the data from the failing
   drive to the spare rather than reconstructing the data from all the
   other disks. Fall back to full reconstruction if the error rate gets
   too high.

5) When doing (4), use the SCSI 'copy' command if the drives are on the
   same bus and the host adapter and driver support 'copy'. However,
   this should be done with caution. 'copy' is not generally used, and
   any number of undetected firmware bugs might make it unreliable.
   An additional category may need to be added to the device blacklist
   to flag devices that cannot do 'copy' reliably.
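
The control flow for (4) could look something like the Python sketch
below. Again, this is only an illustration. rebuild_spare, its callback
parameters and the MAX_COPY_ERRORS threshold are hypothetical, and a
simple error count stands in for whatever "error rate too high" should
really mean. It does not touch (5) at all.

```python
MAX_COPY_ERRORS = 8  # hypothetical threshold for "error rate too high"


def rebuild_spare(read_failing, reconstruct, write_spare, nblocks,
                  max_errors=MAX_COPY_ERRORS):
    """Fill the spare, preferring direct copies from the failing drive.

    read_failing(n) returns block n or raises OSError.
    reconstruct(n) rebuilds block n from the surviving members.
    write_spare(n, data) stores block n on the spare.
    Returns the number of blocks that had to be reconstructed.
    """
    errors = 0
    reconstructed = 0
    for n in range(nblocks):
        data = None
        # Keep copying only while the failing drive is still mostly readable.
        if errors < max_errors:
            try:
                data = read_failing(n)
            except OSError:
                errors += 1
        # Unreadable block, or too many errors so far: fall back to parity.
        if data is None:
            data = reconstruct(n)
            reconstructed += 1
        write_spare(n, data)
    return reconstructed
```

Once the error count reaches max_errors the loop stops reading the
failing drive entirely, which is the "fall back to full reconstruction"
case. Below that threshold only the unreadable blocks cost a read from
every other disk.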
