I'm not too happy with the Linux RAID5 implementation. In my
opinion, a number of changes need to be made, but I'm not sure how to
make them or get them accepted into the official distribution if I did
make the changes.
The changes I think should be made, in order of priority, are:
1) Read and write errors should be retried at least once before kicking
the drive out of the array.
2) On more persistent read errors, the failed block (or whatever unit is
represented by a buffer) should be reconstructed from the parity set,
and the buffer marked dirty so good data is written back to the disk
with the error.
3) Drives should not be kicked out of the array unless they are having
really persistent problems. I have an idea of how to define 'really
persistent' but it requires a bit of math to explain, so I'll only
go into it if someone is interested.
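To make (1) and (2) concrete, here is a rough sketch in Python rather than kernel code. Everything in it is an assumption made for illustration: the Disk class is an in-memory stand-in for an array member, read_block and xor_blocks are made-up names, and a write to a bad block is assumed to let the drive remap the sector. It relies only on the RAID5 property that any block in a stripe is the XOR of the other blocks in that stripe. It does not attempt (3).

```python
from functools import reduce


class ReadError(Exception):
    """Raised by the simulated disk when a block cannot be read."""


class Disk:
    """Hypothetical in-memory stand-in for one array member."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.bad = {}  # block index -> number of reads that will still fail

    def read(self, idx):
        remaining = self.bad.get(idx, 0)
        if remaining > 0:
            self.bad[idx] = remaining - 1
            raise ReadError(idx)
        return self.blocks[idx]

    def write(self, idx, data):
        self.blocks[idx] = data
        # Assumption: rewriting the block lets the drive remap the sector.
        self.bad.pop(idx, None)


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def read_block(disks, member, idx, retries=1):
    """Read block idx from disks[member] without failing the drive."""
    disk = disks[member]
    # (1) Retry the read before treating the error as persistent.
    for _ in range(1 + retries):
        try:
            return disk.read(idx)
        except ReadError:
            continue
    # (2) Persistent error: rebuild the block from the rest of the stripe
    # (data plus parity) and write the good data back to the bad disk.
    others = [d.read(idx) for i, d in enumerate(disks) if i != member]
    data = xor_blocks(others)
    disk.write(idx, data)
    return data
```

In this sketch the drive stays in the array in both cases. Whether a rewrite really clears the error depends on the drive, so a real implementation would have to check the result of the write-back as well.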
Then there are two changes that might improve recovery performance:
4) If the drive being kicked out is not totally inoperable and there is
a spare drive to replace it, try to copy the data from the failing
drive to the spare rather than reconstructing the data from all the
other disks. Fall back to full reconstruction if the error rate gets
too high.
5) When doing (4), use the SCSI 'copy' command if the drives are on the
same bus and the host adapter and driver both support 'copy'. However,
this should be done with caution. 'copy' is not generally used and
any number of undetected firmware bugs might make it unreliable.
An additional category may need to be added to the device black list
to flag devices that cannot do 'copy' reliably.
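The copy-with-fallback idea in (4) could look roughly like the following Python sketch. The names rebuild_spare, read_failing and read_peers are hypothetical, and max_errors is an arbitrary placeholder, not a proposed definition of an error rate that is 'too high'. The SCSI 'copy' command from (5) is not modelled here; the copy is done block by block through the host.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def rebuild_spare(read_failing, read_peers, n_blocks, max_errors=8):
    """Fill a spare, preferring a straight copy from the failing drive.

    read_failing(i) returns block i of the failing drive or raises OSError.
    read_peers(i) returns block i from every other member of the array.
    Both callables and the max_errors threshold are assumptions.
    """
    spare = []
    errors = 0
    copy_mode = True
    for i in range(n_blocks):
        if copy_mode:
            try:
                spare.append(read_failing(i))
                continue
            except OSError:
                errors += 1
                if errors > max_errors:
                    # Too many errors: stop touching the failing drive and
                    # fall back to full reconstruction for the remainder.
                    copy_mode = False
        # Reconstruct this block from the other disks in the stripe.
        spare.append(xor_blocks(read_peers(i)))
    return spare, errors, copy_mode
```

While the failing drive behaves, each block costs one read instead of a read from every other member. Only the blocks that error out, plus everything after the threshold is crossed, pay the full reconstruction cost.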