(I've taken Alan and Linus off the Cc list. I'm sure they have plenty
to read, and may even be on linux-raid anyway).
On Thursday March 15, [EMAIL PROTECTED] wrote:
> I'm not too happy with the linux RAID5 implementation. In my
> opinion, a number of changes need to be made, but I'm not sure how to
> make them or get them accepted into the official distribution if I did
> make the changes.
I've been doing a fair bit of development with RAID5 lately and Linus
seems happy to accept patches from me, and I am happy to work with you
(or anyone else) to make improvements and then submit them to Linus.
There was a paper in the 2000 USENIX Annual Technical Conference
titled "Towards Availability Benchmarks: A Case Study of Software
RAID Systems" by Aaron Brown and David A. Patterson of UCB.
They built a neat rig for testing fault tolerance and fault handling
in raid systems and compared Linux, Solaris, and WinNT.
Their particular comment about Linux was that it seemed to evict
drives on any excuse, just as you observe. Apparently the other
systems tried much harder to keep drives in the working set if
possible.
It is certainly worth a read if you are interested in this.
My feeling about retrying after failed IO is that it should be done at
a lower level. Once the SCSI or IDE level tells us that there is a
READ error, or a WRITE error, we should believe them.
Now it appears that this retrying isn't actually happening at the
lower level: at least not for all drivers.
So while I would not be strongly against putting that sort of re-try
logic at the RAID level, I think it would be worth the effort to find
out why it isn't being done at a lower level.
As for re-writing after a failed read, that certainly makes sense, and
probably wouldn't be too hard.
You would introduce into the "struct stripe_head" a way to mark a
drive as "read-failed".
Then on a read error, you mark that drive as read-failed in that
stripe only and schedule a retry.
If the retry succeeds, you then schedule a write, and if that
works, you just continue on happily.
You would need to make sure that you aren't too generous: once you
have had some number of read errors on a given drive you really should
fail that drive anyway.
> 3) Drives should not be kicked out of the array unless they are having
> really persistent problems. I've an idea on how to define 'really
> persistent' but it requires a bit of math to explain, so I'll only
> go into it if someone is interested.
I'd certainly be interested in reading your math.
>
> Then there are two changes that might improve recovery performance:
>
> 4) If the drive being kicked out is not totally inoperable and there is
> a spare drive to replace it, try to copy the data from the failing
> drive to the spare rather than reconstructing the data from all the
> other disks. Fall back to full reconstruction if the error rate gets
> too high.
That would actually be fairly easy to do. Once you get the data
structures right so that the concept of a "partially failed" drive can
be clearly represented, it should be a cinch.
>
> 5) When doing (4) use the SCSI 'copy' command if the drives are on the
> same bus, and the host adapter and driver supports 'copy'. However,
> this should be done with caution. 'copy' is not generally used and
> any number of undetected firmware bugs might make it unreliable.
> An additional category may need to be added to the device black list
> to flag devices that can not do 'copy' reliably.
I'm very much against this sort of idea. Currently the raid code is
blissfully unaware of the underlying technology: it could be scsi,
ide, ramdisc, netdisk or anything else and RAID just doesn't care.
This I believe is one of the strengths of software RAID - it is
cleanly layered.
Firmware (==hardware) raid controllers often try to "know" a lot about
the underlying drive - even to the extent of getting the drives to do
the XOR themselves I believe. This undoubtedly makes the code more
complex, and can lead to real problems if you have firmware-mismatches
(and we have had a few of those).
Stick with "read" and "write" I think. Everybody understands what
they mean so it is much more likely to work.
And really, our rebuild performance isn't that bad. The other
interesting result for Linux in that paper is that an in-progress
rebuild made almost no impact on performance, while it did for
Solaris and NT (but Linux did rebuild much more slowly).
So if you want to do this, dive right in and have a go.
I am certainly happy to answer any questions, review any code, and
forward anything that looks good to Linus.
NeilBrown