I hope this is the appropriate place for this suggestion.  Skip to
"Suggestion" if tl;dr.

Problem Statement:
I currently use RAID-Z2 to protect against the failure scenario of one
drive failure plus corrupt data on the remaining drives.  Consider that
in classic RAID5, corrupt data is often encountered during a rebuild
after a failed disk, and the rebuild fails or files end up corrupt, due
to latent, undiscovered errors.

RAID-Z is usually sufficient for this, assuming no bit rot or corruption
has occurred since the last scrub.  However, there is still a risk that
some data may have been corrupted since the last scrub, and thus a drive
failure may result in unrecoverable data.  Hence the reason I use
RAID-Z2.  (I've seen it explained in the past that if there is corrupt
data on the remaining drives after a drive failure, then that data is
unrecoverable.  My apologies if this assertion is incorrect.)

I feel the odds of these scenarios are greater than the odds of a
two-drive failure.  Historically, it seemed like the biggest argument
against RAID5 was rebuild failure due to latent errors.

A mirror has the same risk.  It can be rebuilt in a drive-failure
scenario, but if there is corrupt data (introduced since the last scrub)
it cannot be repaired.  I think this risk is greater for home NAS
systems where people only occasionally turn their NAS on, so it may have
been sitting for a long time since the last scrub.

I realize the probability of such a scenario is arguable, but plenty of
people have given strong arguments against the need for something like
RAID-Z2/RAID6 to protect against two drive failures purely based on
speculative probabilities, and that doesn't mean it's not a risk.
Failures are rare, and things like the RAID5 rebuild-failure scenario
weren't well publicized until long after RAID5 was in wide use.  So
while one can probably build some strong arguments against the desire (I
won't say "need") for such a feature, I think it has merit.

Suggestion:
Implement Reed-Solomon error correction similar to what some
compression/archiving software offers: you choose a percentage of the
data stored on each drive to be used as error-correction data, which can
then be used for repair, for example during a rebuild.

Example: I create a mirror and specify ErrorTolerance = 10%, write
100 GB of data, and ZFS generates 10 GB of error-correction data on each
drive.

Currently, the closest thing to this is to create a mirror with
copies=2, which allows error correction during a rebuild in case some
data on the remaining mirror happens to be corrupted.  This uses
significantly more space.

Example 2: I create a RAID-Z with 3 drives and ErrorTolerance = 10%,
write 100 GB of data, and ZFS generates 5 GB of error-correction data on
each drive (10% of the 50 GB per drive actually used).

This setup can sustain one drive failure plus corrupt data on the
remaining two drives and still recover completely, assuming less than
10% of the data is corrupt.
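To make the arithmetic in the two examples concrete, here is a quick
sketch (the ErrorTolerance setting itself is hypothetical; only the
percentages come from the examples above):

```python
def ecc_per_drive_gb(per_drive_usage_gb, tolerance):
    # Error-correction data each drive would carry under the proposed
    # (hypothetical) ErrorTolerance setting.
    return per_drive_usage_gb * tolerance

# Example 1: mirror -- each drive holds a full copy of the 100 GB.
mirror_ecc = ecc_per_drive_gb(100, 0.10)      # 10 GB per drive

# Example 2: 3-drive RAID-Z -- 100 GB of data plus single parity is
# 150 GB on disk, i.e. 50 GB of actual usage per drive.
raidz_ecc = ecc_per_drive_gb(150 / 3, 0.10)   # 5 GB per drive
```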

Given that periodic scrubbing significantly reduces the chance of
encountering large numbers of errors, a very small ErrorTolerance would
probably be sufficient, perhaps as little as 2%, since we are only
guarding against errors that might occur between one scrub and the next.

Benefits: You are able to obtain protection against the drive
failure-plus-corruption scenario with significantly less space used
and/or fewer drives.  If you are concerned about this scenario but not
worried about a two-drive failure, then RAID-Z2 or mirror+copies=2 is
overkill.  This feature would provide a happy medium where you devote
some additional space to extra data protection without having to commit
additional drives or double capacity to accommodate copies=2.

Possible Implementation (this might sound naive, but I put some thought
into how it might be done efficiently):
You could write larger chunks of the error-correction code periodically
after each write.  It doesn't have to be written as part of the write
itself, but could happen at some point "shortly" afterwards.  This would
let you accumulate the writes that need error correction and process
them as one batch, probably waiting until there are enough to make the
calculation worthwhile, since at 2% you are only writing one block of
error-correction code for every 50 data blocks.  You'd have to track
which blocks each code is associated with.  I don't know whether blocks
written sequentially have some sort of sequential identifier, but maybe
each Reed-Solomon code could get away with just a bit of metadata naming
the first and last block it applies to.  That's beyond my knowledge of
ZFS.  Of course, there are probably lots of interesting problems to
solve in having a buffered/delayed Reed-Solomon write.  I certainly
don't mean to imply that I think this would be easy to do.
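As a toy illustration of the batching idea (not a proposal for actual
ZFS internals), here is a sketch where plain XOR parity stands in for
real Reed-Solomon: one XOR block can rebuild exactly one lost or corrupt
block in its group, provided checksums (as ZFS already has) identify
which block is bad.  All names and sizes here are hypothetical:

```python
BLOCK_SIZE = 4096   # illustrative block size, not a real ZFS constant
RATIO = 50          # 1 recovery block per 50 data blocks ~ 2% overhead

class BatchedParity:
    def __init__(self):
        self.pending = []   # (block_id, data) awaiting a recovery block
        self.groups = []    # (first_id, last_id, parity) metadata records

    def write_block(self, block_id, data):
        # Accumulate writes; emit one recovery block per RATIO blocks.
        self.pending.append((block_id, data))
        if len(self.pending) == RATIO:
            self._flush()

    def _flush(self):
        parity = bytearray(BLOCK_SIZE)
        for _, data in self.pending:
            for i, b in enumerate(data):
                parity[i] ^= b
        # Per the note above: just record the first and last block covered.
        first_id, last_id = self.pending[0][0], self.pending[-1][0]
        self.groups.append((first_id, last_id, bytes(parity)))
        self.pending = []

    def recover(self, group_index, surviving_blocks):
        # XOR of the parity with all surviving blocks of the group
        # reconstructs the one missing block.
        _, _, parity = self.groups[group_index]
        out = bytearray(parity)
        for data in surviving_blocks:
            for i, b in enumerate(data):
                out[i] ^= b
        return bytes(out)
```

Real Reed-Solomon generalizes this: with k recovery blocks per group you
can rebuild up to k bad blocks, which is what would make a
percentage-style knob possible.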

Thank you for your consideration.
______________
Aaron Shumaker
"I object to doing things that computers can do." -Olin Shivers
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
