I hope this is the appropriate place for this suggestion. Skip to "Suggestion" below for the tl;dr.
Problem Statement: I currently use RAID-Z2 to protect against the failure scenario of one drive failure plus corrupt data on the remaining drives. In classic RAID5, a rebuild after a failed disk often encounters corrupt data due to latent, undiscovered errors, and the rebuild fails or files come out corrupt. RAID-Z is usually sufficient here, assuming no bit rot or corruption has occurred since the last scrub. However, there is still a risk that some data has been corrupted since the last scrub, in which case a drive failure may result in unrecoverable data. Hence my use of RAID-Z2. (I've seen it explained in the past that if there is corrupt data on the remaining drives after a drive failure, that data is unrecoverable. My apologies if this assertion is incorrect.)

I feel the odds of this scenario are greater than the odds of a two-drive failure. Historically, the biggest argument against RAID5 seemed to be rebuild failure due to latent errors. A mirror carries the same risk: it can be rebuilt after a drive failure, but data corrupted since the last scrub cannot be repaired. I think this risk is greater for home NAS systems, where people only occasionally turn their NAS on, so it may have been sitting for a long time since the last scrub.

I realize the probability of such a scenario is arguable, but plenty of people have given strong arguments against the need for something like RAID-Z2/RAID6 to protect against two-drive failures purely on speculative probabilities, and that doesn't mean it's not a risk. Failures are rare, and things like the RAID5 rebuild-failure scenario weren't well publicized until some time later. So while one can probably build strong arguments against the desire (I won't say "need") for such a feature, I think it has merit.
Suggestion: Implement Reed-Solomon error correction similar to what some compression/archiving software does: you choose a percentage of the data on each drive to be stored as error-correction data, to be used for recovery, potentially during a rebuild.

Example: I create a mirror and specify ErrorTolerance = 10%, write 100 GB of data, and ZFS generates 10 GB of error-correction data on each drive. Currently, the closest thing to this is a mirror with copies=2, which allows error correction during a rebuild if some data on the remaining mirror happens to be corrupted, but it uses significantly more space.

Example 2: I create a RAID-Z with 3 drives and ErrorTolerance = 10%, write 100 GB of data, and ZFS generates 5 GB of error-correction data on each drive (10% of the 50 GB actually used per drive). This setup can sustain one drive failure plus corrupt data on the remaining two drives and still recover completely, assuming less than 10% of the data is corrupt. Given that periodic scrubbing significantly reduces the chance of encountering large amounts of errors, a very small ErrorTolerance would probably be sufficient, perhaps as little as 2%, since we are only guarding against errors that might occur between one scrub and the next.

Benefits: You obtain protection against the drive-failure-plus-corruption scenario with significantly less space and/or fewer drives. If you are concerned about this scenario but not about the two-drive-failure scenario, then RAID-Z2 or mirror+copies=2 is overkill. This feature would provide a happy medium where you devote some additional space to extra data protection without having to commit additional drives, or double capacity to accommodate copies=2.

Possible Implementation (this might sound naive, but I put some thought into how it might be done efficiently): You could write larger chunks of the error-correction code periodically after each write.
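To make the space arithmetic in the two examples above concrete, here is a short sanity check. The `ErrorTolerance` knob and the helper below are hypothetical (nothing like this exists in ZFS today), and the per-drive figures ignore real allocation details such as RAID-Z parity and padding:

```python
# Sanity-check of the space overhead in the two examples above.
# The ErrorTolerance setting and this helper are hypothetical; real
# ZFS space accounting would differ.

def parity_overhead_gb(data_gb, data_drives, tolerance_pct):
    """Per-drive data and error-correction sizes for `data_gb` of user
    data spread across `data_drives` drives' worth of capacity."""
    per_drive_data = data_gb / data_drives
    per_drive_parity = per_drive_data * tolerance_pct / 100
    return per_drive_data, per_drive_parity

# Example 1: 2-way mirror, ErrorTolerance = 10%. Each side of the
# mirror holds the full 100 GB, so each drive gets 10 GB of parity.
print(parity_overhead_gb(100, 1, 10))   # (100.0, 10.0)

# Example 2: 3-drive RAID-Z, roughly 2 data drives' worth of capacity,
# ErrorTolerance = 10%: 50 GB of data per drive, 5 GB of parity each.
print(parity_overhead_gb(100, 2, 10))   # (50.0, 5.0)
```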
The error-correction data doesn't have to be written as part of the write itself, but can be written at some point "shortly" afterwards. This would allow you to accumulate the writes that still need error-correction data and handle them as one batch, probably waiting for enough of them to make the calculation worthwhile; at 2%, you are only writing one block of error-correction code for every 50 blocks. You would have to track which blocks each code block is associated with. I don't know whether blocks written sequentially have some sort of sequential identifier, but perhaps each Reed-Solomon code could get away with just a bit of metadata naming the first and last blocks it applies to. That's beyond my knowledge of ZFS. Of course, there are probably lots of interesting problems to solve in a buffered/delayed Reed-Solomon write; I certainly don't mean to imply that this would be easy. Thank you for your consideration.

______________
Aaron Shumaker

"I object to doing things that computers can do." -Olin Shivers
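P.S. For what it's worth, the batching and bookkeeping idea could be sketched roughly as below. Simple XOR parity stands in for Reed-Solomon (a real implementation would use RS codes over larger groups), and the block-ID scheme is pure assumption on my part; the point is only the delayed, batched parity write with first/last-block metadata:

```python
# Rough sketch of the delayed/batched parity write: buffer written
# blocks, and once a full group has accumulated, emit one parity block
# plus metadata naming the first and last block it covers. XOR parity
# is a stand-in for Reed-Solomon; block IDs are assumed sequential.

BLOCK_SIZE = 16   # bytes, toy value
GROUP_SIZE = 50   # one parity block per 50 data blocks ~= 2% overhead

class ParityBatcher:
    def __init__(self):
        self.pending = []      # (block_id, data) awaiting parity
        self.parity_log = []   # (first_id, last_id, parity_bytes)

    def write(self, block_id, data):
        """Record a data block; flush parity when a group is full."""
        assert len(data) == BLOCK_SIZE
        self.pending.append((block_id, data))
        if len(self.pending) == GROUP_SIZE:
            self.flush()

    def flush(self):
        """Compute one parity block over all pending data blocks."""
        if not self.pending:
            return
        parity = bytearray(BLOCK_SIZE)
        for _, data in self.pending:
            for i, b in enumerate(data):
                parity[i] ^= b
        first_id = self.pending[0][0]
        last_id = self.pending[-1][0]
        self.parity_log.append((first_id, last_id, bytes(parity)))
        self.pending.clear()

batcher = ParityBatcher()
for bid in range(100):
    batcher.write(bid, bytes([bid % 256]) * BLOCK_SIZE)

# 100 blocks at one parity block per 50 -> two parity records
print(len(batcher.parity_log))     # 2
print(batcher.parity_log[0][:2])   # (0, 49)
```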
_______________________________________________ developer mailing list [email protected] http://lists.open-zfs.org/mailman/listinfo/developer
