Re: raid6 check/repair
Dear Neil,

this thread has died out, but I'd prefer not to let it end without any kind of result being reached. Therefore, I'm kindly asking you to draw a conclusion from the arguments that have been exchanged. Concerning the implementation of a 'repair' that can actually recover data in some cases instead of just recalculating parity: do you
a) oppose it (patches not accepted)
b) not care (but potentially accept patches)
c) support it?

Thank you very much and kind regards,

Thiemo Nagel

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid6 check/repair
On Wed, Dec 05, 2007 at 03:31:14PM -0500, Bill Davidsen wrote:

> BTW: if this can be done in a user program, mdadm, rather than by code in the kernel, that might well make everyone happy. Okay, realistically less unhappy.

I'm starting to like the idea. Of course you can't repair a running array from user space (just think of something rewriting the full stripe while mdadm is trying to fix the old data - you could end up with the data disks containing the new data but the "fixed" disks rewritten with the old data). We would just need to make the kernel not try to fix anything but merely report that something is wrong - but wait, using 'check' instead of 'repair' does that already. So the kernel is fine as it is; we just need a simple user-space utility that can take the components of a non-running array and repair a given stripe using whatever method is appropriate. Shouldn't be too hard to write for anyone interested...

Gabor

--
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
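A first cut at such a tool could be as small as the sketch below, which checks one stripe of a stopped array from its component images. To keep it hedged: the chunk size, the flat layout (no per-stripe parity rotation, RAID-5-style XOR only) and the file paths are illustrative assumptions; a real utility would read the md superblocks for the actual geometry and handle Q parity as well.

```python
CHUNK = 64 * 1024  # assumed chunk size; real arrays record this in the superblock

def read_chunk(path, stripe, chunk=CHUNK):
    """Read one chunk of a component image at the given stripe index."""
    with open(path, 'rb') as f:
        f.seek(stripe * chunk)
        return f.read(chunk)

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def check_stripe(component_paths, stripe):
    """XOR-parity consistency test: the XOR of all data chunks plus the
    parity chunk must be all zeroes if the stripe is consistent."""
    chunks = [read_chunk(p, stripe) for p in component_paths]
    return xor_blocks(chunks) == bytes(len(chunks[0]))
```

Run it against the members of a *stopped* array only, for exactly the reason given above: a concurrent writer would race the repair.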
Re: raid6 check/repair
On 15:31, Bill Davidsen wrote:

> Thiemo posted metacode which appears correct to me,

It assumes that _exactly_ one disk has bad data, which is hard to verify in practice. But yes, it's probably the best one can do if both P and Q happen to be incorrect. IMHO mdadm shouldn't do this automatically, though, and it should always keep backup copies of the data it overwrites with supposedly good data.

Andre

--
The only person who always got his work done by Friday was Robinson Crusoe
Re: raid6 check/repair
Peter Grandi wrote:

> [ ... on RAID1, ... RAID6 error recovery ... ]
>
> tn> The use case for the proposed 'repair' would be occasional, low-frequency corruption, for which many sources can be imagined: Any piece of hardware has a certain failure rate, which may depend on things like age, temperature, stability of operating voltage, cosmic rays, etc., but also on variations in the production process. Therefore, hardware may suffer from infrequent glitches, which are seldom enough to be impossible to trace back to a particular piece of equipment. It would be nice to recover gracefully from that.
>
> What has this got to do with RAID6 or RAID in general? I have been following this discussion with a sense of bewilderment, as I have started to suspect that parts of it are based on a very large misunderstanding.
>
> tn> Kernel bugs or just plain administrator mistakes are another thing.
>
> The biggest administrator mistakes are lack of end-to-end checking and of backups. Those that don't have them wish their storage systems could detect and recover from arbitrary and otherwise undetected errors (but see below for bad news on silent corruptions).
>
> tn> But also the case of power-loss during writing that you have mentioned could profit from that 'repair': With heterogeneous hardware, blocks may be written in unpredictable order, so that in more cases graceful recovery would be possible with 'repair' compared to just recalculating parity.
>
> Redundant RAID levels are designed to recover only from _reported_ errors that identify precisely where the error is. Recovering from random block writing seems to me quite outside the scope of a low-level virtual storage device layer.
>
> ms> I just want to give another suggestion. It may or may not be possible to repair inconsistent arrays but in either way some code there MUST at least warn the administrator that something (may) went wrong.
>
> tn> Agreed.
> That sounds instead quite extraordinary to me because it is not clear how to define ''inconsistency'' in the general case, never mind detect it reliably, and never mind knowing, when it is found, how to determine which are the good data bits and which are the bad. Now I am starting to think that this discussion is based on the curious assumption that storage subsystems should solve the so-called ''byzantine generals'' problem, that is to operate reliably in the presence of unreliable communications and storage.

I had missed that. In fact, after rereading most of the thread I *still* miss that, so perhaps it's not there. What the OP proposed was that in the case where there is incorrect data on exactly one chunk in a raid-6 slice, the incorrect chunk be identified and rewritten with correct data. This is based on the assumptions that (a) this case can be identified, (b) the correct data value for the chunk can be calculated, (c) this only adds processing or i/o overhead when an error condition is identified by the existing code, and (d) this can be done without significant additional i/o other than rewriting the corrected data.

Given these assumptions, the reasons for not adding this logic would seem to be (a) one of the assumptions is wrong, (b) it would take a huge effort to code or maintain, or (c) it's wrong for raid to fix errors other than hardware errors, even if it could do so. Although I've looked at the logic in metacode form, and at the code for doing the check now, I realize that the assumptions could be wrong, and invite enlightenment. But Thiemo posted metacode which appears correct to me, so I don't think it's a huge job to code, and since it is in a code path which currently always hides an error, it's hard to understand how added code could make things worse than they are.
I can actually see the philosophical argument about handling only disk errors in raid code, but at least it should be a clear decision made for that reason, and not hidden behind arguments that this happens rarely. Given the state of current hardware, I think virtually all errors happen rarely; the problem is that all problems happen occasionally (ref. Murphy's Law). We have a tool (check) which finds these problems, why not a tool to fix them?

BTW: if this can be done in a user program, mdadm, rather than by code in the kernel, that might well make everyone happy. Okay, realistically less unhappy.

--
Bill Davidsen [EMAIL PROTECTED]
"Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismarck
Re: raid6 check/repair
[EMAIL PROTECTED] (Peter Grandi) writes:

> ms> I just want to give another suggestion. It may or may not be possible to repair inconsistent arrays but in either way some code there MUST at least warn the administrator that something (may) went wrong.
>
> tn> Agreed.
>
> That sounds instead quite extraordinary to me because it is not clear how to define ''inconsistency'' in the general case, never mind detect it reliably, and never mind knowing, when it is found, how to determine which are the good data bits and which are the bad.

I don't quite follow you. Having a basic consistency check utility for a raid array is to me as obvious as having an fsck utility for a file system.

> Now I am starting to think that this discussion is based on the curious assumption that storage subsystems should solve the so-called ''byzantine generals'' problem, that is to operate reliably in the presence of unreliable communications and storage.

I don't think anyone is proposing to solve that problem. However, an occasional slight nod in acknowledgment of the fact that real world communications and storage *are* unreliable wouldn't be out of place.

--
Leif Nixon - Systems expert - National Supercomputer Centre - Linkoping University
Re: raid6 check/repair
[ ... on RAID1, ... RAID6 error recovery ... ]

tn> The use case for the proposed 'repair' would be occasional, low-frequency corruption, for which many sources can be imagined: Any piece of hardware has a certain failure rate, which may depend on things like age, temperature, stability of operating voltage, cosmic rays, etc., but also on variations in the production process. Therefore, hardware may suffer from infrequent glitches, which are seldom enough to be impossible to trace back to a particular piece of equipment. It would be nice to recover gracefully from that.

What has this got to do with RAID6 or RAID in general? I have been following this discussion with a sense of bewilderment, as I have started to suspect that parts of it are based on a very large misunderstanding.

tn> Kernel bugs or just plain administrator mistakes are another thing.

The biggest administrator mistakes are lack of end-to-end checking and of backups. Those that don't have them wish their storage systems could detect and recover from arbitrary and otherwise undetected errors (but see below for bad news on silent corruptions).

tn> But also the case of power-loss during writing that you have mentioned could profit from that 'repair': With heterogeneous hardware, blocks may be written in unpredictable order, so that in more cases graceful recovery would be possible with 'repair' compared to just recalculating parity.

Redundant RAID levels are designed to recover only from _reported_ errors that identify precisely where the error is. Recovering from random block writing seems to me quite outside the scope of a low-level virtual storage device layer.

ms> I just want to give another suggestion. It may or may not be possible to repair inconsistent arrays but in either way some code there MUST at least warn the administrator that something (may) went wrong.

tn> Agreed.
That sounds instead quite extraordinary to me because it is not clear how to define ''inconsistency'' in the general case, never mind detect it reliably, and never mind knowing, when it is found, how to determine which are the good data bits and which are the bad. Now I am starting to think that this discussion is based on the curious assumption that storage subsystems should solve the so-called ''byzantine generals'' problem, that is to operate reliably in the presence of unreliable communications and storage.

ms> I had an issue once where the chipset / mainboard was broken so on one raid1 array different data was written to the disks occasionally [ ... ]

Indeed. Some links from a web search:

  http://en.Wikipedia.org/wiki/Byzantine_Fault_Tolerance
  http://pages.CS.Wisc.edu/~sschang/OS-Qual/reliability/byzantine.htm
  http://research.Microsoft.com/users/lamport/pubs/byz.pdf

ms> and linux-raid / mdadm did not complain or do anything.

The mystic version of Linux-RAID is in psi-test right now :-). To me RAID does not seem the right abstraction level to deal with this problem; and perhaps the file system level is not either, even if ZFS tries to address some of the problem. However there are ominous signs that the storage version of the Byzantine generals problem is happening in particularly nasty forms. For example as reported in this very, very scary paper:

  https://InDiCo.DESY.DE/contributionDisplay.py?contribId=65&sessionId=42&confId=257

where some of the causes have apparently been identified recently, see slides 11, 12 and 13:

  http://InDiCo.FNAL.gov/contributionDisplay.py?contribId=44&sessionId=15&confId=805

So I guess that end-to-end verification will have to become more common, but which form it will take is not clear (I always use a checksummed container format for important long term data).
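The "checksummed container" habit can be approximated with something as plain as a digest manifest. This sketch is illustrative only (SHA-256 is my choice here, not anything Peter names): it records file digests once and later reports files whose contents have silently changed, which is exactly the class of corruption RAID parity never sees.

```python
import hashlib

def file_digest(path):
    """SHA-256 of a file, streamed so large files need not fit in RAM."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(65536), b''):
            h.update(block)
    return h.hexdigest()

def make_manifest(paths):
    """Record the current digest of every file."""
    return {p: file_digest(p) for p in paths}

def verify_manifest(manifest):
    """Return the paths whose current contents no longer match the manifest."""
    return [p for p, digest in manifest.items() if file_digest(p) != digest]
```

Verification like this is end-to-end in the sense that it does not matter whether the bits rotted on the platter, on the bus, or in a buggy driver: any change since the manifest was written is caught.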
Re: raid6 check/repair
On Tue, 4 Dec 2007, Peter Grandi wrote:

> ms> and linux-raid / mdadm did not complain or do anything.
>
> The mystic version of Linux-RAID is in psi-test right now :-). To me RAID does not seem the right abstraction level to deal with this problem; and perhaps the file system level is not either, even if ZFS tries to address some of the problem.

Hm. If I run a check on a raid1, I would expect it to read data from both disks and compare them, and complain if they're not identical. Are you sure you really mean what you're saying here?

I do realise that if the corruption happens above the raid layer then there is nothing we can do, but if md asks to write a block to two raid1 disks and the system corrupts the write so that different data lands on the two drives, then when md does a check at a later time and discovers this, it should scream bloody murder, choose one copy of the data and replicate it to the other drive...? I know this might as well be the wrong data, and md can't figure that out, but it should correct the *raid1* inconsistency, which I think is what the person you replied to meant?

--
Mikael Abrahamsson  email: [EMAIL PROTECTED]
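What Mikael describes for a raid1 'check' can be sketched in a few lines of user-space Python over two member images (the block size and the per-offset reporting are illustrative assumptions; the in-kernel check only bumps a mismatch count in sysfs):

```python
BLOCK = 4096  # assumed comparison granularity

def find_mismatches(path_a, path_b, block=BLOCK):
    """Read both mirror images block by block and return the byte
    offsets of blocks whose contents differ."""
    offsets = []
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        pos = 0
        while True:
            a = fa.read(block)
            b = fb.read(block)
            if not a and not b:
                break
            if a != b:
                offsets.append(pos)
            pos += block
    return offsets
```

Repairing would then be the "choose one copy" step: overwrite the block on one member with the copy from the other, knowing full well that md has no way to tell which copy is right.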
Re: raid6 check/repair
Dear Michael,

Michael Schmitt wrote:

> Hi folks,

Probably erroneously, you have sent this mail only to me, not to the list...

> I just want to give another suggestion. It may or may not be possible to repair inconsistent arrays but in either way some code there MUST at least warn the administrator that something (may) went wrong.

Agreed.

> I don't know if linux-raid is the right code to implement this, but I think it is the most obvious place to implement it I guess.

It's on the todo-list, I think, judging from this Debian bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=405919

Kind regards,

Thiemo Nagel

> I thought this suggestion was once noted in this thread but I am not sure and I did not find it anymore, so please bear with me if I wrote it again. I had an issue once where the chipset / mainboard was broken, so on one raid1 array different data was written to the disks occasionally, and linux-raid / mdadm did not complain or do anything. Just by coincidence I found this issue at the time. I did report this here in February too.
Re: raid6 check/repair
Dear Neil,

>> The point that I'm trying to make is, that there does exist a specific case, in which recovery is possible, and that implementing recovery for that case will not hurt in any way.
>
> Assuming that is true (maybe hpa got it wrong), what specific conditions would lead to one drive having corrupt data, and would correcting it on an occasional 'repair' pass be an appropriate response?

The use case for the proposed 'repair' would be occasional, low-frequency corruption, for which many sources can be imagined: Any piece of hardware has a certain failure rate, which may depend on things like age, temperature, stability of operating voltage, cosmic rays, etc., but also on variations in the production process. Therefore, hardware may suffer from infrequent glitches, which are seldom enough to be impossible to trace back to a particular piece of equipment. It would be nice to recover gracefully from that. Kernel bugs or just plain administrator mistakes are another thing. But also the case of power-loss during writing that you have mentioned could profit from that 'repair': With heterogeneous hardware, blocks may be written in unpredictable order, so that in more cases graceful recovery would be possible with 'repair' compared to just recalculating parity.

> Does the value justify the cost of extra code complexity?

In the case of protecting data integrity, I'd say 'yes'.

> Everything costs extra. Code uses bytes of memory, requires maintenance, and possibly introduces new bugs.

Of course, you are right. However, in my other email, I tried to sketch a piece of code which is very lean, as it makes use of functions which I assume to exist. (Sorry, I haven't looked at the md code yet, so please correct me if I'm wrong.) Therefore I assume the costs in memory, maintenance and bugs to be rather low.
Kind regards,

Thiemo
Re: raid6 check/repair
Dear Neil and Eyal,

Eyal Lebedinsky wrote:

> Neil Brown wrote:
>> It would seem that either you or Peter Anvin is mistaken. On page 9 of http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf at the end of section 4 it says: "Finally, as a word of caution it should be noted that RAID-6 by itself cannot even detect, never mind recover from, dual-disk corruption. If two disks are corrupt in the same byte positions, the above algorithm will in general introduce additional data corruption by corrupting a third drive."
>
> The above a/b/c cases are not correct for raid6. While we can detect 0, 1 or 2 errors, any higher number of errors will be misidentified as one of these. The cases we will always see are:
>   a) no errors - nothing to do
>   b) one error - correct it
>   c) two errors - report? take the raid down? recalc syndromes?
> and any other case will always appear as *one* of these (not as [c]).

I still don't agree. I'll explain the algorithm for error handling that I have in mind; maybe you can point out if I'm mistaken at some point. We have n data blocks D1...Dn and two parities P (XOR) and Q (Reed-Solomon).
I assume the existence of two functions to calculate the parities

  P = calc_P(D1, ..., Dn)
  Q = calc_Q(D1, ..., Dn)

and two functions to recover a missing data block Dx using either parity

  Dx = recover_P(x, D1, ..., Dx-1, Dx+1, ..., Dn, P)
  Dx = recover_Q(x, D1, ..., Dx-1, Dx+1, ..., Dn, Q)

This pseudo-code should distinguish between a), b) and c) and properly repair case b):

  P' = calc_P(D1, ..., Dn);
  Q' = calc_Q(D1, ..., Dn);

  if (P' == P && Q' == Q) {
      /* case a): zero errors */
      return;
  }
  if (P' == P && Q' != Q) {
      /* case b1): Q is bad, can be fixed */
      Q = Q';
      return;
  }
  if (P' != P && Q' == Q) {
      /* case b2): P is bad, can be fixed */
      P = P';
      return;
  }

  /* both parities are bad, so we try whether the problem can be
     fixed by repairing a data block */
  for (i = 1; i <= n; i++) {
      /* assume only Di is bad, use P parity to repair */
      D' = recover_P(i, D1, ..., Di-1, Di+1, ..., Dn, P);
      /* use Q parity to check the assumption */
      Q' = calc_Q(D1, ..., Di-1, D', Di+1, ..., Dn);
      if (Q == Q') {
          /* case b3): Q parity is ok, which means the assumption
             was correct and we can fix the problem */
          Di = D';
          return;
      }
  }

  /* case c): when we get here, we have excluded cases a) and b),
     so now we really have a problem */
  report_unrecoverable_error();
  return;

Concerning misidentification: A situation can be imagined in which two or more simultaneous corruptions have occurred in a very special way, so that case b3) is diagnosed accidentally. While that is not impossible, I'd assume the probability of it to be negligible, comparable to that of undetectable corruption in a RAID 5 setup.

Kind regards,

Thiemo
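To make the pseudo-code concrete, here is a hedged byte-wise rendering in Python: one byte per data disk, arithmetic in GF(2^8) with the usual RAID-6 generator g = 2 and polynomial 0x11d. All names and the toy one-byte-per-disk model are illustrative (this is not the md code); as in the metacode, a repair is attempted only under the assumption that exactly one block is bad.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the RAID-6 polynomial 0x11d."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        high = a & 0x80
        a = (a << 1) & 0xFF
        if high:
            a ^= 0x1D  # reduce by x^8 + x^4 + x^3 + x^2 + 1
    return r

def calc_P(data):
    """XOR parity of all data bytes."""
    p = 0
    for d in data:
        p ^= d
    return p

def calc_Q(data):
    """Reed-Solomon syndrome: sum of g^i * D_i with g = 2."""
    q, g = 0, 1
    for d in data:
        q ^= gf_mul(g, d)
        g = gf_mul(g, 2)
    return q

def check_repair(data, P, Q):
    """Distinguish cases a), b1), b2), b3) and c) from the pseudo-code;
    repairs data in place for case b3) and returns what was found."""
    Pc, Qc = calc_P(data), calc_Q(data)
    if Pc == P and Qc == Q:
        return ('ok',)                      # case a): zero errors
    if Pc == P and Qc != Q:
        return ('bad_Q', Qc)                # case b1): rewrite Q := Qc
    if Pc != P and Qc == Q:
        return ('bad_P', Pc)                # case b2): rewrite P := Pc
    # Both parities mismatch: assume each data block in turn is the bad one.
    for i in range(len(data)):
        # recover_P: the value that makes the XOR parity come out right
        candidate = data[i] ^ Pc ^ P
        fixed = data[:i] + [candidate] + data[i + 1:]
        if calc_Q(fixed) == Q:              # Q confirms the assumption: case b3)
            data[i] = candidate
            return ('bad_data', i)
    return ('unrecoverable',)               # case c)
```

The loop mirrors the metacode directly; a production version would instead locate the bad disk in one step from the syndrome ratio (Q'^Q)/(P'^P) rather than re-deriving Q n times.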
Re: raid6 check/repair
Neil Brown wrote:

> On Thursday November 22, [EMAIL PROTECTED] wrote:
>
>> Dear Neil, thank you very much for your detailed answer.
>>
>> Neil Brown wrote:
>>> While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 data blocks is wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong.
>>
>> If I'm not mistaken, this is only partly correct. Using P+Q redundancy, it *is* possible to distinguish three cases:
>>   a) exactly zero bad blocks
>>   b) exactly one bad block
>>   c) more than one bad block
>> Of course, it is only possible to recover from b), but one *can* tell whether the situation is a) or b) or c) and act accordingly.
>
> It would seem that either you or Peter Anvin is mistaken. On page 9 of http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf at the end of section 4 it says: "Finally, as a word of caution it should be noted that RAID-6 by itself cannot even detect, never mind recover from, dual-disk corruption. If two disks are corrupt in the same byte positions, the above algorithm will in general introduce additional data corruption by corrupting a third drive."
>
>> The point that I'm trying to make is, that there does exist a specific case, in which recovery is possible, and that implementing recovery for that case will not hurt in any way.
>
> Assuming that is true (maybe hpa got it wrong), what specific conditions would lead to one drive having corrupt data, and would correcting it on an occasional 'repair' pass be an appropriate response? Does the value justify the cost of extra code complexity?
>
> RAID is not designed to protect against bad RAM, bad cables, chipset bugs, driver bugs etc. It is only designed to protect against drive failure, where the drive failure is apparent, i.e. a read must return either the same data that was last written, or a failure indication. Anything else is beyond the design parameters for RAID.
>> I'm taking a more pragmatic approach here. In my opinion, RAID should just protect my data, against drive failure, yes, of course, but if it can help me in case of occasional data corruption, I'd happily take that, too, especially if it doesn't cost extra... ;-)
>
> Everything costs extra. Code uses bytes of memory, requires maintenance, and possibly introduces new bugs. I'm not convinced the failure mode that you are considering actually happens with a meaningful frequency.

People accept the hardware and performance costs of raid-6 in return for the better security of their data. If I run a check and find that I have an error, right now I have to treat that the same way as an unrecoverable failure, because the repair function doesn't fix the data, it just makes the symptom go away by redoing the p and q values. This makes the naive user think the problem is solved, when in fact it's now worse: he has corrupt data with no indication of a problem. The fact that (most) people who read this list are advanced enough to understand the issue does not protect the majority of users from their ignorance. If that sounds elitist, many of the people on this list are the elite, and even knowing that you need to learn and understand more is a big plus in my book. It's the people who run repair and assume the problem is fixed who get hurt by the current behavior.

If you won't fix the recoverable case by recovering, then maybe for raid-6 you could print an error message like "can't recover data, fix parity and hide the problem (y/N)?" or require a --force flag, and at least give a heads-up to the people who just picked the most reliable raid level because they're trying to do it right, but need a clue that they have a real and serious problem which a repair can't fix. Recovering a filesystem full of plain files is pretty easy; that's what backups with CRC are for. But a large database recovery often takes hours to restore and run journal files.
I personally consider it the job of the kernel to do recovery when it is possible; absent that, I would like the tools to tell me clearly that I have a problem and what it is.

--
Bill Davidsen [EMAIL PROTECTED]
"Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismarck
Re: raid6 check/repair
Neil Brown wrote:

> On Thursday November 22, [EMAIL PROTECTED] wrote:
>
>> Dear Neil, thank you very much for your detailed answer.
>>
>> Neil Brown wrote:
>>> While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 data blocks is wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong.
>>
>> If I'm not mistaken, this is only partly correct. Using P+Q redundancy, it *is* possible to distinguish three cases:
>>   a) exactly zero bad blocks
>>   b) exactly one bad block
>>   c) more than one bad block
>> Of course, it is only possible to recover from b), but one *can* tell whether the situation is a) or b) or c) and act accordingly.
>
> It would seem that either you or Peter Anvin is mistaken. On page 9 of http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf at the end of section 4 it says: "Finally, as a word of caution it should be noted that RAID-6 by itself cannot even detect, never mind recover from, dual-disk corruption. If two disks are corrupt in the same byte positions, the above algorithm will in general introduce additional data corruption by corrupting a third drive."

The above a/b/c cases are not correct for raid6. While we can detect 0, 1 or 2 errors, any higher number of errors will be misidentified as one of these. The cases we will always see are:
  a) no errors - nothing to do
  b) one error - correct it
  c) two errors - report? take the raid down? recalc syndromes?
and any other case will always appear as *one* of these (not as [c]).

Case [c] is where different users will want to do different things. If my data is highly critical (would I really use raid6 here and not a higher redundancy level?) I could consider doing some investigation, e.g. pick each pair of disks in turn as the faulty ones, correct them and check that my data looks good (fsck? inspect the data visually?) until one pair choice gives good data.
Maybe OT: The quote saying two errors may not be detected is not how I understand ECC schemes to work. Does anyone have other papers that point this out?

Also, is it the case that the raid6 algorithm detects a failed disk (strip), or is it actually detecting failed bits, such that the correction is done to the whole stripe? In other words, values in all failed locations are fixed (when only 1-error cases are present) and not just in one strip. This means that we do not necessarily identify the bad disk, and neither do we need to.

--
Eyal Lebedinsky ([EMAIL PROTECTED])
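For what it's worth, hpa's caution can be reproduced in a toy byte-wise model (illustrative Python over GF(2^8) with the RAID-6 polynomial 0x11d and generator 2; not md's code): two corrupted bytes can produce P/Q syndromes identical to those of a single error at a third position, so a single-error "fixer" would then damage a third drive.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the RAID-6 polynomial 0x11d."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        high = a & 0x80
        a = (a << 1) & 0xFF
        if high:
            a ^= 0x1D
    return r

def calc_pq(data):
    """P (XOR) and Q (sum of g^i * D_i, g = 2) for one byte per disk."""
    P, Q, g = 0, 0, 1
    for d in data:
        P ^= d
        Q ^= gf_mul(g, d)
        g = gf_mul(g, 2)
    return P, Q

def locate_single(data, P, Q):
    """Assuming exactly one bad data byte, return its apparent position,
    or None if the syndromes fit no single-error hypothesis."""
    Pc, Qc = calc_pq(data)
    Sp, Sq = Pc ^ P, Qc ^ Q
    if Sp == 0:
        return None
    g = 1
    for k in range(len(data)):
        # a single error of magnitude Sp at disk k would give Sq = g^k * Sp
        if gf_mul(g, Sp) == Sq:
            return k
        g = gf_mul(g, 2)
    return None
```

With error magnitudes picked by hand so that 1*6 ^ 2*5 equals 4*(6 ^ 5) in GF(2^8), XOR-ing 6 into disk 0 and 5 into disk 1 makes the locator return disk 2: exactly the "corrupting a third drive" scenario in the quote.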
Re: raid6 check/repair
On Thursday November 22, [EMAIL PROTECTED] wrote:

> Dear Neil, thank you very much for your detailed answer.
>
> Neil Brown wrote:
>> While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 data blocks is wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong.
>
> If I'm not mistaken, this is only partly correct. Using P+Q redundancy, it *is* possible to distinguish three cases:
>   a) exactly zero bad blocks
>   b) exactly one bad block
>   c) more than one bad block
> Of course, it is only possible to recover from b), but one *can* tell whether the situation is a) or b) or c) and act accordingly.

It would seem that either you or Peter Anvin is mistaken. On page 9 of http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf at the end of section 4 it says: "Finally, as a word of caution it should be noted that RAID-6 by itself cannot even detect, never mind recover from, dual-disk corruption. If two disks are corrupt in the same byte positions, the above algorithm will in general introduce additional data corruption by corrupting a third drive."

> The point that I'm trying to make is, that there does exist a specific case, in which recovery is possible, and that implementing recovery for that case will not hurt in any way.

Assuming that is true (maybe hpa got it wrong), what specific conditions would lead to one drive having corrupt data, and would correcting it on an occasional 'repair' pass be an appropriate response? Does the value justify the cost of extra code complexity?

RAID is not designed to protect against bad RAM, bad cables, chipset bugs, driver bugs etc. It is only designed to protect against drive failure, where the drive failure is apparent, i.e. a read must return either the same data that was last written, or a failure indication. Anything else is beyond the design parameters for RAID.

> I'm taking a more pragmatic approach here.
> In my opinion, RAID should just protect my data, against drive failure, yes, of course, but if it can help me in case of occasional data corruption, I'd happily take that, too, especially if it doesn't cost extra... ;-)

Everything costs extra. Code uses bytes of memory, requires maintenance, and possibly introduces new bugs. I'm not convinced the failure mode that you are considering actually happens with a meaningful frequency.

NeilBrown
Re: raid6 check/repair
On Tuesday November 27, [EMAIL PROTECTED] wrote:

> Thiemo Nagel wrote:
>> Dear Neil, thank you very much for your detailed answer.
>>
>> Neil Brown wrote:
>>> While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 data blocks is wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong.
>>
>> If I'm not mistaken, this is only partly correct. Using P+Q redundancy, it *is* possible to distinguish three cases:
>>   a) exactly zero bad blocks
>>   b) exactly one bad block
>>   c) more than one bad block
>> Of course, it is only possible to recover from b), but one *can* tell whether the situation is a) or b) or c) and act accordingly.
>
> I was waiting for a response before saying "me too", but that's exactly the case: there is a class of failures, other than power failure or total device failure, which results in just the one identifiable bad sector. Given that the data needs to be read to realize that it is bad, why not go the extra inch and fix it properly instead of redoing the p+q, which just makes the problem invisible rather than fixing it. Obviously this is a subset of all the things which can go wrong, but I suspect it's a sizable subset.

Why do you think that it is a sizable subset? Disk drives have internal checksums which are designed to prevent corrupted data being returned. If the data is getting corrupted on some bus between the CPU and the media, then I suspect that your problem is big enough that RAID cannot meaningfully solve it, and new hardware, plus possibly a restore from backup, would be the only credible option.

NeilBrown
Re: raid6 check/repair
Thiemo Nagel wrote:

> Dear Neil, thank you very much for your detailed answer.
>
> Neil Brown wrote:
>> While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 data blocks is wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong.
>
> If I'm not mistaken, this is only partly correct. Using P+Q redundancy, it *is* possible to distinguish three cases:
>   a) exactly zero bad blocks
>   b) exactly one bad block
>   c) more than one bad block
> Of course, it is only possible to recover from b), but one *can* tell whether the situation is a) or b) or c) and act accordingly.

I was waiting for a response before saying "me too", but that's exactly the case: there is a class of failures, other than power failure or total device failure, which results in just the one identifiable bad sector. Given that the data needs to be read to realize that it is bad, why not go the extra inch and fix it properly instead of redoing the p+q, which just makes the problem invisible rather than fixing it. Obviously this is a subset of all the things which can go wrong, but I suspect it's a sizable subset.

--
Bill Davidsen [EMAIL PROTECTED]
"Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismarck
Re: raid6 check/repair
Dear Neil,

thank you very much for your detailed answer.

Neil Brown wrote:
> While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 data blocks are wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong.

If I'm not mistaken, this is only partly correct. Using P+Q redundancy, it *is* possible to distinguish three cases: a) exactly zero bad blocks, b) exactly one bad block, c) more than one bad block. Of course, it is only possible to recover from b), but one *can* tell whether the situation is a), b) or c), and act accordingly.

> As it is quite possible for a write to be aborted in the middle (during unexpected power down) with an unknown number of blocks in a given stripe updated but others not, we do not know how many blocks might be wrong, so we cannot try to recover some wrong block.

As already mentioned, in my opinion one can distinguish between 0, 1 and >1 bad blocks, and that is sufficient.

> Doing so would quite possibly corrupt a block that is not wrong.

I don't think additional corruption could be introduced, since recovery would only be done in the case of exactly one bad block.

[...]

> As I said above - there is no solution that works in all cases.

I fully agree. When more than one block is corrupted, and you don't know which are the corrupted blocks, you're lost.

> If more than one block is corrupt, and you don't know which ones, then you lose, and there is no way around that.

Sure. The point that I'm trying to make is that there does exist a specific case in which recovery is possible, and that implementing recovery for that case will not hurt in any way.

> RAID is not designed to protect against bad RAM, bad cables, chipset bugs, driver bugs etc. It is only designed to protect against drive failure, where the drive failure is apparent, i.e. a read must return either the same data that was last written, or a failure indication. Anything else is beyond the design parameters for RAID.

I'm taking a more pragmatic approach here. In my opinion, RAID should just protect my data: against drive failure, yes, of course, but if it can also help me in case of occasional data corruption, I'd happily take that, too, especially if it doesn't cost extra... ;-)

Kind regards,

Thiemo
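The recovery being argued for here can be sketched in a few lines. What follows is an illustrative Python model (not md kernel code) of the method in section 4 of HPA's raid6 paper, using the GF(2^8) arithmetic (polynomial 0x11d) that Linux raid6 uses; all function names are invented for this sketch.

```python
# GF(2^8) exp/log tables, generator g = 2, reduction polynomial 0x11d.
GF_EXP = [0] * 512
GF_LOG = [0] * 256
x = 1
for i in range(255):
    GF_EXP[i] = x
    GF_LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):          # double the table so products need no mod
    GF_EXP[i] = GF_EXP[i - 255]

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]

def compute_pq(data):
    """P is the XOR of the data bytes; Q weights data disk i by g**i."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(GF_EXP[i], d)
    return p, q

def diagnose(data, p, q):
    """Classify a stripe as clean, single-bad-data-block (with its index),
    bad P, bad Q, or inconsistent (more than one error)."""
    p2, q2 = compute_pq(data)
    dp, dq = p ^ p2, q ^ q2
    if dp == 0 and dq == 0:
        return ('clean', None)
    if dp == 0:                            # only Q disagrees: Q itself is bad
        return ('Q', None)
    if dq == 0:                            # only P disagrees: P itself is bad
        return ('P', None)
    z = (GF_LOG[dq] - GF_LOG[dp]) % 255    # solve g**z = dq/dp
    if z < len(data):
        return ('data', z)                 # repairable: data[z] ^= dp
    return ('multi', None)                 # no single disk explains dp, dq
```

Note the caveat behind Neil's objection: the diagnosis is only trustworthy under the assumption of at most one bad block. For example, the same byte flipped identically on two data disks cancels out in P and makes the stripe look like a bad Q, so a 'repair' acting on the diagnosis would silently rewrite good data.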
Re: raid6 check/repair
Dear Neil,

> > I have been looking a bit at the check/repair functionality in the raid6 personality. It seems that if an inconsistent stripe is found during 'repair', md does not try to determine which block is corrupt (using e.g. the method in section 4 of HPA's raid6 paper), but just recomputes the parity blocks - i.e. the same way as inconsistent raid5 stripes are handled. Correct?
>
> Correct! The most likely cause of parity being incorrect is a write to data + P + Q that was interrupted when one or two of those had been written but the others had not. No matter which was or was not written, recomputing P and Q will produce a 'correct' result, and it is simple. I really don't see any justification for being more clever.

My opinion about that is quite different. Speaking just for myself:

a) When I put my data on a RAID running on Linux, I'd expect the software to do everything which is possible to protect and, when necessary, to restore data integrity. (This expectation was one of the reasons why I chose software RAID with Linux.)

b) As a consequence of a): When I'm using a RAID level that has extra redundancy, I'd expect Linux to make use of that extra redundancy during a 'repair'. (Otherwise I'd consider 'repair' a misnomer and rather call it 'recalc parity'.)

c) Why should 'repair' be implemented in a way that only works in most cases, when there exists a solution that works in all cases? (After all, possibilities for corruption are many, e.g. bad RAM, bad cables, chipset bugs, driver bugs, last but not least human mistake. From all these errors I'd like to be able to recover gracefully, without putting the array at risk by removing and re-adding a component device.)

Bottom line: So far I was talking about *my* expectations; is it reasonable to assume that they are shared by others? Are there any arguments that I'm not aware of speaking against an improved implementation of 'repair'?

BTW: I just checked, and it's the same for RAID 1: when I intentionally corrupt a sector in the first device of a set of 16, 'repair' copies the corrupted data to the 15 remaining devices, instead of restoring the correct sector from one of the other fifteen devices to the first.

Thank you for your time.

Kind regards,

Thiemo Nagel
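For the RAID 1 case just described, the obvious alternative to "copy the first device over the rest" is a majority vote across the mirrors. A hypothetical sketch (this is not what md actually does):

```python
from collections import Counter

def majority_repair(copies):
    """Given the same sector as read from each mirror, return the majority
    content and the indices of the mirrors that disagree with it.
    Illustrative only: md's 'repair' simply propagates the first copy."""
    winner, _ = Counter(copies).most_common(1)[0]
    bad = [i for i, c in enumerate(copies) if c != winner]
    return winner, bad
```

With 16 mirrors and one corrupted sector, this would rewrite only the single disagreeing device; with only two mirrors there is no majority, and the approach degenerates to an arbitrary choice, which is presumably why md does not attempt it in general.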
Re: raid6 check/repair
On Wednesday November 21, [EMAIL PROTECTED] wrote:
> Dear Neil,
>
> > > I have been looking a bit at the check/repair functionality in the raid6 personality. It seems that if an inconsistent stripe is found during 'repair', md does not try to determine which block is corrupt (using e.g. the method in section 4 of HPA's raid6 paper), but just recomputes the parity blocks - i.e. the same way as inconsistent raid5 stripes are handled. Correct?
> >
> > Correct! The most likely cause of parity being incorrect is a write to data + P + Q that was interrupted when one or two of those had been written but the others had not. No matter which was or was not written, recomputing P and Q will produce a 'correct' result, and it is simple. I really don't see any justification for being more clever.
>
> My opinion about that is quite different. Speaking just for myself:
>
> a) When I put my data on a RAID running on Linux, I'd expect the software to do everything which is possible to protect and, when necessary, to restore data integrity. (This expectation was one of the reasons why I chose software RAID with Linux.)

Yes, of course. "Possible" is an important aspect of this.

> b) As a consequence of a): When I'm using a RAID level that has extra redundancy, I'd expect Linux to make use of that extra redundancy during a 'repair'. (Otherwise I'd consider 'repair' a misnomer and rather call it 'recalc parity'.)

The extra redundancy in RAID6 is there to enable you to survive two drive failures. Nothing more.

While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 data blocks are wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong.

As it is quite possible for a write to be aborted in the middle (during unexpected power down) with an unknown number of blocks in a given stripe updated but others not, we do not know how many blocks might be wrong, so we cannot try to recover some wrong block. Doing so would quite possibly corrupt a block that is not wrong.

The repair process repairs the parity (redundancy information). It does not repair the data. It cannot.

The only possible scenario that md/raid recognises for the parity information being wrong is the case of an unexpected shutdown in the middle of a stripe write, where some blocks have been written and some have not. Further (for raid 4/5/6), it only supports this case when your array is not degraded. If you have a degraded array, then an unexpected shutdown is potentially fatal to your data (the chance of it actually being fatal is quite small, but the potential is still there). There is nothing RAID can do about this. It is not designed to protect against power failure. It is designed to protect against drive failure. It does that quite well.

If you have wrong data appearing on your device for some other reason, then you have a serious hardware problem and RAID cannot help you. The best approach to dealing with data on drives getting spontaneously corrupted is for the filesystem to perform strong checksums on the data blocks, and store the checksums in the indexing information. This provides detection, not recovery, of course.

> c) Why should 'repair' be implemented in a way that only works in most cases, when there exists a solution that works in all cases? (After all, possibilities for corruption are many, e.g. bad RAM, bad cables, chipset bugs, driver bugs, last but not least human mistake. From all these errors I'd like to be able to recover gracefully, without putting the array at risk by removing and re-adding a component device.)

As I said above - there is no solution that works in all cases. If more than one block is corrupt, and you don't know which ones, then you lose, and there is no way around that.

RAID is not designed to protect against bad RAM, bad cables, chipset bugs, driver bugs etc. It is only designed to protect against drive failure, where the drive failure is apparent, i.e. a read must return either the same data that was last written, or a failure indication. Anything else is beyond the design parameters for RAID.

It might be possible to design a data storage system that was resilient to these sorts of errors. It would be much more sophisticated than RAID, though.

NeilBrown

> Bottom line: So far I was talking about *my* expectations; is it reasonable to assume that they are shared by others? Are there any arguments that I'm not aware of speaking against an improved implementation of 'repair'?
>
> BTW: I just checked, and it's the same for RAID 1: when I intentionally corrupt a sector in the first device of a set of 16, 'repair' copies the corrupted data to the 15 remaining devices, instead of restoring the correct sector from one of the other fifteen devices to the first.
>
> Thank you for your time.
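Neil's suggestion of filesystem-level checksums can be illustrated with a toy model. The dictionaries below stand in for a filesystem's data blocks and indexing structures; the function names are invented for this sketch.

```python
import hashlib

def write_block(store, index, blockno, data):
    """Store a block and record its checksum in the 'indexing' structure."""
    store[blockno] = data
    index[blockno] = hashlib.sha256(data).digest()

def read_block(store, index, blockno):
    """Return the block, or raise if it no longer matches its checksum.
    Detection only: nothing here can reconstruct the original data."""
    data = store[blockno]
    if hashlib.sha256(data).digest() != index[blockno]:
        raise IOError("checksum mismatch on block %d" % blockno)
    return data
```

The point of storing the checksum in the indexing information rather than next to the data is that a misdirected or torn write to the data block cannot simultaneously update the checksum, so the corruption is caught on the next read.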
Re: raid6 check/repair
On Thursday November 15, [EMAIL PROTECTED] wrote:
> Hi,
>
> I have been looking a bit at the check/repair functionality in the raid6 personality. It seems that if an inconsistent stripe is found during 'repair', md does not try to determine which block is corrupt (using e.g. the method in section 4 of HPA's raid6 paper), but just recomputes the parity blocks - i.e. the same way as inconsistent raid5 stripes are handled. Correct?

Correct!

The most likely cause of parity being incorrect is a write to data + P + Q that was interrupted when one or two of those had been written but the others had not. No matter which was or was not written, recomputing P and Q will produce a 'correct' result, and it is simple.

I really don't see any justification for being more clever.

NeilBrown