Re: raid6 check/repair
Dear Neil,

this thread has died out, but I'd prefer not to let it end without any kind of result being reached. Therefore, I'm kindly asking you to draw a conclusion from the arguments that have been exchanged. Concerning the implementation of a 'repair' that can actually recover data in some cases instead of just recalculating parity: do you
a) oppose it (patches not accepted)
b) not care (but potentially accept patches)
c) support it?

Thank you very much and kind regards,

Thiemo Nagel

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid6 check/repair
On Wed, Dec 05, 2007 at 03:31:14PM -0500, Bill Davidsen wrote:

> BTW: if this can be done in a user program, mdadm, rather than by code in the kernel, that might well make everyone happy. Okay, realistically less unhappy.

I'm starting to like the idea. Of course you can't repair a running array from user space (just think of something rewriting the full stripe while mdadm is trying to fix the old data - you could end up with the data disks containing the new data but the "fixed" disks rewritten with the old data). We would just need to make the kernel not try to fix anything but merely report that something is wrong - but wait, using 'check' instead of 'repair' does that already. So the kernel is fine as it is; we just need a simple user-space utility that can take the components of a non-running array and repair a given stripe using whatever method is appropriate. Shouldn't be too hard to write for anyone interested...

Gabor

--
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
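A first cut at such a tool could be as small as the sketch below, which checks one stripe of a stopped array from its component images. To keep it hedged: the chunk size, the flat layout (no per-stripe parity rotation, RAID-5-style XOR only) and the file paths are illustrative assumptions; a real utility would read the md superblocks for the actual geometry and handle Q parity as well.

```python
CHUNK = 64 * 1024  # assumed chunk size; real arrays record this in the superblock

def read_chunk(path, stripe, chunk=CHUNK):
    """Read one chunk of a component image at the given stripe index."""
    with open(path, 'rb') as f:
        f.seek(stripe * chunk)
        return f.read(chunk)

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def check_stripe(component_paths, stripe):
    """XOR-parity consistency test: the XOR of all data chunks plus the
    parity chunk must be all zeroes if the stripe is consistent."""
    chunks = [read_chunk(p, stripe) for p in component_paths]
    return xor_blocks(chunks) == bytes(len(chunks[0]))
```

Run it against the members of a *stopped* array only, for exactly the reason given above: a concurrent writer would race the repair.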
Re: raid6 check/repair
On 15:31, Bill Davidsen wrote:

> Thiemo posted metacode which appears correct to me,

It assumes that _exactly_ one disk has bad data, which is hard to verify in practice. But yes, it's probably the best one can do if both P and Q happen to be incorrect. IMHO mdadm shouldn't do this automatically, though, and it should always keep backup copies of the data it overwrites with supposedly good data.

Andre

--
The only person who always got his work done by Friday was Robinson Crusoe
Re: raid6 check/repair
Peter Grandi wrote:

> [ ... on RAID1, ... RAID6 error recovery ... ]
>
> tn> The use case for the proposed 'repair' would be occasional, low-frequency corruption, for which many sources can be imagined: Any piece of hardware has a certain failure rate, which may depend on things like age, temperature, stability of operating voltage, cosmic rays, etc., but also on variations in the production process. Therefore, hardware may suffer from infrequent glitches, which are seldom enough to be impossible to trace back to a particular piece of equipment. It would be nice to recover gracefully from that.
>
> What has this got to do with RAID6 or RAID in general? I have been following this discussion with a sense of bewilderment, as I have started to suspect that parts of it are based on a very large misunderstanding.
>
> tn> Kernel bugs or just plain administrator mistakes are another thing.
>
> The biggest administrator mistakes are lack of end-to-end checking and of backups. Those that don't have them wish their storage systems could detect and recover from arbitrary and otherwise undetected errors (but see below for bad news on silent corruptions).
>
> tn> But also the case of power-loss during writing that you have mentioned could profit from that 'repair': With heterogeneous hardware, blocks may be written in unpredictable order, so that in more cases graceful recovery would be possible with 'repair' compared to just recalculating parity.
>
> Redundant RAID levels are designed to recover only from _reported_ errors that identify precisely where the error is. Recovering from random block writing seems to me quite outside the scope of a low-level virtual storage device layer.
>
> ms> I just want to give another suggestion. It may or may not be possible to repair inconsistent arrays but in either way some code there MUST at least warn the administrator that something (may) went wrong.
>
> tn> Agreed.
> That sounds instead quite extraordinary to me because it is not clear how to define ''inconsistency'' in the general case, never mind detect it reliably, and never mind knowing, when it is found, how to determine which are the good data bits and which are the bad. Now I am starting to think that this discussion is based on the curious assumption that storage subsystems should solve the so-called ''byzantine generals'' problem, that is to operate reliably in the presence of unreliable communications and storage.

I had missed that. In fact, after rereading most of the thread I *still* miss that, so perhaps it's not there. What the OP proposed was that in the case where there is incorrect data on exactly one chunk in a raid-6 slice, the incorrect chunk be identified and rewritten with correct data. This is based on the assumptions that (a) this case can be identified, (b) the correct data value for the chunk can be calculated, (c) this only adds processing or i/o overhead when an error condition is identified by the existing code, and (d) this can be done without significant additional i/o other than rewriting the corrected data.

Given these assumptions, the reasons for not adding this logic would seem to be (a) one of the assumptions is wrong, (b) it would take a huge effort to code or maintain, or (c) it's wrong for raid to fix errors other than hardware errors, even if it could do so. Although I've looked at the logic in metacode form, and at the code for doing the check now, I realize that the assumptions could be wrong, and invite enlightenment. But Thiemo posted metacode which appears correct to me, so I don't think it's a huge job to code, and since it is in a code path which currently always hides an error, it's hard to understand how added code could make things worse than they are.
I can actually see the philosophical argument about handling only disk errors in raid code, but at least it should be a clear decision made for that reason, and not hidden behind arguments that this happens rarely. Given the state of current hardware, I think virtually all errors happen rarely; the problem is that all problems happen occasionally (ref. Murphy's Law). We have a tool (check) which finds these problems, why not a tool to fix them?

BTW: if this can be done in a user program, mdadm, rather than by code in the kernel, that might well make everyone happy. Okay, realistically less unhappy.

--
Bill Davidsen [EMAIL PROTECTED]
"Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismarck
Re: raid6 check/repair
[EMAIL PROTECTED] (Peter Grandi) writes:

> ms> I just want to give another suggestion. It may or may not be possible to repair inconsistent arrays but in either way some code there MUST at least warn the administrator that something (may) went wrong.
>
> tn> Agreed.
>
> That sounds instead quite extraordinary to me because it is not clear how to define ''inconsistency'' in the general case, never mind detect it reliably, and never mind knowing, when it is found, how to determine which are the good data bits and which are the bad.

I don't quite follow you. Having a basic consistency check utility for a raid array is to me as obvious as having an fsck utility for a file system.

> Now I am starting to think that this discussion is based on the curious assumption that storage subsystems should solve the so-called ''byzantine generals'' problem, that is to operate reliably in the presence of unreliable communications and storage.

I don't think anyone is proposing to solve that problem. However, an occasional slight nod in acknowledgment of the fact that real world communications and storage *are* unreliable wouldn't be out of place.

--
Leif Nixon - Systems expert - National Supercomputer Centre - Linkoping University
Re: raid6 check/repair
[ ... on RAID1, ... RAID6 error recovery ... ]

tn> The use case for the proposed 'repair' would be occasional, low-frequency corruption, for which many sources can be imagined: Any piece of hardware has a certain failure rate, which may depend on things like age, temperature, stability of operating voltage, cosmic rays, etc., but also on variations in the production process. Therefore, hardware may suffer from infrequent glitches, which are seldom enough to be impossible to trace back to a particular piece of equipment. It would be nice to recover gracefully from that.

What has this got to do with RAID6 or RAID in general? I have been following this discussion with a sense of bewilderment, as I have started to suspect that parts of it are based on a very large misunderstanding.

tn> Kernel bugs or just plain administrator mistakes are another thing.

The biggest administrator mistakes are lack of end-to-end checking and of backups. Those that don't have them wish their storage systems could detect and recover from arbitrary and otherwise undetected errors (but see below for bad news on silent corruptions).

tn> But also the case of power-loss during writing that you have mentioned could profit from that 'repair': With heterogeneous hardware, blocks may be written in unpredictable order, so that in more cases graceful recovery would be possible with 'repair' compared to just recalculating parity.

Redundant RAID levels are designed to recover only from _reported_ errors that identify precisely where the error is. Recovering from random block writing seems to me quite outside the scope of a low-level virtual storage device layer.

ms> I just want to give another suggestion. It may or may not be possible to repair inconsistent arrays but in either way some code there MUST at least warn the administrator that something (may) went wrong.

tn> Agreed.
That sounds instead quite extraordinary to me because it is not clear how to define ''inconsistency'' in the general case, never mind detect it reliably, and never mind knowing, when it is found, how to determine which are the good data bits and which are the bad. Now I am starting to think that this discussion is based on the curious assumption that storage subsystems should solve the so-called ''byzantine generals'' problem, that is to operate reliably in the presence of unreliable communications and storage.

ms> I had an issue once where the chipset / mainboard was broken so on one raid1 array different data was written to the disks occasionally [ ... ]

Indeed. Some links from a web search:

  http://en.Wikipedia.org/wiki/Byzantine_Fault_Tolerance
  http://pages.CS.Wisc.edu/~sschang/OS-Qual/reliability/byzantine.htm
  http://research.Microsoft.com/users/lamport/pubs/byz.pdf

ms> and linux-raid / mdadm did not complain or do anything.

The mystic version of Linux-RAID is in psi-test right now :-). To me RAID does not seem the right abstraction level to deal with this problem; and perhaps the file system level is not either, even if ZFS tries to address some of the problem. However there are ominous signs that the storage version of the Byzantine generals problem is happening in particularly nasty forms. For example as reported in this very, very scary paper:

  https://InDiCo.DESY.DE/contributionDisplay.py?contribId=65&sessionId=42&confId=257

where some of the causes have apparently been identified recently, see slides 11, 12 and 13:

  http://InDiCo.FNAL.gov/contributionDisplay.py?contribId=44&sessionId=15&confId=805

So I guess that end-to-end verification will have to become more common, but which form it will take is not clear (I always use a checksummed container format for important long term data).
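The "checksummed container" habit can be approximated with something as plain as a digest manifest. This sketch is illustrative only (SHA-256 is my choice here, not anything Peter names): it records file digests once and later reports files whose contents have silently changed, which is exactly the class of corruption RAID parity never sees.

```python
import hashlib

def file_digest(path):
    """SHA-256 of a file, streamed so large files need not fit in RAM."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(65536), b''):
            h.update(block)
    return h.hexdigest()

def make_manifest(paths):
    """Record the current digest of every file."""
    return {p: file_digest(p) for p in paths}

def verify_manifest(manifest):
    """Return the paths whose current contents no longer match the manifest."""
    return [p for p, digest in manifest.items() if file_digest(p) != digest]
```

Verification like this is end-to-end in the sense that it does not matter whether the bits rotted on the platter, on the bus, or in a buggy driver: any change since the manifest was written is caught.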
Re: raid6 check/repair
On Tue, 4 Dec 2007, Peter Grandi wrote:

> ms> and linux-raid / mdadm did not complain or do anything.
>
> The mystic version of Linux-RAID is in psi-test right now :-). To me RAID does not seem the right abstraction level to deal with this problem; and perhaps the file system level is not either, even if ZFS tries to address some of the problem.

Hm. If I run a check on a raid1, I would expect it to read data from both disks and compare them, and complain if they're not identical. Are you sure you really mean what you're saying here?

I do realise that if the corruption happens above the raid layer then there is nothing we can do, but if md asks to write a block to two raid1 disks and the system corrupts the write so that different data lands on the two drives, then when md does a check at a later time and discovers this, it should scream bloody murder, choose one copy of the data and replicate it to the other drive...? I know this might as well be the wrong data, and md can't figure that out, but it should correct the *raid1* inconsistency, which I think is what the person you replied to meant?

--
Mikael Abrahamsson  email: [EMAIL PROTECTED]
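What Mikael describes for a raid1 'check' can be sketched in a few lines of user-space Python over two member images (the block size and the per-offset reporting are illustrative assumptions; the in-kernel check only bumps a mismatch count in sysfs):

```python
BLOCK = 4096  # assumed comparison granularity

def find_mismatches(path_a, path_b, block=BLOCK):
    """Read both mirror images block by block and return the byte
    offsets of blocks whose contents differ."""
    offsets = []
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        pos = 0
        while True:
            a = fa.read(block)
            b = fb.read(block)
            if not a and not b:
                break
            if a != b:
                offsets.append(pos)
            pos += block
    return offsets
```

Repairing would then be the "choose one copy" step: overwrite the block on one member with the copy from the other, knowing full well that md has no way to tell which copy is right.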
Re: raid6 check/repair
Dear Michael,

Michael Schmitt wrote:

> Hi folks,

Probably erroneously, you have sent this mail only to me, not to the list...

> I just want to give another suggestion. It may or may not be possible to repair inconsistent arrays but in either way some code there MUST at least warn the administrator that something (may) went wrong.

Agreed.

> I don't know if linux-raid is the right code to implement this, but I think it is the most obvious place to implement it I guess.

It's on the todo-list, I think, judging from this Debian bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=405919

Kind regards,

Thiemo Nagel

> I thought this suggestion was once noted in this thread but I am not sure and I did not find it anymore, so please bear with me if I wrote it again. I had an issue once where the chipset / mainboard was broken, so on one raid1 array different data was written to the disks occasionally, and linux-raid / mdadm did not complain or do anything. Just by coincidence I found this issue at the time. I did report this here in February too.
Re: raid6 check/repair
Dear Neil,

>> The point that I'm trying to make is, that there does exist a specific case, in which recovery is possible, and that implementing recovery for that case will not hurt in any way.
>
> Assuming that is true (maybe hpa got it wrong), what specific conditions would lead to one drive having corrupt data, and would correcting it on an occasional 'repair' pass be an appropriate response?

The use case for the proposed 'repair' would be occasional, low-frequency corruption, for which many sources can be imagined: Any piece of hardware has a certain failure rate, which may depend on things like age, temperature, stability of operating voltage, cosmic rays, etc., but also on variations in the production process. Therefore, hardware may suffer from infrequent glitches, which are seldom enough to be impossible to trace back to a particular piece of equipment. It would be nice to recover gracefully from that. Kernel bugs or just plain administrator mistakes are another thing. But also the case of power-loss during writing that you have mentioned could profit from that 'repair': With heterogeneous hardware, blocks may be written in unpredictable order, so that in more cases graceful recovery would be possible with 'repair' compared to just recalculating parity.

> Does the value justify the cost of extra code complexity?

In the case of protecting data integrity, I'd say 'yes'.

> Everything costs extra. Code uses bytes of memory, requires maintenance, and possibly introduces new bugs.

Of course, you are right. However, in my other email, I tried to sketch a piece of code which is very lean, as it makes use of functions which I assume to exist. (Sorry, I haven't looked at the md code yet, so please correct me if I'm wrong.) Therefore I assume the costs in memory, maintenance and bugs to be rather low.
Kind regards,

Thiemo
Re: raid6 check/repair
Dear Neil and Eyal,

Eyal Lebedinsky wrote:

> Neil Brown wrote:
>> It would seem that either you or Peter Anvin is mistaken. On page 9 of http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf at the end of section 4 it says: "Finally, as a word of caution it should be noted that RAID-6 by itself cannot even detect, never mind recover from, dual-disk corruption. If two disks are corrupt in the same byte positions, the above algorithm will in general introduce additional data corruption by corrupting a third drive."
>
> The above a/b/c cases are not correct for raid6. While we can detect 0, 1 or 2 errors, any higher number of errors will be misidentified as one of these. The cases we will always see are:
>   a) no errors - nothing to do
>   b) one error - correct it
>   c) two errors - report? take the raid down? recalc syndromes?
> and any other case will always appear as *one* of these (not as [c]).

I still don't agree. I'll explain the algorithm for error handling that I have in mind; maybe you can point out if I'm mistaken at some point. We have n data blocks D1...Dn and two parities P (XOR) and Q (Reed-Solomon).
I assume the existence of two functions to calculate the parities

  P = calc_P(D1, ..., Dn)
  Q = calc_Q(D1, ..., Dn)

and two functions to recover a missing data block Dx using either parity

  Dx = recover_P(x, D1, ..., Dx-1, Dx+1, ..., Dn, P)
  Dx = recover_Q(x, D1, ..., Dx-1, Dx+1, ..., Dn, Q)

This pseudo-code should distinguish between a), b) and c) and properly repair case b):

  P' = calc_P(D1, ..., Dn);
  Q' = calc_Q(D1, ..., Dn);

  if (P' == P && Q' == Q) {
      /* case a): zero errors */
      return;
  }
  if (P' == P && Q' != Q) {
      /* case b1): Q is bad, can be fixed */
      Q = Q';
      return;
  }
  if (P' != P && Q' == Q) {
      /* case b2): P is bad, can be fixed */
      P = P';
      return;
  }

  /* both parities are bad, so we try whether the problem can be
     fixed by repairing a data block */
  for (i = 1; i <= n; i++) {
      /* assume only Di is bad, use P parity to repair */
      D' = recover_P(i, D1, ..., Di-1, Di+1, ..., Dn, P);
      /* use Q parity to check the assumption */
      Q' = calc_Q(D1, ..., Di-1, D', Di+1, ..., Dn);
      if (Q == Q') {
          /* case b3): Q parity is ok, which means the assumption
             was correct and we can fix the problem */
          Di = D';
          return;
      }
  }

  /* case c): when we get here, we have excluded cases a) and b),
     so now we really have a problem */
  report_unrecoverable_error();
  return;

Concerning misidentification: A situation can be imagined in which two or more simultaneous corruptions have occurred in a very special way, so that case b3) is diagnosed accidentally. While that is not impossible, I'd assume the probability of it to be negligible, comparable to that of undetectable corruption in a RAID 5 setup.

Kind regards,

Thiemo
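To make the pseudo-code concrete, here is a hedged byte-wise rendering in Python: one byte per data disk, arithmetic in GF(2^8) with the usual RAID-6 generator g = 2 and polynomial 0x11d. All names and the toy one-byte-per-disk model are illustrative (this is not the md code); as in the metacode, a repair is attempted only under the assumption that exactly one block is bad.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the RAID-6 polynomial 0x11d."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        high = a & 0x80
        a = (a << 1) & 0xFF
        if high:
            a ^= 0x1D  # reduce by x^8 + x^4 + x^3 + x^2 + 1
    return r

def calc_P(data):
    """XOR parity of all data bytes."""
    p = 0
    for d in data:
        p ^= d
    return p

def calc_Q(data):
    """Reed-Solomon syndrome: sum of g^i * D_i with g = 2."""
    q, g = 0, 1
    for d in data:
        q ^= gf_mul(g, d)
        g = gf_mul(g, 2)
    return q

def check_repair(data, P, Q):
    """Distinguish cases a), b1), b2), b3) and c) from the pseudo-code;
    repairs data in place for case b3) and returns what was found."""
    Pc, Qc = calc_P(data), calc_Q(data)
    if Pc == P and Qc == Q:
        return ('ok',)                      # case a): zero errors
    if Pc == P and Qc != Q:
        return ('bad_Q', Qc)                # case b1): rewrite Q := Qc
    if Pc != P and Qc == Q:
        return ('bad_P', Pc)                # case b2): rewrite P := Pc
    # Both parities mismatch: assume each data block in turn is the bad one.
    for i in range(len(data)):
        # recover_P: the value that makes the XOR parity come out right
        candidate = data[i] ^ Pc ^ P
        fixed = data[:i] + [candidate] + data[i + 1:]
        if calc_Q(fixed) == Q:              # Q confirms the assumption: case b3)
            data[i] = candidate
            return ('bad_data', i)
    return ('unrecoverable',)               # case c)
```

The loop mirrors the metacode directly; a production version would instead locate the bad disk in one step from the syndrome ratio (Q'^Q)/(P'^P) rather than re-deriving Q n times.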
Re: raid6 check/repair
Neil Brown wrote:

> On Thursday November 22, [EMAIL PROTECTED] wrote:
>
>> Dear Neil, thank you very much for your detailed answer.
>>
>> Neil Brown wrote:
>>> While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 data blocks is wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong.
>>
>> If I'm not mistaken, this is only partly correct. Using P+Q redundancy, it *is* possible to distinguish three cases:
>>   a) exactly zero bad blocks
>>   b) exactly one bad block
>>   c) more than one bad block
>> Of course, it is only possible to recover from b), but one *can* tell whether the situation is a) or b) or c) and act accordingly.
>
> It would seem that either you or Peter Anvin is mistaken. On page 9 of http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf at the end of section 4 it says: "Finally, as a word of caution it should be noted that RAID-6 by itself cannot even detect, never mind recover from, dual-disk corruption. If two disks are corrupt in the same byte positions, the above algorithm will in general introduce additional data corruption by corrupting a third drive."
>
>> The point that I'm trying to make is, that there does exist a specific case, in which recovery is possible, and that implementing recovery for that case will not hurt in any way.
>
> Assuming that is true (maybe hpa got it wrong), what specific conditions would lead to one drive having corrupt data, and would correcting it on an occasional 'repair' pass be an appropriate response? Does the value justify the cost of extra code complexity?
>
> RAID is not designed to protect against bad RAM, bad cables, chipset bugs, driver bugs etc. It is only designed to protect against drive failure, where the drive failure is apparent, i.e. a read must return either the same data that was last written, or a failure indication. Anything else is beyond the design parameters for RAID.
>> I'm taking a more pragmatic approach here. In my opinion, RAID should just protect my data, against drive failure, yes, of course, but if it can help me in case of occasional data corruption, I'd happily take that, too, especially if it doesn't cost extra... ;-)
>
> Everything costs extra. Code uses bytes of memory, requires maintenance, and possibly introduces new bugs. I'm not convinced the failure mode that you are considering actually happens with a meaningful frequency.

People accept the hardware and performance costs of raid-6 in return for the better security of their data. If I run a check and find that I have an error, right now I have to treat that the same way as an unrecoverable failure, because the repair function doesn't fix the data, it just makes the symptom go away by redoing the p and q values. This makes the naive user think the problem is solved, when in fact it's now worse: he has corrupt data with no indication of a problem. The fact that (most) people who read this list are advanced enough to understand the issue does not protect the majority of users from their ignorance. If that sounds elitist, many of the people on this list are the elite, and even knowing that you need to learn and understand more is a big plus in my book. It's the people who run repair and assume the problem is fixed who get hurt by the current behavior.

If you won't fix the recoverable case by recovering, then maybe for raid-6 you could print an error message like "can't recover data, fix parity and hide the problem (y/N)?" or require a --force flag, and at least give a heads-up to the people who just picked the most reliable raid level because they're trying to do it right, but need a clue that they have a real and serious problem which a repair can't fix. Recovering a filesystem full of plain files is pretty easy; that's what backups with CRC are for. But a large database recovery often takes hours to restore and run journal files.
I personally consider it the job of the kernel to do recovery when it is possible; absent that, I would like the tools to tell me clearly that I have a problem and what it is.

--
Bill Davidsen [EMAIL PROTECTED]
"Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismarck
Re: raid6 check/repair
Neil Brown wrote:

> On Thursday November 22, [EMAIL PROTECTED] wrote:
>
>> Dear Neil, thank you very much for your detailed answer.
>>
>> Neil Brown wrote:
>>> While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 data blocks is wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong.
>>
>> If I'm not mistaken, this is only partly correct. Using P+Q redundancy, it *is* possible to distinguish three cases:
>>   a) exactly zero bad blocks
>>   b) exactly one bad block
>>   c) more than one bad block
>> Of course, it is only possible to recover from b), but one *can* tell whether the situation is a) or b) or c) and act accordingly.
>
> It would seem that either you or Peter Anvin is mistaken. On page 9 of http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf at the end of section 4 it says: "Finally, as a word of caution it should be noted that RAID-6 by itself cannot even detect, never mind recover from, dual-disk corruption. If two disks are corrupt in the same byte positions, the above algorithm will in general introduce additional data corruption by corrupting a third drive."

The above a/b/c cases are not correct for raid6. While we can detect 0, 1 or 2 errors, any higher number of errors will be misidentified as one of these. The cases we will always see are:
  a) no errors - nothing to do
  b) one error - correct it
  c) two errors - report? take the raid down? recalc syndromes?
and any other case will always appear as *one* of these (not as [c]).

Case [c] is where different users will want to do different things. If my data is highly critical (would I really use raid6 here and not a higher redundancy level?) I could consider doing some investigation, e.g. pick each pair of disks in turn as the faulty ones, correct them and check that my data looks good (fsck? inspect the data visually?) until one pair choice gives good data.
Maybe OT: The quote saying two errors may not be detected is not how I understand ECC schemes to work. Does anyone have other papers that point this out?

Also, is it the case that the raid6 algorithm detects a failed disk (strip), or is it actually detecting failed bits, such that the correction is done to the whole stripe? In other words, values in all failed locations are fixed (when only 1-error cases are present) and not just in one strip. This means that we do not necessarily identify the bad disk, and neither do we need to.

--
Eyal Lebedinsky ([EMAIL PROTECTED])
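For what it's worth, hpa's caution can be reproduced in a toy byte-wise model (illustrative Python over GF(2^8) with the RAID-6 polynomial 0x11d and generator 2; not md's code): two corrupted bytes can produce P/Q syndromes identical to those of a single error at a third position, so a single-error "fixer" would then damage a third drive.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the RAID-6 polynomial 0x11d."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        high = a & 0x80
        a = (a << 1) & 0xFF
        if high:
            a ^= 0x1D
    return r

def calc_pq(data):
    """P (XOR) and Q (sum of g^i * D_i, g = 2) for one byte per disk."""
    P, Q, g = 0, 0, 1
    for d in data:
        P ^= d
        Q ^= gf_mul(g, d)
        g = gf_mul(g, 2)
    return P, Q

def locate_single(data, P, Q):
    """Assuming exactly one bad data byte, return its apparent position,
    or None if the syndromes fit no single-error hypothesis."""
    Pc, Qc = calc_pq(data)
    Sp, Sq = Pc ^ P, Qc ^ Q
    if Sp == 0:
        return None
    g = 1
    for k in range(len(data)):
        # a single error of magnitude Sp at disk k would give Sq = g^k * Sp
        if gf_mul(g, Sp) == Sq:
            return k
        g = gf_mul(g, 2)
    return None
```

With error magnitudes picked by hand so that 1*6 ^ 2*5 equals 4*(6 ^ 5) in GF(2^8), XOR-ing 6 into disk 0 and 5 into disk 1 makes the locator return disk 2: exactly the "corrupting a third drive" scenario in the quote.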
Re: raid6 check/repair
On Thursday November 22, [EMAIL PROTECTED] wrote:

> Dear Neil, thank you very much for your detailed answer.
>
> Neil Brown wrote:
>> While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 data blocks is wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong.
>
> If I'm not mistaken, this is only partly correct. Using P+Q redundancy, it *is* possible to distinguish three cases:
>   a) exactly zero bad blocks
>   b) exactly one bad block
>   c) more than one bad block
> Of course, it is only possible to recover from b), but one *can* tell whether the situation is a) or b) or c) and act accordingly.

It would seem that either you or Peter Anvin is mistaken. On page 9 of http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf at the end of section 4 it says: "Finally, as a word of caution it should be noted that RAID-6 by itself cannot even detect, never mind recover from, dual-disk corruption. If two disks are corrupt in the same byte positions, the above algorithm will in general introduce additional data corruption by corrupting a third drive."

> The point that I'm trying to make is, that there does exist a specific case, in which recovery is possible, and that implementing recovery for that case will not hurt in any way.

Assuming that is true (maybe hpa got it wrong), what specific conditions would lead to one drive having corrupt data, and would correcting it on an occasional 'repair' pass be an appropriate response? Does the value justify the cost of extra code complexity?

RAID is not designed to protect against bad RAM, bad cables, chipset bugs, driver bugs etc. It is only designed to protect against drive failure, where the drive failure is apparent, i.e. a read must return either the same data that was last written, or a failure indication. Anything else is beyond the design parameters for RAID.

> I'm taking a more pragmatic approach here.
> In my opinion, RAID should just protect my data, against drive failure, yes, of course, but if it can help me in case of occasional data corruption, I'd happily take that, too, especially if it doesn't cost extra... ;-)

Everything costs extra. Code uses bytes of memory, requires maintenance, and possibly introduces new bugs. I'm not convinced the failure mode that you are considering actually happens with a meaningful frequency.

NeilBrown
Re: raid6 check/repair
On Tuesday November 27, [EMAIL PROTECTED] wrote:

> Thiemo Nagel wrote:
>> Dear Neil, thank you very much for your detailed answer.
>>
>> Neil Brown wrote:
>>> While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 data blocks is wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong.
>>
>> If I'm not mistaken, this is only partly correct. Using P+Q redundancy, it *is* possible to distinguish three cases:
>>   a) exactly zero bad blocks
>>   b) exactly one bad block
>>   c) more than one bad block
>> Of course, it is only possible to recover from b), but one *can* tell whether the situation is a) or b) or c) and act accordingly.
>
> I was waiting for a response before saying "me too", but that's exactly the case: there is a class of failures, other than power failure or total device failure, which results in just the one identifiable bad sector. Given that the data needs to be read to realize that it is bad, why not go the extra inch and fix it properly instead of redoing the p+q, which just makes the problem invisible rather than fixing it. Obviously this is a subset of all the things which can go wrong, but I suspect it's a sizable subset.

Why do you think that it is a sizable subset? Disk drives have internal checksums which are designed to prevent corrupted data being returned. If the data is getting corrupted on some bus between the CPU and the media, then I suspect that your problem is big enough that RAID cannot meaningfully solve it, and new hardware, plus possibly a restore from backup, would be the only credible option.

NeilBrown
Re: raid6 check/repair
Thiemo Nagel wrote:

> Dear Neil, thank you very much for your detailed answer.
>
> Neil Brown wrote:
>> While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 data blocks is wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong.
>
> If I'm not mistaken, this is only partly correct. Using P+Q redundancy, it *is* possible to distinguish three cases:
>   a) exactly zero bad blocks
>   b) exactly one bad block
>   c) more than one bad block
> Of course, it is only possible to recover from b), but one *can* tell whether the situation is a) or b) or c) and act accordingly.

I was waiting for a response before saying "me too", but that's exactly the case: there is a class of failures, other than power failure or total device failure, which results in just the one identifiable bad sector. Given that the data needs to be read to realize that it is bad, why not go the extra inch and fix it properly instead of redoing the p+q, which just makes the problem invisible rather than fixing it. Obviously this is a subset of all the things which can go wrong, but I suspect it's a sizable subset.

--
Bill Davidsen [EMAIL PROTECTED]
"Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismarck
Re: raid6 check/repair
Dear Neil,

thank you very much for your detailed answer.

Neil Brown wrote:
> While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 data blocks are wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong.

If I'm not mistaken, this is only partly correct. Using P+Q redundancy, it *is* possible to distinguish three cases: a) exactly zero bad blocks, b) exactly one bad block, c) more than one bad block. Of course, it is only possible to recover from b), but one *can* tell whether the situation is a), b) or c), and act accordingly.

> As it is quite possible for a write to be aborted in the middle (during unexpected power down) with an unknown number of blocks in a given stripe updated but others not, we do not know how many blocks might be wrong, so we cannot try to recover some wrong block.

As already mentioned, in my opinion one can distinguish between 0, 1 and >1 bad blocks, and that is sufficient.

> Doing so would quite possibly corrupt a block that is not wrong.

I don't think additional corruption could be introduced, since recovery would only be done in the case of exactly one bad block.

[...]

> As I said above - there is no solution that works in all cases.

I fully agree. When more than one block is corrupted, and you don't know which are the corrupted blocks, you're lost.

> If more than one block is corrupt, and you don't know which ones, then you lose, and there is no way around that.

Sure. The point that I'm trying to make is that there does exist a specific case in which recovery is possible, and that implementing recovery for that case will not hurt in any way.

> RAID is not designed to protect against bad RAM, bad cables, chipset bugs, driver bugs etc. It is only designed to protect against drive failure, where the drive failure is apparent, i.e. a read must return either the same data that was last written, or a failure indication. Anything else is beyond the design parameters for RAID.

I'm taking a more pragmatic approach here. In my opinion, RAID should just protect my data: against drive failure, yes, of course, but if it can also help me in case of occasional data corruption, I'd happily take that, too, especially if it doesn't cost extra... ;-)

Kind regards,

Thiemo
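The recovery being argued for here can be sketched in a few lines. What follows is an illustrative Python model (not md kernel code) of the method in section 4 of HPA's raid6 paper, using the GF(2^8) arithmetic (polynomial 0x11d) that Linux raid6 uses; all function names are invented for this sketch.

```python
# GF(2^8) exp/log tables, generator g = 2, reduction polynomial 0x11d.
GF_EXP = [0] * 512
GF_LOG = [0] * 256
x = 1
for i in range(255):
    GF_EXP[i] = x
    GF_LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):          # double the table so products need no mod
    GF_EXP[i] = GF_EXP[i - 255]

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]

def compute_pq(data):
    """P is the XOR of the data bytes; Q weights data disk i by g**i."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(GF_EXP[i], d)
    return p, q

def diagnose(data, p, q):
    """Classify a stripe as clean, single-bad-data-block (with its index),
    bad P, bad Q, or inconsistent (more than one error)."""
    p2, q2 = compute_pq(data)
    dp, dq = p ^ p2, q ^ q2
    if dp == 0 and dq == 0:
        return ('clean', None)
    if dp == 0:                            # only Q disagrees: Q itself is bad
        return ('Q', None)
    if dq == 0:                            # only P disagrees: P itself is bad
        return ('P', None)
    z = (GF_LOG[dq] - GF_LOG[dp]) % 255    # solve g**z = dq/dp
    if z < len(data):
        return ('data', z)                 # repairable: data[z] ^= dp
    return ('multi', None)                 # no single disk explains dp, dq
```

Note the caveat behind Neil's objection: the diagnosis is only trustworthy under the assumption of at most one bad block. For example, the same byte flipped identically on two data disks cancels out in P and makes the stripe look like a bad Q, so a 'repair' acting on the diagnosis would silently rewrite good data.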
Re: raid6 check/repair
Dear Neil,

> > I have been looking a bit at the check/repair functionality in the raid6 personality. It seems that if an inconsistent stripe is found during 'repair', md does not try to determine which block is corrupt (using e.g. the method in section 4 of HPA's raid6 paper), but just recomputes the parity blocks - i.e. the same way as inconsistent raid5 stripes are handled. Correct?
>
> Correct! The most likely cause of parity being incorrect is a write to data + P + Q that was interrupted when one or two of those had been written but the others had not. No matter which was or was not written, recomputing P and Q will produce a 'correct' result, and it is simple. I really don't see any justification for being more clever.

My opinion about that is quite different. Speaking just for myself:

a) When I put my data on a RAID running on Linux, I'd expect the software to do everything which is possible to protect and, when necessary, to restore data integrity. (This expectation was one of the reasons why I chose software RAID with Linux.)

b) As a consequence of a): When I'm using a RAID level that has extra redundancy, I'd expect Linux to make use of that extra redundancy during a 'repair'. (Otherwise I'd consider 'repair' a misnomer and rather call it 'recalc parity'.)

c) Why should 'repair' be implemented in a way that only works in most cases, when there exists a solution that works in all cases? (After all, possibilities for corruption are many, e.g. bad RAM, bad cables, chipset bugs, driver bugs, last but not least human mistake. From all these errors I'd like to be able to recover gracefully, without putting the array at risk by removing and re-adding a component device.)

Bottom line: So far I was talking about *my* expectations; is it reasonable to assume that they are shared by others? Are there any arguments that I'm not aware of speaking against an improved implementation of 'repair'?

BTW: I just checked, and it's the same for RAID 1: when I intentionally corrupt a sector in the first device of a set of 16, 'repair' copies the corrupted data to the 15 remaining devices, instead of restoring the correct sector from one of the other fifteen devices to the first.

Thank you for your time.

Kind regards,

Thiemo Nagel
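For the RAID 1 case just described, the obvious alternative to "copy the first device over the rest" is a majority vote across the mirrors. A hypothetical sketch (this is not what md actually does):

```python
from collections import Counter

def majority_repair(copies):
    """Given the same sector as read from each mirror, return the majority
    content and the indices of the mirrors that disagree with it.
    Illustrative only: md's 'repair' simply propagates the first copy."""
    winner, _ = Counter(copies).most_common(1)[0]
    bad = [i for i, c in enumerate(copies) if c != winner]
    return winner, bad
```

With 16 mirrors and one corrupted sector, this would rewrite only the single disagreeing device; with only two mirrors there is no majority, and the approach degenerates to an arbitrary choice, which is presumably why md does not attempt it in general.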
Re: raid6 check/repair
On Wednesday November 21, [EMAIL PROTECTED] wrote:
> Dear Neil,
>
> > > I have been looking a bit at the check/repair functionality in the raid6 personality. It seems that if an inconsistent stripe is found during 'repair', md does not try to determine which block is corrupt (using e.g. the method in section 4 of HPA's raid6 paper), but just recomputes the parity blocks - i.e. the same way as inconsistent raid5 stripes are handled. Correct?
> >
> > Correct! The most likely cause of parity being incorrect is a write to data + P + Q that was interrupted when one or two of those had been written but the others had not. No matter which was or was not written, recomputing P and Q will produce a 'correct' result, and it is simple. I really don't see any justification for being more clever.
>
> My opinion about that is quite different. Speaking just for myself:
>
> a) When I put my data on a RAID running on Linux, I'd expect the software to do everything which is possible to protect and, when necessary, to restore data integrity. (This expectation was one of the reasons why I chose software RAID with Linux.)

Yes, of course. "Possible" is an important aspect of this.

> b) As a consequence of a): When I'm using a RAID level that has extra redundancy, I'd expect Linux to make use of that extra redundancy during a 'repair'. (Otherwise I'd consider 'repair' a misnomer and rather call it 'recalc parity'.)

The extra redundancy in RAID6 is there to enable you to survive two drive failures. Nothing more.

While it is possible to use the RAID6 P+Q information to deduce which data block is wrong if it is known that either 0 or 1 data blocks are wrong, it is *not* possible to deduce which block or blocks are wrong if it is possible that more than 1 data block is wrong.

As it is quite possible for a write to be aborted in the middle (during unexpected power down) with an unknown number of blocks in a given stripe updated but others not, we do not know how many blocks might be wrong, so we cannot try to recover some wrong block. Doing so would quite possibly corrupt a block that is not wrong.

The repair process repairs the parity (redundancy information). It does not repair the data. It cannot.

The only possible scenario that md/raid recognises for the parity information being wrong is the case of an unexpected shutdown in the middle of a stripe write, where some blocks have been written and some have not. Further (for raid 4/5/6), it only supports this case when your array is not degraded. If you have a degraded array, then an unexpected shutdown is potentially fatal to your data (the chance of it actually being fatal is quite small, but the potential is still there). There is nothing RAID can do about this. It is not designed to protect against power failure. It is designed to protect against drive failure. It does that quite well.

If you have wrong data appearing on your device for some other reason, then you have a serious hardware problem and RAID cannot help you. The best approach to dealing with data on drives getting spontaneously corrupted is for the filesystem to perform strong checksums on the data blocks, and store the checksums in the indexing information. This provides detection, not recovery, of course.

> c) Why should 'repair' be implemented in a way that only works in most cases, when there exists a solution that works in all cases? (After all, possibilities for corruption are many, e.g. bad RAM, bad cables, chipset bugs, driver bugs, last but not least human mistake. From all these errors I'd like to be able to recover gracefully, without putting the array at risk by removing and re-adding a component device.)

As I said above - there is no solution that works in all cases. If more than one block is corrupt, and you don't know which ones, then you lose, and there is no way around that.

RAID is not designed to protect against bad RAM, bad cables, chipset bugs, driver bugs etc. It is only designed to protect against drive failure, where the drive failure is apparent, i.e. a read must return either the same data that was last written, or a failure indication. Anything else is beyond the design parameters for RAID.

It might be possible to design a data storage system that was resilient to these sorts of errors. It would be much more sophisticated than RAID, though.

NeilBrown

> Bottom line: So far I was talking about *my* expectations; is it reasonable to assume that they are shared by others? Are there any arguments that I'm not aware of speaking against an improved implementation of 'repair'?
>
> BTW: I just checked, and it's the same for RAID 1: when I intentionally corrupt a sector in the first device of a set of 16, 'repair' copies the corrupted data to the 15 remaining devices, instead of restoring the correct sector from one of the other fifteen devices to the first.
>
> Thank you for your time.
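Neil's suggestion of filesystem-level checksums can be illustrated with a toy model. The dictionaries below stand in for a filesystem's data blocks and indexing structures; the function names are invented for this sketch.

```python
import hashlib

def write_block(store, index, blockno, data):
    """Store a block and record its checksum in the 'indexing' structure."""
    store[blockno] = data
    index[blockno] = hashlib.sha256(data).digest()

def read_block(store, index, blockno):
    """Return the block, or raise if it no longer matches its checksum.
    Detection only: nothing here can reconstruct the original data."""
    data = store[blockno]
    if hashlib.sha256(data).digest() != index[blockno]:
        raise IOError("checksum mismatch on block %d" % blockno)
    return data
```

The point of storing the checksum in the indexing information rather than next to the data is that a misdirected or torn write to the data block cannot simultaneously update the checksum, so the corruption is caught on the next read.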
Re: raid6 check/repair
On Thursday November 15, [EMAIL PROTECTED] wrote:
> Hi,
>
> I have been looking a bit at the check/repair functionality in the raid6 personality. It seems that if an inconsistent stripe is found during 'repair', md does not try to determine which block is corrupt (using e.g. the method in section 4 of HPA's raid6 paper), but just recomputes the parity blocks - i.e. the same way as inconsistent raid5 stripes are handled. Correct?

Correct!

The most likely cause of parity being incorrect is a write to data + P + Q that was interrupted when one or two of those had been written but the others had not. No matter which was or was not written, recomputing P and Q will produce a 'correct' result, and it is simple.

I really don't see any justification for being more clever.

NeilBrown