Re: On the subject of RAID-6 corruption recovery

2008-01-07 Thread H. Peter Anvin

Mattias Wadenstein wrote:

On Mon, 7 Jan 2008, Thiemo Nagel wrote:

What you call pathologic cases when it comes to real-world data are 
very common.  It is not at all unusual to find sectors filled with 
only a constant (usually zero, but not always), in which case your 
**512 becomes **1.


Of course it would be easy to check how many of the 512 Bytes are 
really different on a case-by-case basis and correct the exponent 
accordingly, and only perform the recovery when the corrected 
probability of introducing an error is sufficiently low.


What is the alternative to recovery, really? Just erroring out and 
letting the admin deal with it, or blindly assuming that the parity is wrong?




Erroring out.  Only thing to do at that point.

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: On the subject of RAID-6 corruption recovery

2008-01-04 Thread H. Peter Anvin

Thiemo Nagel wrote:


Inverting your argument, that means that when we don't see z = n or
inconsistent z numbers, multidisc corruption can be statistically excluded.

For errors occurring on the level of hard disk blocks (signature: most
bytes of the block have D errors, all with same z), the probability for
multidisc corruption to go undetected is ((n-1)/256)**512.  This might
pose a problem in the limiting case of n=255, however for practical
applications, this probability is negligible as it drops off
exponentially with decreasing n:



That assumes fully random data distribution, which is almost certainly a 
false assumption.
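
For illustration, a minimal sketch (my own, not from the thread) that evaluates 
the quoted ((n-1)/256)**512 formula in log space; it is only meaningful under 
the random-data assumption being questioned here:

/* Hypothetical sketch: evaluate ((n-1)/256)**512 for a few n, in log10
 * space to avoid underflow.  Valid only under the random-data assumption. */
#include <math.h>
#include <stdio.h>

int main(void)
{
	static const int ns[] = { 4, 16, 64, 255 };
	int i;

	for (i = 0; i < (int)(sizeof(ns) / sizeof(ns[0])); i++) {
		int n = ns[i];
		double log10p = 512.0 * log10((n - 1) / 256.0);

		printf("n = %3d: log10(P_undetected) = %.1f\n", n, log10p);
	}
	return 0;
}

For n = 255 this comes out around 10^-1.7 (roughly 2%), matching the "limiting 
case" caveat quoted above; for small n it is astronomically small, which is 
exactly the point being disputed for non-random data.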


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: On the subject of RAID-6 corruption recovery

2008-01-04 Thread H. Peter Anvin

Thiemo Nagel wrote:

For errors occurring on the level of hard disk blocks (signature: most
bytes of the block have D errors, all with same z), the probability for
multidisc corruption to go undetected is ((n-1)/256)**512.  This might
pose a problem in the limiting case of n=255, however for practical
applications, this probability is negligible as it drops off
exponentially with decreasing n:


That assumes fully random data distribution, which is almost certainly a
false assumption.


Agreed.  This means that the formula only serves to specify a lower limit
on the probability.  However, is there an argument for why a pathologic
case would be probable, i.e. why the probability would be likely to
*vastly* deviate from the theoretical limit?  And if there is, would that
argument not apply to other raid 6 operations (like check) as well?
And would it help to use different Galois field generators at different
positions in a sector instead of using a uniform generator?



What you call pathologic cases when it comes to real-world data are 
very common.  It is not at all unusual to find sectors filled with only 
a constant (usually zero, but not always), in which case your **512 
becomes **1.


It doesn't mean it's not worthwhile, but don't try to claim it is 
anything other than opportunistic.
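
As a purely illustrative, hypothetical sketch of the "correct the exponent" 
idea from earlier in the thread (the function name and interface are invented 
here, not anything actually proposed as a patch): count the distinct (P*, Q*) 
error patterns present in a sector, so a constant-filled corrupt sector 
contributes an effective exponent of 1 rather than 512.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Count distinct nonzero (P*, Q*) byte patterns in one sector.  A caller
 * could only attempt automatic recovery when ((n-1)/256)**k is below some
 * threshold, and error out otherwise.  Hypothetical helper, not md code. */
static int effective_exponent(const uint8_t *pstar, const uint8_t *qstar,
			      int len)
{
	static bool seen[256][256];
	int i, k = 0;

	memset(seen, 0, sizeof(seen));
	for (i = 0; i < len; i++) {
		if (pstar[i] == 0 && qstar[i] == 0)
			continue;		/* this byte happens to be error-free */
		if (!seen[pstar[i]][qstar[i]]) {
			seen[pstar[i]][qstar[i]] = true;
			k++;			/* new, distinct error pattern */
		}
	}
	return k;	/* 1 for a constant-filled corrupt sector, up to len otherwise */
}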


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: On the subject of RAID-6 corruption recovery

2008-01-04 Thread H. Peter Anvin

Thiemo Nagel wrote:


That's why I was asking about the generator.  Theoretically, this
situation might be countered by using a (pseudo-)random pattern of
generators for the different bytes of a sector, though I'm not sure
whether it is worth the effort.



Changing the generator is mathematically equivalent to changing the 
order of the drives, so no, that wouldn't help (and would make the 
common computations a lot more expensive.)


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: On the subject of RAID-6 corruption recovery

2007-12-28 Thread H. Peter Anvin

Bill Davidsen wrote:

H. Peter Anvin wrote:
I got a private email a while ago from Thiemo Nagel claiming that some 
of the conclusions in my RAID-6 paper were incorrect.  This was 
combined with a proof which was plain wrong, and could easily be 
disproven using basic entropy accounting (i.e. how much information 
is around to play with.)


However, it did cause me to clarify the text portion a little bit.  In 
particular, *in practice* it may be possible to *probabilistically* 
detect multidisk corruption.  Probabilistic detection means that the 
detection is not guaranteed, but it can be taken advantage of 
opportunistically.


If this means that there can be no false positives for multidisk 
corruption but may be false negatives, fine. If it means something else, 
please restate one more time.




Pretty much.  False negatives are quite serious, since they will imply a 
course of action which will introduce further corruption.


-hpa

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [md-raid6-accel PATCH 01/12] async_tx: PQXOR implementation

2007-12-27 Thread H. Peter Anvin

Yuri Tikhonov wrote:

 This patch implements support for the asynchronous computation of RAID-6
syndromes.

 It provides an API to compute RAID-6 syndromes asynchronously in a format
conforming to async_tx interfaces. The async_pxor and async_pqxor_zero_sum
functions are very similar to async_xor functions but make use of
additional tx_set_src_mult method for setting coefficients of the RAID-6
Q syndrome.

 The Galois polynomial which is used in the s/w case is 0x11d (the
corresponding coefficients are hard-coded in raid6_call.gen_syndrome).
Because even with the h/w acceleration enabled some pqxor operations may be
processed on the CPU (e.g. when no DMA descriptors are available), it's highly
recommended to configure the DMA engine which your system uses to use
exactly the same Galois polynomial.



It should probably be noted here, too, that if you use a different basis 
polynomial for the Galois field you will end up with a different on-disk 
format.


+ * You should have received a copy of the GNU General Public License 
along with

+ * this program; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place - Suite 330, Boston, MA  02111-1307, USA.

This address, I believe, is obsolete.

+   if (!(tx=async_pqxor(NULL, ptrs[failb],
+   ptrs[disks - 2], bc, 0, 2, bytes,
+   ASYNC_TX_DEP_ACK | ASYNC_TX_XOR_ZERO_DST,
+   tx, NULL, NULL))) {
+   /* It's bad if we failed here; try to repeat this
+* using another failed disk as a spare; this wouldn't
+* failed since now we'll be able to compute synchronously
+* (there is no support for synchronous Q-only)
+*/
+   async_pqxor(ptrs[faila], ptrs[failb],
+   ptrs[disks - 2], bc, 0, 2, bytes,
+   ASYNC_TX_DEP_ACK | ASYNC_TX_XOR_ZERO_DST,
+   NULL, NULL, NULL);
+   }

I don't really understand this logic, or the comment that goes along 
with it.  Could you please elucidate?


-hpa

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


On the subject of RAID-6 corruption recovery

2007-12-27 Thread H. Peter Anvin
I got a private email a while ago from Thiemo Nagel claiming that some 
of the conclusions in my RAID-6 paper were incorrect.  This was combined 
with a proof which was plain wrong, and could easily be disproven 
using basic entropy accounting (i.e. how much information is around to 
play with.)


However, it did cause me to clarify the text portion a little bit.  In 
particular, *in practice* it may be possible to *probabilistically* 
detect multidisk corruption.  Probabilistic detection means that the 
detection is not guaranteed, but it can be taken advantage of 
opportunistically.


In particular, if you follow the algorithm of section 4 of my paper, you 
end up with a corrupt disk number, but the result is a vector, not a 
scalar.  This is because the algorithm is executed on the P* and Q* 
error vectors on a byte by byte basis.


In the common case of a single disk corruption, what you will typically 
see is an error pattern with a consistent z value interrupted by 
correct bytes (P* = Q* = {00}); these are bytes which happened to retain 
the right value by chance.  For the z values which can be computed (recall, 
z is only well-defined if P* and Q* are != {00}), they should match.


There are two patterns which are likely to indicate multi-disk 
corruption and where recovery software should trip out and raise hell:


* z = n: the computed error disk doesn't exist.

Obviously, if the corrupt disk is a disk that can't exist, we
have a bigger problem.

This is probabilistic, since as n approaches 255, the
probability of detection goes to zero.

* Inconsistent z numbers (or spurious P and Q references)

If the calculation for which disk is corrupt jumps around
within a single sector, there is likely a problem.

It's worth noting in all of this that there are 258 possible outcomes of 
the complete error analysis algorithm - 255 possible D errors (z 
values), P error, Q error, and no error.  If these are to be analyzed as 
an array, it can't be solely a byte array.


That this set is complete is shown by the fact that out of 65536 
possible (P, Q) states, this corresponds to:


1 state no error
255 states P error (the 256th state is a no-error state!)
255 states Q error
255*255 states D error (n = 255 is maximum for byte-oriented RAID-6)

... for a total of 65536 states.
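
To make the bookkeeping concrete, here is a minimal, hypothetical sketch of 
that per-byte analysis (my own illustration, not md code; the field arithmetic 
matches mktables.c: polynomial 0x11d, generator g = {02}, and a real 
implementation would use log/exp tables rather than the brute-force loop):

#include <stdint.h>

static uint8_t gfmul(uint8_t a, uint8_t b)
{
	uint8_t v = 0;

	while (b) {
		if (b & 1)
			v ^= a;
		a = (a << 1) ^ (a & 0x80 ? 0x1d : 0);
		b >>= 1;
	}
	return v;
}

/* 255 z values + P error + Q error + no error = 258 outcomes. */
enum { NO_ERROR = 256, P_ERROR = 257, Q_ERROR = 258 };

/* Classify one byte position from Pstar = P + P' and Qstar = Q + Q'. */
static int classify_byte(uint8_t pstar, uint8_t qstar)
{
	uint8_t gz = 1;
	int z;

	if (pstar == 0 && qstar == 0)
		return NO_ERROR;
	if (qstar == 0)
		return P_ERROR;
	if (pstar == 0)
		return Q_ERROR;
	/* z = log_g(Qstar divided by Pstar): find z with g^z * Pstar == Qstar. */
	for (z = 0; z < 255; z++) {
		if (gfmul(gz, pstar) == qstar)
			return z;
		gz = gfmul(gz, 0x02);
	}
	return -1;	/* cannot happen: {02} generates all nonzero elements */
}

/* Scan one sector; -1 means "error out and raise hell" (z names a
 * nonexistent data disk, or the outcomes are inconsistent). */
static int analyze_sector(const uint8_t *pstar, const uint8_t *qstar,
			  int len, int ndata)
{
	int i, verdict = NO_ERROR;

	for (i = 0; i < len; i++) {
		int r = classify_byte(pstar[i], qstar[i]);

		if (r == NO_ERROR)
			continue;		/* byte happened to be right */
		if (r < 256 && r >= ndata)
			return -1;		/* z = n: no such data disk */
		if (verdict == NO_ERROR)
			verdict = r;
		else if (r != verdict)
			return -1;		/* inconsistent outcomes */
	}
	return verdict;
}

The two -1 cases correspond exactly to the "trip out and raise hell" patterns 
listed above.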

-hpa

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: switching root fs '/' to boot from RAID1 with grub

2007-11-04 Thread H. Peter Anvin

Bill Davidsen wrote:


I don't understand your point; unless there's a Linux bootloader in the 
BIOS, it will boot whatever 512 bytes are in sector 0.  So if that's crap, 
it doesn't matter what it would do if it were valid; some other bytes 
came off the drive instead.  Maybe Windows, since there seems to be an 
option in Windows to check the boot sector on boot and rewrite it if it 
isn't the WinXP one.  One of my offspring has that problem: dual-boot 
system, and every time he boots Windows he has to boot from rescue and 
reinstall grub.


I think he could install grub in the partition, make that the active 
partition, and the boot would work, but he tried and only partitions of 
type FAT or VFAT seem to boot, active or not.




The Grub-promoted practice of stuffing the Linux bootloader in the MBR 
is a bad idea, but that's not the issue here.


The issue here is that the bootloader itself is capable of making the 
decision to reject a corrupt image and boot the next device.  The Linux 
kernel, unfortunately, doesn't have a sane way to do that.


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: switching root fs '/' to boot from RAID1 with grub

2007-11-03 Thread H. Peter Anvin

Bill Davidsen wrote:


Depends how bad the drive is.  Just to align the thread on this -  
If the boot sector is bad - the bios on newer boxes will skip to the 
next one.  But if it is good, and you boot into garbage - - could be 
Windows.. does it crash?


Right, if the drive is dead almost every BIOS will fail over, if the 
read gets a CRC or similar most recent BIOS will fail over, but if an 
error-free read returns bad data, how can the BIOS know?




Unfortunately the Linux boot format doesn't contain any sort of 
integrity check.  Otherwise the bootloader could catch this kind of 
error and throw a failure, letting the next disk boot (or another kernel.)


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: switching root fs '/' to boot from RAID1 with grub

2007-11-01 Thread H. Peter Anvin

Doug Ledford wrote:


device /dev/sda (hd0)
root (hd0,0)
install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) 
/boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst
device /dev/hdc (hd0)
root (hd0,0)
install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) 
/boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst

That will install grub on the master boot record of hdc and sda, and in
both cases grub will look to whatever drive it is running on for the
files to boot instead of going to a specific drive.



No, it won't... it'll look for the first drive in the system (BIOS drive 
80h).  This means that if the BIOS can see the bad drive, but it doesn't 
work, you're still screwed.


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: switching root fs '/' to boot from RAID1 with grub

2007-11-01 Thread H. Peter Anvin

Doug Ledford wrote:


Correct, and that's what you want.  The alternative is that if the BIOS
can see the first disk but it's broken and can't be used, and if you
have the boot sector on the second disk set to read from BIOS disk 0x81
because you ASSuMEd the first disk would be broken but still present in
the BIOS tables, then your machine won't boot unless that first dead but
preset disk is present.  If you remove the disk entirely, thereby
bumping disk 0x81 to 0x80, then you are screwed.  If you have any drive
failure that prevents the first disk from being recognized (blown fuse,
blown electronics, etc), you are screwed until you get a new disk to
replace it.



What you want is for it to use the drive number that BIOS passes into it 
(register DL), not a hard-coded number.  That was my (only) point -- 
you're obviously right that hard-coding a number to 0x81 would be worse 
than useless.


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] raid6: clean up the style of mktables.c and its output

2007-10-26 Thread H. Peter Anvin
Make both mktables.c and its output CodingStyle compliant.  Update the
copyright notice.

Signed-off-by: H. Peter Anvin [EMAIL PROTECTED]
---
 drivers/md/mktables.c |  166 +++--
 1 files changed, 79 insertions(+), 87 deletions(-)

diff --git a/drivers/md/mktables.c b/drivers/md/mktables.c
index adef299..f690649 100644
--- a/drivers/md/mktables.c
+++ b/drivers/md/mktables.c
@@ -1,13 +1,10 @@
-#ident "$Id: mktables.c,v 1.2 2002/12/12 22:41:27 hpa Exp $"
-/* --- *
+/* -*- linux-c -*- --- *
  *
- *   Copyright 2002 H. Peter Anvin - All Rights Reserved
+ *   Copyright 2002-2007 H. Peter Anvin - All Rights Reserved
  *
- *   This program is free software; you can redistribute it and/or modify
- *   it under the terms of the GNU General Public License as published by
- *   the Free Software Foundation, Inc., 53 Temple Place Ste 330,
- *   Bostom MA 02111-1307, USA; either version 2 of the License, or
- *   (at your option) any later version; incorporated herein by reference.
+ *   This file is part of the Linux kernel, and is made available under
+ *   the terms of the GNU General Public License version 2 or (at your
+ *   option) any later version; incorporated herein by reference.
  *
  * --- */
 
@@ -26,100 +23,95 @@
 
 static uint8_t gfmul(uint8_t a, uint8_t b)
 {
-  uint8_t v = 0;
+   uint8_t v = 0;
 
-  while ( b ) {
-if ( b & 1 ) v ^= a;
-a = (a << 1) ^ (a & 0x80 ? 0x1d : 0);
-b >>= 1;
-  }
-  return v;
+   while (b) {
+   if (b & 1)
+   v ^= a;
+   a = (a << 1) ^ (a & 0x80 ? 0x1d : 0);
+   b >>= 1;
+   }
+   return v;
 }
 
 static uint8_t gfpow(uint8_t a, int b)
 {
-  uint8_t v = 1;
+   uint8_t v = 1;
 
-  b %= 255;
-  if ( b < 0 )
-b += 255;
+   b %= 255;
+   if (b < 0)
+   b += 255;
 
-  while ( b ) {
-if ( b & 1 ) v = gfmul(v,a);
-a = gfmul(a,a);
-b >>= 1;
-  }
-  return v;
+   while (b) {
+   if (b & 1)
+   v = gfmul(v, a);
+   a = gfmul(a, a);
+   b >>= 1;
+   }
+   return v;
 }
 
 int main(int argc, char *argv[])
 {
-  int i, j, k;
-  uint8_t v;
-  uint8_t exptbl[256], invtbl[256];
+   int i, j, k;
+   uint8_t v;
+   uint8_t exptbl[256], invtbl[256];
 
-  printf("#include \"raid6.h\"\n");
+   printf("#include \"raid6.h\"\n");

-  /* Compute multiplication table */
-  printf("\nconst u8  __attribute__((aligned(256)))\n"
-"raid6_gfmul[256][256] =\n"
-"{\n");
-  for ( i = 0 ; i < 256 ; i++ ) {
-printf("\t{\n");
-for ( j = 0 ; j < 256 ; j += 8 ) {
-  printf("\t\t");
-  for ( k = 0 ; k < 8 ; k++ ) {
-   printf("0x%02x, ", gfmul(i,j+k));
-  }
-  printf("\n");
-}
-printf("\t},\n");
-  }
-  printf("};\n");
+   /* Compute multiplication table */
+   printf("\nconst u8  __attribute__((aligned(256)))\n"
+  "raid6_gfmul[256][256] =\n" "{\n");
+   for (i = 0; i < 256; i++) {
+   printf("\t{\n");
+   for (j = 0; j < 256; j += 8) {
+   printf("\t\t");
+   for (k = 0; k < 8; k++)
+   printf("0x%02x,%c", gfmul(i, j + k),
+  (k == 7) ? '\n' : ' ');
+   }
+   printf("\t},\n");
+   }
+   printf("};\n");
 
-  /* Compute power-of-2 table (exponent) */
-  v = 1;
-  printf("\nconst u8 __attribute__((aligned(256)))\n"
-"raid6_gfexp[256] =\n"
-"{\n");
-  for ( i = 0 ; i < 256 ; i += 8 ) {
-printf("\t");
-for ( j = 0 ; j < 8 ; j++ ) {
-  exptbl[i+j] = v;
-  printf("0x%02x, ", v);
-  v = gfmul(v,2);
-  if ( v == 1 ) v = 0; /* For entry 255, not a real entry */
-}
-printf("\n");
-  }
-  printf("};\n");
+   /* Compute power-of-2 table (exponent) */
+   v = 1;
+   printf("\nconst u8 __attribute__((aligned(256)))\n"
+  "raid6_gfexp[256] =\n" "{\n");
+   for (i = 0; i < 256; i += 8) {
+   printf("\t");
+   for (j = 0; j < 8; j++) {
+   exptbl[i + j] = v;
+   printf("0x%02x,%c", v, (j == 7) ? '\n' : ' ');
+   v = gfmul(v, 2);
+   if (v == 1)
+   v = 0;  /* For entry 255, not a real entry */
+   }
+   }
+   printf("};\n");
 
-  /* Compute inverse table x^-1 == x^254 */
-  printf("\nconst u8 __attribute__((aligned(256)))\n"
-"raid6_gfinv[256] =\n"
-"{\n");
-  for ( i = 0 ; i < 256 ; i += 8 ) {
-printf("\t");
-for ( j = 0 ; j < 8 ; j++ ) {
-  invtbl[i+j] = v = gfpow(i+j,254);
-  printf("0x%02x, ", v);
-}
-printf("\n");
-  }
-  printf("};\n");
+   /* Compute inverse table x^-1 == x^254 */
+   printf("\nconst u8

[PATCH] raid6: clean up the style of raid6test/test.c

2007-10-26 Thread H. Peter Anvin
Clean up the coding style in raid6test/test.c.  Break it apart into
subfunctions to make the code more readable.

Signed-off-by: H. Peter Anvin [EMAIL PROTECTED]
---
 drivers/md/raid6test/test.c |  117 +--
 1 files changed, 69 insertions(+), 48 deletions(-)

diff --git a/drivers/md/raid6test/test.c b/drivers/md/raid6test/test.c
index 0d5cd57..559cc41 100644
--- a/drivers/md/raid6test/test.c
+++ b/drivers/md/raid6test/test.c
@@ -1,12 +1,10 @@
 /* -*- linux-c -*- --- *
  *
- *   Copyright 2002 H. Peter Anvin - All Rights Reserved
+ *   Copyright 2002-2007 H. Peter Anvin - All Rights Reserved
  *
- *   This program is free software; you can redistribute it and/or modify
- *   it under the terms of the GNU General Public License as published by
- *   the Free Software Foundation, Inc., 53 Temple Place Ste 330,
- *   Bostom MA 02111-1307, USA; either version 2 of the License, or
- *   (at your option) any later version; incorporated herein by reference.
+ *   This file is part of the Linux kernel, and is made available under
+ *   the terms of the GNU General Public License version 2 or (at your
+ *   option) any later version; incorporated herein by reference.
  *
  * --- */
 
@@ -30,67 +28,87 @@ char *dataptrs[NDISKS];
 char data[NDISKS][PAGE_SIZE];
 char recovi[PAGE_SIZE], recovj[PAGE_SIZE];
 
-void makedata(void)
+static void makedata(void)
 {
int i, j;
 
-   for (  i = 0 ; i < NDISKS ; i++ ) {
-   for ( j = 0 ; j < PAGE_SIZE ; j++ ) {
+   for (i = 0; i < NDISKS; i++) {
+   for (j = 0; j < PAGE_SIZE; j++)
data[i][j] = rand();
-   }
+
dataptrs[i] = data[i];
}
 }
 
+static char disk_type(int d)
+{
+   switch (d) {
+   case NDISKS-2:
+   return 'P';
+   case NDISKS-1:
+   return 'Q';
+   default:
+   return 'D';
+   }
+}
+
+static int test_disks(int i, int j)
+{
+   int erra, errb;
+
+   memset(recovi, 0xf0, PAGE_SIZE);
+   memset(recovj, 0xba, PAGE_SIZE);
+
+   dataptrs[i] = recovi;
+   dataptrs[j] = recovj;
+
+   raid6_dual_recov(NDISKS, PAGE_SIZE, i, j, (void **)dataptrs);
+
+   erra = memcmp(data[i], recovi, PAGE_SIZE);
+   errb = memcmp(data[j], recovj, PAGE_SIZE);
+
+   if (i < NDISKS-2 && j == NDISKS-1) {
+   /* We don't implement the DQ failure scenario, since it's
+  equivalent to a RAID-5 failure (XOR, then recompute Q) */
+   erra = errb = 0;
+   } else {
+   printf("algo=%-8s  faila=%3d(%c)  failb=%3d(%c)  %s\n",
+  raid6_call.name,
+  i, disk_type(i),
+  j, disk_type(j),
+  (!erra && !errb) ? "OK" :
+  !erra ? "ERRB" :
+  !errb ? "ERRA" : "ERRAB");
+   }
+
+   dataptrs[i] = data[i];
+   dataptrs[j] = data[j];
+
+   return erra || errb;
+}
+
 int main(int argc, char *argv[])
 {
-   const struct raid6_calls * const * algo;
+   const struct raid6_calls *const *algo;
int i, j;
-   int erra, errb;
+   int err = 0;
 
makedata();
 
-   for ( algo = raid6_algos ; *algo ; algo++ ) {
-   if ( !(*algo)->valid || (*algo)->valid() ) {
+   for (algo = raid6_algos; *algo; algo++) {
+   if (!(*algo)->valid || (*algo)->valid()) {
raid6_call = **algo;
 
/* Nuke syndromes */
memset(data[NDISKS-2], 0xee, 2*PAGE_SIZE);
 
/* Generate assumed good syndrome */
-   raid6_call.gen_syndrome(NDISKS, PAGE_SIZE, (void **)dataptrs);
-
-   for ( i = 0 ; i < NDISKS-1 ; i++ ) {
-   for ( j = i+1 ; j < NDISKS ; j++ ) {
-   memset(recovi, 0xf0, PAGE_SIZE);
-   memset(recovj, 0xba, PAGE_SIZE);
-
-   dataptrs[i] = recovi;
-   dataptrs[j] = recovj;
-
-   raid6_dual_recov(NDISKS, PAGE_SIZE, i, j, (void **)dataptrs);
-
-   erra = memcmp(data[i], recovi, PAGE_SIZE);
-   errb = memcmp(data[j], recovj, PAGE_SIZE);
-
-   if ( i < NDISKS-2 && j == NDISKS-1 ) {
-   /* We don't implement the DQ failure scenario, since it's
-  equivalent to a RAID-5 failure (XOR, then recompute Q) */
-   } else {
-   printf("algo=%-8s  faila=%3d(%c)  failb=%3d(%c)  %s\n"

Re: [PATCH] [mdadm] Add klibc support to mdadm.h

2007-10-02 Thread H. Peter Anvin

maximilian attems wrote:

klibc still misses a lot of functionality needed to link mdadm against it;
this small step helps to get to the real trouble.. :)

Signed-off-by: maximilian attems [EMAIL PROTECTED]
---
 mdadm.h |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/mdadm.h b/mdadm.h
index ac7d4b4..dba09f0 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -29,7 +29,7 @@
 
 #define	_GNU_SOURCE

 #include   <unistd.h>
-#ifndef __dietlibc__
+#if !defined(__dietlibc__) && !defined(__KLIBC__)
 extern __off64_t lseek64 __P ((int __fd, __off64_t __offset, int __whence));
 #else


Wouldn't it be better to just compile with -D_FILE_OFFSET_BITS=64 on all 
libraries instead of using the LFS cruft?


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Please revert 5b479c91da90eef605f851508744bfe8269591a0 (md partition rescan)

2007-05-10 Thread H. Peter Anvin
Satyam Sharma wrote:
 On 5/10/07, Xavier Bestel [EMAIL PROTECTED] wrote:
 On Thu, 2007-05-10 at 16:51 +0200, Jan Engelhardt wrote:
  (But Andrew never saw your email, I suspect: [EMAIL PROTECTED] is
  probably
  some strange mixup of Andrew Morton and Andi Kleen in your mind ;)
 
  What do the letters kp stand for?
 
 Heh ... I've always wanted to know that myself. It's funny, no one
 seems to have asked that on lkml during all these years (at least none
 that a Google search would throw up).
 
 Keep Patching ?
 
 Unlikely. akpm seems to be a pre-Linux-kernel nick.

http://en.wikipedia.org/wiki/Andrew_Morton_%28computer_programmer%29

-hpa

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mkinitrd and RAID6 on FC5

2007-04-23 Thread H. Peter Anvin

Guy Watkins wrote:

Is this a REDHAT only problem/bug?  If so, since bugzilla.redhat.com gets
ignored, where do I complain?


Yes, this is Redhat only, and as far as I know, it was fixed a long time 
ago.  I suspect you need to make sure you upgrade your entire system, 
especially mkinitrd, not just the kernel.


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mkinitrd and RAID6 on FC5

2007-04-23 Thread H. Peter Anvin

Guy Watkins wrote:


I tried to update/upgrade and no updates are available for mkinitrd.  Do you
know what version has the fix?  The bugzilla was never closed, so it seems
it has not been fixed.

My version:
mkinitrd.i386   5.0.32-2   installed



I guess Red Hat decided not to fix this in FC5.  It does work in FC6; 
for FC5 I guess you're stuck passing --with=raid456 to mkinitrd :-/


-hpa

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mismatch_cnt questions

2007-03-13 Thread H. Peter Anvin

Andre Noll wrote:

On 00:21, H. Peter Anvin wrote:

I have just updated the paper at:

http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

... with this information (in slightly different notation and with a bit 
more detail.)


There's a typo in the new section:

s/By assumption, X_z != D_n/By assumption, X_z != D_z/



Thanks, fixed.

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reshaping raid0/10

2007-03-10 Thread H. Peter Anvin

Neil Brown wrote:


If I wanted to reshape a raid0, I would just morph it into a raid4
with a missing parity drive, then use the raid5 code to restripe it.
Then morph it back to regular raid0.



Wow, that made my brain hurt.

Given the fact that we're going to have to do this on kernel.org soon, 
what would be the concrete steps involved (we're going to have to change 
3-member raid0 into 4-member raid0)...


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mismatch_cnt questions

2007-03-08 Thread H. Peter Anvin

Bill Davidsen wrote:


When last I looked at Hamming code, and that would be 1989 or 1990, I 
believe that I learned that the number of Hamming bits needed to cover N 
data bits was 1+log2(N), which for 512 bytes would be 1+12, and fit into 
a 16 bit field nicely. I don't know that I would go that way, fix any 
one bit error, detect any two bit error, rather than a CRC which gives 
me only one chance in 64k of an undetected data error, but I find it 
interesting.




A Hamming code across the bytes of a sector is pretty darn pointless, 
since that's not a typical failure pattern.


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID1, hot-swap and boot integrity

2007-03-07 Thread H. Peter Anvin

Mike Accetta wrote:

I gathered the impression somewhere, perhaps incorrectly, that the active
flag was a function of the boot block, not the BIOS.  We use Grub in the 
MBR and don't even have an active flag set in the partition table.  The system

still boots.


The active flag is indeed an MBR issue.

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mismatch_cnt questions

2007-03-07 Thread H. Peter Anvin

H. Peter Anvin wrote:

Eyal Lebedinsky wrote:

Neil Brown wrote:
[trim Q re how resync fixes data]

For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy
and writing it over all other copies.
For raid5 we assume the data is correct and update the parity.


Can raid6 identify the bad block (two parity blocks could allow this
if only one block has bad data in a stripe)? If so, does it?

This will surely mean more value for raid6 than just the two-disk-failure
protection.



No.  It's not mathematically possible.



Okay, I've thought about it, and I got it wrong the first time 
(off-the-cuff misapplication of the pigeonhole principle.)


It apparently *is* possible (for notation and algebra rules, see my paper):

Let's assume we know exactly one of the data (Dn) drives is corrupt 
(ignoring the case of P or Q corruption for now.)  That means instead of 
Dn we have a corrupt value, Xn.  Note that which data drive is 
corrupt (n) is not known.


We compute P' and Q' as the computed values over the corrupt set.

P+P' = Dn+Xn
Q+Q' = g^n Dn + g^n Xn,  where g = {02}

Q+Q' = g^n (Dn+Xn)

By assumption, Dn != Xn, so P+P' = Dn+Xn != {00}.
g^n is *never* {00}, so Q+Q' = g^n (Dn+Xn) != {00}.

(Q+Q')/(P+P') = [g^n (Dn+Xn)]/(Dn+Xn) = g^n

Since n is known to be in the range [0,255), we thus have:

n = log_g((Q+Q')/(P+P'))

... which is a well-defined relation.

For the case where either the P or the Q drives are corrupt (and the 
data drives are all good), this is easily detected by the fact that if P 
is the corrupt drive, Q+Q' = {00}; similarly, if Q is the corrupt drive, 
P+P' = {00}.  Obviously, if P+P' = Q+Q' = {00}, then as far as RAID-6 
can discover, there is no corruption in the drive set.


So, yes, RAID-6 *can* detect single drive corruption, and even tell you 
which drive it is, if you're willing to compute a full syndrome set (P', 
Q') on every read (as well as on every write.)


Note: RAID-6 cannot detect 2-drive corruption, unless of course the 
corruption is in different byte positions.  If multiple corresponding 
byte positions are corrupt, then the algorithm above will generally 
point you to a completely innocent drive.
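
As a self-contained sanity check of that relation (my own illustrative 
program, not anything shipped with md; it reuses the bit-serial gfmul() 
helper from mktables.c and picks the arbitrary values Dn = 0x01, Xn = 0x03, 
n = 3):

#include <stdint.h>
#include <stdio.h>

/* Same GF(2^8) multiply as mktables.c, polynomial 0x11d. */
static uint8_t gfmul(uint8_t a, uint8_t b)
{
	uint8_t v = 0;

	while (b) {
		if (b & 1)
			v ^= a;
		a = (a << 1) ^ (a & 0x80 ? 0x1d : 0);
		b >>= 1;
	}
	return v;
}

int main(void)
{
	const uint8_t g = 0x02, Dn = 0x01, Xn = 0x03;
	const int n = 3;			/* the "true" corrupt drive */
	uint8_t gn = 1, gz = 1, dP, dQ;
	int z;

	for (z = 0; z < n; z++)
		gn = gfmul(gn, g);		/* gn = g^n = 0x08 */

	dP = Dn ^ Xn;				/* P+P' = Dn+Xn = 0x02 */
	dQ = gfmul(gn, dP);			/* Q+Q' = g^n (Dn+Xn) = 0x10 */

	/* Recover n as the discrete log of (Q+Q')/(P+P'), here by simply
	 * testing g^z * (P+P') against (Q+Q') for each candidate z. */
	for (z = 0; z < 255; z++) {
		if (gfmul(gz, dP) == dQ)
			break;
		gz = gfmul(gz, g);
	}
	printf("recovered z = %d (expected %d)\n", z, n);
	return 0;
}

A table-driven version would of course use the log/exp tables instead of the 
brute-force loop.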


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID1, hot-swap and boot integrity

2007-03-05 Thread H. Peter Anvin

Mike Accetta wrote:

I wonder if having the MBR typically outside of the array and the relative
newness of partitioned arrays are related?  When I was considering how to
architect the RAID1 layout it seemed like a partitioned array on the
entire disk worked most naturally.


It's one way to do it, for sure.  The main problem with that, of course, 
is that it's not compatible with other operating systems.


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID1, hot-swap and boot integrity

2007-03-04 Thread H. Peter Anvin

Mike Accetta wrote:


I've been considering trying something like having the re-sync algorithm
on a whole disk array defer the copy for sector 0 to the very end of the
re-sync operation.  Assuming the BIOS makes at least a minimal consistency
check on sector 0 before electing to boot from the drive, this would keep
it from selecting a partially re-sync'd drive that was not previously 
bootable.


The only check that it will make is to look for 55 AA at the end of the MBR.

Note that typically the MBR is not part of any of your MD volumes.
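
For reference, that signature check is trivial; a hypothetical illustration 
(not BIOS or md code):

#include <stdint.h>

/* A typical BIOS only verifies the 0x55 0xAA signature in the last two
 * bytes of sector 0 before jumping to it. */
static int mbr_has_boot_signature(const uint8_t sector0[512])
{
	return sector0[510] == 0x55 && sector0[511] == 0xaa;
}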

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: end to end error recovery musings

2007-02-28 Thread H. Peter Anvin

James Bottomley wrote:

On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:

4104.  It's 8 bytes per hardware sector.  At least for T10...


Er ... that won't look good to the 512 ATA compatibility remapping ...



Well, in that case you'd only see 8x512 data bytes, no metadata...

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: end to end error recovery musings

2007-02-26 Thread H. Peter Anvin

Theodore Tso wrote:


In any case, the reason why I bring this up is that it would be really
nice if there was a way with a single laptop drive to be able to do
snapshots and background fsck's without having to use initrd's with
device mapper.



This is a major part of why I've been trying to push integrated klibc to 
have all that stuff as a unified kernel deliverable.  Unfortunately, 
as you know, Linus apparently rejected the concept at least for now at 
LKS last year.


With klibc this stuff could still be in one single wrapper without funny 
dependencies, but wouldn't have to be ported to kernel space.


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: end to end error recovery musings

2007-02-23 Thread H. Peter Anvin

Ric Wheeler wrote:


We still have the following challenges:

   (1) read-ahead often means that we will  retry every bad sector at 
least twice from the file system level. The first time, the fs read 
ahead request triggers a speculative read that includes the bad sector 
(triggering the error handling mechanisms) right before the real 
application read does the same thing.  Not sure what the 
answer is here, since read-ahead is obviously a huge win in the normal case.




Probably the only sane thing to do is to remember the bad sectors and 
avoid attempting reading them; that would mean marking automatic 
versus explicitly requested requests to determine whether or not to 
filter them against a list of discovered bad blocks.
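
A minimal sketch of what that could look like (purely hypothetical data 
structure and names, not existing kernel code): keep a small sorted table of 
known-bad LBAs and consult it only when building automatic read-ahead 
requests.

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_BAD 1024

static uint64_t bad_lba[MAX_BAD];
static size_t nbad;

static int cmp_lba(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/* Record a sector that failed an explicitly requested read. */
static void remember_bad(uint64_t lba)
{
	if (nbad < MAX_BAD) {
		bad_lba[nbad++] = lba;
		qsort(bad_lba, nbad, sizeof(bad_lba[0]), cmp_lba);
	}
}

/* Consulted only for automatic (read-ahead) requests; explicit
 * application reads still go to the media. */
static bool skip_readahead(uint64_t lba)
{
	return bsearch(&lba, bad_lba, nbad, sizeof(bad_lba[0]), cmp_lba) != NULL;
}

On a successful overwrite the entry would be dropped again, since the drive 
will normally have remapped the sector (as the next message notes).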


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: end to end error recovery musings

2007-02-23 Thread H. Peter Anvin

Andreas Dilger wrote:

And clearing this list when the sector is overwritten, as it will almost
certainly be relocated at the disk level.


Certainly if the overwrite is successful.

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATA/SATA Disk Reliability paper

2007-02-19 Thread H. Peter Anvin

Richard Scobie wrote:
Thought this paper may be of interest. A study done by Google on over 
100,000 drives they have/had in service.


http://labs.google.com/papers/disk_failures.pdf



Bastards:

Failure rates are known to be highly correlated with drive
models, manufacturers and vintages [18]. Our results do
not contradict this fact. For example, Figure 2 changes
significantly when we normalize failure rates per each
drive model. Most age-related results are impacted by
drive vintages. However, in this paper, we do not show a
breakdown of drives per manufacturer, model, or vintage
due to the proprietary nature of these data.

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [md] RAID6: clean up CPUID and FPU enter/exit code

2007-02-08 Thread H. Peter Anvin
My apologies for the screwed-up 'To:' line in the previous email... I 
did -s `head -1 file` instead of -s `head -1 file` by mistake [:^O


-hpa (who is going to bed now...)
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: strange raid6 assembly problem

2006-08-24 Thread H. Peter Anvin

Mickael Marchand wrote:


so basically I don't really know what to do with my sdf3 at the moment
and fear to reboot again :o)
maybe a --re-add /dev/sdf3 could work here ? but will it survive a
reboot ?



At this point, for whatever reason, your kernel doesn't see /dev/sdf3 as 
part of the array.


You could mdadm --add it, and yes, it should survive a reboot.  Unless 
something is seriously goofy, of course, but that's impossible to 
determine from your trouble report.


A RAID-6 in two-disk degraded mode often ends up needing two recovery 
passes (one to go from 2 missing disks to 1, and one from 1 to 0).  This 
isn't a technical need, but is a result of the fact that unless you happen 
to have two hotspares standing by, the 2-to-1 recovery typically will have 
started by the time the second disk is added.  This may be the source of 
your strangeness.


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux: Why software RAID?

2006-08-23 Thread H. Peter Anvin

Chris Friesen wrote:

Jeff Garzik wrote:

But anyway, to help answer the question of hardware vs. software RAID, 
I wrote up a page:


http://linux.yyz.us/why-software-raid.html


Just curious...with these guys 
(http://www.bigfootnetworks.com/KillerOverview.aspx) putting linux on a 
PCI NIC to allow them to bypass Windows' network stack, has anyone ever 
considered doing hardware raid by using an embedded cpu running linux 
software RAID, with battery-backed memory?


It would theoretically allow you to remain feature-compatible by 
downloading new kernels to your RAID card.




Yes.  In fact, I have been told by several RAID chip vendors that their 
customers are *strongly* demanding that their chips be able to run Linux 
 md (and still use whatever hardware offload features.)


So it's happening.

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Multiple raids on one machine?

2006-06-25 Thread H. Peter Anvin

Chris Allen wrote:


2. Partition the raw disks into four partitions and make 
/dev/md0,md1,md2,md3.
But am I heading for problems here? Is there going to be a big 
performance hit
with four raid5 arrays on the same machine? Am I likely to have dataloss 
problems

if my machine crashes?



2 should work just fine.

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ok to go ahead with this setup?

2006-06-22 Thread H. Peter Anvin

Molle Bestefich wrote:

Christian Pernegger wrote:

Intel SE7230NH1-E mainboard
Pentium D 930


HPA recently said that x86_64 CPUs have better RAID5 performance.


Actually, anything with SSE2 should be OK.

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ok to go ahead with this setup?

2006-06-22 Thread H. Peter Anvin

Molle Bestefich wrote:

Christian Pernegger wrote:

Anything specific wrong with the Maxtors?


No.  I've used Maxtor for a long time and I'm generally happy with them.

They break now and then, but their online warranty system is great.
I've also been treated kindly by their help desk - talked to a cute
gal from Maxtor in Ireland over the phone just yesterday ;-).

Then again, they've just been acquired by Seagate, or so, so things
may change for the worse, who knows.

I'd watch out regarding the Western Digital disks, apparently they
have a bad habit of turning themselves off when used in RAID mode, for
some reason:
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/1980/



I have exactly the opposite experience.  More than 50% of Maxtor drives 
fail inside 18 months; WDs seem to be really solid.


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Curious code in autostart_array

2006-06-22 Thread H. Peter Anvin

Pete Zaitcev wrote:

Hi, guys:

My copy of 2.6.17-rc5 has the following code in autostart_array():
mdp_disk_t *desc = sb->disks + i;
dev_t dev = MKDEV(desc->major, desc->minor);

if (!dev)
continue;
if (dev == startdev)
continue;
if (MAJOR(dev) != desc->major || MINOR(dev) != desc->minor)
continue;

Under what conditions do you think the last if() statement can fire?
What is its purpose? This looks like an attempt to detect bit clipping.
But what exactly?



It can fire if either desc->major or desc->minor overflow the respective 
fields in dev_t.  Unfortunately, it's not guaranteed to do so.


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid6

2006-06-16 Thread H. Peter Anvin
Followup to:  [EMAIL PROTECTED]
By author:=?GB2312?B?uPDQ29fK?= [EMAIL PROTECTED]
In newsgroup: linux.dev.raid

 I am confronted with a big problem with the raid6 algorithm,
 which I ran into recently while studying the raid6 code of linux 2.6 that
 you have contributed.
  Unfortunately I cannot understand the algorithm of the P+Q parity in
 this program.  Is there some formula for this raid6 algorithm?  I would really
 appreciate your help; could you show me some details about this algorithm?

   http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: which CPU for XOR?

2006-06-09 Thread H. Peter Anvin
Followup to:  [EMAIL PROTECTED]
By author:Dexter Filmore [EMAIL PROTECTED]
In newsgroup: linux.dev.raid

 What type of operation is XOR anyway? Should be ALU, right?
 So - what CPU is best for software raid? One with high integer processing 
 power? 
 

Something with massive wide vector registers.

PowerPC with Altivec totally kicks ass; x86-64 isn't too bad either.

There are also some processors with builtin RAID accelerators; at
least Intel, Broadcom and AMCC make them.

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: And then there was Bryce...

2006-06-08 Thread H. Peter Anvin
Followup to:  [EMAIL PROTECTED]
By author:John Stoffel [EMAIL PROTECTED]
In newsgroup: linux.dev.raid
 
 The problem is more likely that your /etc/mdadm/mdadm.conf file is
 specifying exactly which partitions to use, instead of just doing
 something like the following:
 
   DEVICE partitions
   ARRAY /dev/md0 level=raid1 auto=yes num-devices=2 
 UUID=2e078443:42b63ef5:cc179492:aecf0094
 
 Which should do the trick for you.  Can you post your mdadm.conf file
 so we can look it over?

Hey guys, look at the syslog output again.  He's using kernel autorun.

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: And then there was Bryce...

2006-06-08 Thread H. Peter Anvin
Followup to:  [EMAIL PROTECTED]
By author:Henrik Holst [EMAIL PROTECTED]
In newsgroup: linux.dev.raid
 
 The same happened to me with eth0-2. I _could_ not for my life
 understand why I didn't get the internet connection to work. But then I
 realized that eth0 and eth1 had been swapped after I upgraded to udev.
 Please consult your distribution's udev documentation for how to lock down
 scsi and network cards to specific kernel names.
 

This doesn't explain how come it bound drives without superblocks.
It should only bind drives with the correct superblock UUID, EVER.

Udev doesn't actually matter here, since the kernel, not udev, assigns
the numbers to the drives.

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem with large devices 2TB

2006-05-14 Thread H. Peter Anvin
Followup to:  [EMAIL PROTECTED]
By author:Jim Klimov [EMAIL PROTECTED]
In newsgroup: linux.dev.raid
 
   Since the new parted worked ok (older one didn't), we were happy
   until we tried a reboot. During the device initialization and after
   it the system only recognises the 6 or 7 partitions which start
   before the 2000Gb limit:
 

For a DOS partition table, there is no such thing as a partition
starting beyond 2 TB.  You need to use a GPT or other more
sophisticated partition table.

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 005 of 11] md: Merge raid5 and raid6 code

2006-04-30 Thread H. Peter Anvin

NeilBrown wrote:

There is a lot of commonality between raid5.c and raid6main.c.  This
patches merges both into one module called raid456.  This saves a lot
of code, and paves the way for online raid5-raid6 migrations.

There is still duplication, e.g. between handle_stripe5 and
handle_stripe6.  This will probably be cleaned up later.

Cc:  H. Peter Anvin [EMAIL PROTECTED]
Signed-off-by: Neil Brown [EMAIL PROTECTED]



Wonderful!  Thank you for doing this :)

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [klibc] Re: Exporting which partitions to md-configure

2006-02-07 Thread H. Peter Anvin

Luca Berra wrote:


This, in fact is *EXACTLY* what we're talking about; it does require 
autoassemble.  Why do we care about the partition types at all?  The 
reason is that since the md superblock is at the end, it doesn't get 
automatically wiped if the partition is used as a raw filesystem, and 
so it's important that there is a qualifier for it.


I don't like using partition type as a qualifier; there are people who do
not wish to partition their drives, there are systems not supporting
msdos-like partitions, and heck, even m$ is migrating away from those.



That's why we're talking about non-msdos partitioning schemes.


In any case if that has to be done it should be done into mdadm, not
in a different scrip that is going to call mdadm (behaviour should be
consistent between mdadm invoked by initramfs and mdadm invoked on a
running system).


Agreed.  mdadm is the best place for it.


If the user wants to reutilize a device that was previously a member of
an md array he/she should use mdadm --zero-superblock to remove the
superblock.
I see no point in having a system that tries to compensate for users not
following correct procedures. sorry.


You don't?  That surprises me... making it harder for the user to have 
accidental data loss sounds like a very good thing to me.


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [klibc] Re: Exporting which partitions to md-configure

2006-02-07 Thread H. Peter Anvin

Luca Berra wrote:


making it harder for the user is a good thing, but please not at the
expense of usability



What's the usability problem?

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [klibc] Re: Exporting which partitions to md-configure

2006-02-06 Thread H. Peter Anvin

Neil Brown wrote:


What constitutes 'a piece of data'?  A bit? a byte?

I would say that 
   msdos:fd

is one piece of data.  The 'fd' is useless without the 'msdos'.
The 'msdos' is, I guess, not completely useless with the fd.

I would lean towards the composite, but I wouldn't fight a separation.



Well, the two pieces come from different sources.



Just as there is a direct unambiguous causal path from something
present at early boot to the root filesystem that is mounted (and the
root filesystem specifies all other filesystems through fstab)
similarly there should be an unambiguous causal path from something
present at early boot to the array which holds the root filesystem -
and the root filesystem should describe all other arrays via
mdadm.conf

Does that make sense?



It makes sense, but I disagree.  I believe you are correct in that the 
current preferred minor bit causes an invalid assumption that, e.g. 
/dev/md3 is always a certain thing, but since each array has a UUID, and 
one should be able to mount by either filesystem UUID or array UUID, 
there should be no need for such a conflict if one allows for dynamic md 
numbers.


Requiring that mdadm.conf describes the actual state of all volumes 
would be an enormous step in the wrong direction.  Right now, the Linux 
md system can handle some very oddball hardware changes (such as on 
hera.kernel.org, when the disks not just completely changed names due to 
a controller change, but changed from hd* to sd*!)


Dynamicity is a good thing, although it needs to be harnessed.

 kernel parameter md_root_uuid=xxyy:zzyy:aabb:ccdd...
This could be interpreted by an initramfs script to run mdadm
to find and assemble the array with that uuid.  The uuid of
each array is reasonably unique.

This, in fact is *EXACTLY* what we're talking about; it does require 
autoassemble.  Why do we care about the partition types at all?  The 
reason is that since the md superblock is at the end, it doesn't get 
automatically wiped if the partition is used as a raw filesystem, and so 
it's important that there is a qualifier for it.


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Exporting which partitions to md-configure

2006-01-30 Thread H. Peter Anvin

Neil Brown wrote:

On Monday January 30, [EMAIL PROTECTED] wrote:

Any feeling how best to do that?  My current thinking is to export a 
flags entry in addition to the current ones, presumably based on 
struct parsed_partitions->parts[].flags (fs/partitions/check.h), which 
seems to be what causes md_autodetect_dev() to be called.


I think I would prefer a 'type' attribute in each partition that
records the 'type' from the partition table.  This might be more
generally useful than just for md.
Then your userspace code would have to look for '253' and use just
those partitions.



What about non-DOS partitions?

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Exporting which partitions to md-configure

2006-01-30 Thread H. Peter Anvin

Kyle Moffett wrote:


Well, for an MSDOS partition table, you would look for '253', for a  Mac 
partition table you could look for something like 'Linux_RAID' or  
similar (just arbitrarily define some name beginning with the Linux_  
prefix), etc.  This means that the partition table type would need to  
be exposed as well (I don't know if it is already).




It's not, but perhaps exporting format and type as distinct 
attributes is the way to go.  The policy for which partitions to 
consider would live entirely in kinit that way.


type would be format-specific; in EFI it's a UUID.

This, of course, is a bigger change, but it just might be worth it.

-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Exporting which partitions to md-configure

2006-01-30 Thread H. Peter Anvin

Neil Brown wrote:


Well, grepping through fs/partitions/*.c, the 'flags' thing is set by
 efi.c, msdos.c sgi.c sun.c

Of these, efi compares something against PARTITION_LINUX_RAID_GUID,
and msdos.c, sgi.c and sun. compare something against
LINUX_RAID_PARTITION.

The former would look like
  e6d6d379-f507-44c2-a23c-238f2a3df928
in sysfs (I think);
The latter would look like
  fd
(I suspect).

These are both easily recognisable with no real room for confusion.


Well, if we're going to have a generic facility it should make sense 
across the board.  If all we're doing is supporting legacy usage we 
might as well export a flag.


I guess we could have a single entry with a string of the form 
efi:e6d6d379-f507-44c2-a23c-238f2a3df928 or msdos:fd etc -- it 
really doesn't make any difference to me, but it seems cleaner to have 
two pieces of data in two different sysfs entries.
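
To make the comparison concrete, here is a hypothetical userspace sketch of 
consuming such an attribute; the attribute path, its name, and the exact 
string format ("msdos:fd", "efi:<guid>") are all assumptions taken from this 
discussion, not an existing kernel interface:

#include <stdio.h>
#include <string.h>

#define EFI_LINUX_RAID_GUID "e6d6d379-f507-44c2-a23c-238f2a3df928"

/* type_attr_path would be a hypothetical per-partition sysfs attribute. */
static int is_raid_candidate(const char *type_attr_path)
{
	char buf[80];
	FILE *f = fopen(type_attr_path, "r");
	int ok;

	if (!f)
		return 0;
	ok = (fgets(buf, sizeof(buf), f) != NULL);
	fclose(f);
	if (!ok)
		return 0;
	buf[strcspn(buf, "\n")] = '\0';

	return strcmp(buf, "msdos:fd") == 0 ||
	       strcmp(buf, "efi:" EFI_LINUX_RAID_GUID) == 0;
}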




And if other partition styles wanted to add support for raid auto
detect, tell them no. It is perfectly possible and even preferable
to live without autodetect.   We should support legacy usage (those
above) but should discourage any new usage.



Why is that, keeping in mind this will all be done in userspace?

-hpa

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Exporting which partitions to md-configure

2006-01-30 Thread H. Peter Anvin

Neil Brown wrote:


Mac partition tables doesn't currently support autodetect (as far as I
can tell).  Let's keep it that way.



For now I guess I'll just take the code from init/do_mounts_md.c; we can 
worry about ripping the RAID_AUTORUN code out of the kernel later.


-hpa

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adding Reed-Solomon Personality to MD, need help/advice

2005-12-29 Thread H. Peter Anvin

Jeff Breidenbach wrote:

The fundamental problem is that generic RS requires table lookups even
in the common case, whereas RAID-6 uses shortcuts to substantially
speed up the computation in the common case.



If one wanted to support a typical 8-bit RS code (which supports a max of
256 drives, including ECC drives) it is already way too big to use a table. RS
is typically done with finite field math calculations which are -
relatively - fast
but they are much heavier than a parity calculation. Here is one commercial
benchmark, note the throughput numbers at the bottom of the page.



Well, most of them are implemented via tables (GF log table, etc.)  They 
tend to perform poorly on modern hardware.


-hpa
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: EVMS or md?

2005-04-04 Thread H. Peter Anvin
Followup to:  [EMAIL PROTECTED]
By author:David Kewley [EMAIL PROTECTED]
In newsgroup: linux.dev.raid

 Mike Tran wrote on Monday 04 April 2005 12:28:
  We (EVMS team) intended to support RAID6 last year.  But as we all
  remember RAID6 was not stable then.  I may write a plugin to support
  RAID6 soon.
 
 Hi Mike,
 
 In your view, is RAID6 now considered stable?  How soon might you have an 
 evms 
 plugin for it? ;)  I'd love to use evms on my new filserver if it supported 
 RAID6.
 

I can't speak for the EVMS people, but I got to stress-test my RAID6
test system some this weekend; after having run in 1-disk degraded
mode for several months (thus showing that the big bad degraded
write bug has been thoroughly fixed) I changed the motherboard, and
the kernel didn't support one of the controllers.  And now there were
2 missing drives.  Due to some bootloader problems, I ended up
yo-yoing between the two kernels a bit more than I intended to, and
went through quite a few RAID disk losses and rebuilds as a result.

No hiccups, data losses, or missing functionality.  At the end of the
whole ordeal, the filesystem (1 TB, 50% full) was still quite pristine,
and fsck confirmed this.  I was quite pleased :)

Oh, and doing the N-2 to N-1 rebuild is slow (obviously), but not
outrageously so.  It rebuilt the 1 TB array in a matter of
single-digit hours.  CPU utilization was quite high, obviously, but
it didn't cripple the system by any means.

-hpa


Re: Forcing a more random uuid (random seed bug)

2005-02-22 Thread H. Peter Anvin
Followup to:  [EMAIL PROTECTED]
By author: Niccolo Rigacci [EMAIL PROTECTED]
In newsgroup: linux.dev.raid

  I get /dev/md5, /dev/md6, /dev/md7
  and /dev/md8 all with the same UUID!
 
 It seems that there is a bug in mdadm: when generating the UUID for a 
 volume, the random() function is called, but the random sequence is never 
 initialized.
 
 The result is that every volume created with mdadm has a UUID of:
 6b8b4567:327b23c6:643c9869:66334873
 
 See also Debian bug 292784 at
 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=292784
 
 I fixed the problem by adding the following patch to mdadm.c, but please bear
 in mind that I'm totally unfamiliar with the mdadm code and quite naive in C
 programming:
 

Please don't use (s)random at all, except as a possible fallback to
/dev/(u)random.

-hpa


Re: Forcing a more random uuid (random seed bug)

2005-02-22 Thread H. Peter Anvin
Followup to:  [EMAIL PROTECTED]
By author: [EMAIL PROTECTED]
In newsgroup: linux.dev.raid

 +if ((my_fd = open("/dev/random", O_RDONLY)) != -1) {
 
 Please use /dev/urandom for such applications.  /dev/random is the
 highest-quality generator, but will block if entropy isn't available.
 /dev/urandom provides the best available, immediately, which is what
 this application wants.

Not 100% clear; the best would be to make it configurable.

Either way you must not use read() in the way described.  Short reads
happen, even with /dev/urandom.
 
 Also, this will only produce 2^32 possible UUIDs, since that's the
 size of the seed.  Meaning that after you've generated 2^16 of them,
 the chances are excellent that they're not UU any more.
 
 You might just want to get all 128 (minus epsilon) bits from /dev/urandom
 directly.

You *do* want to get all bits from /dev/urandom directly.

-hpa
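
As a concrete illustration of the advice above, here is a minimal user-space
sketch (not mdadm's actual code; the function name is made up for the example)
that pulls all 16 UUID bytes straight from /dev/urandom and loops on read(),
since short reads can happen even there.

#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Fill buf with len random bytes from /dev/urandom.  Returns 0 on
 * success, -1 on failure (the caller decides on a fallback). */
static int fill_random(unsigned char *buf, size_t len)
{
	size_t got = 0;
	int fd = open("/dev/urandom", O_RDONLY);

	if (fd == -1)
		return -1;
	while (got < len) {
		ssize_t n = read(fd, buf + got, len - got);
		if (n > 0)
			got += (size_t)n;	/* short reads are normal */
		else if (n == -1 && errno == EINTR)
			continue;		/* interrupted; retry */
		else
			break;			/* real error or unexpected EOF */
	}
	close(fd);
	return got == len ? 0 : -1;
}

An md UUID would then be filled with fill_random(uuid, 16); only if that
fails would a seeded srandom()/random() path need to be used at all.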


[PATCH] RAID Kconfig cleanups, remove experimental tag from RAID-6

2005-02-08 Thread H. Peter Anvin
This patch removes the experimental tag from RAID-6 (unfortunately the 
damage is already done...:-|) and cleans up a few more things in the 
Kconfig file.

Signed-Off-By: H. Peter Anvin [EMAIL PROTECTED]
Index: linux-2.5/drivers/md/Kconfig
===
RCS file: /home/hpa/kernel/bkcvs/linux-2.5/drivers/md/Kconfig,v
retrieving revision 1.17
diff -u -r1.17 Kconfig
--- linux-2.5/drivers/md/Kconfig	15 Jan 2005 23:46:55 -	1.17
+++ linux-2.5/drivers/md/Kconfig	8 Feb 2005 22:02:42 -
@@ -93,7 +93,7 @@
 	  mirroring (RAID-1) with easier configuration and more flexable
 	  layout.
 	  Unlike RAID-0, but like RAID-1, RAID-10 requires all devices to
-	  be the same size (or atleast, only as much as the smallest device
+	  be the same size (or at least, only as much as the smallest device
 	  will be used).
 	  RAID-10 provides a variety of layouts that provide different levels
 	  of redundancy and performance.
@@ -102,6 +102,7 @@
 
 	  ftp://ftp.kernel.org/pub/linux/utils/raid/mdadm/
 
+	  If unsure, say Y.
 
 config MD_RAID5
 	tristate "RAID-4/RAID-5 mode"
@@ -120,20 +121,16 @@
 	  http://www.tldp.org/docs.html#howto. There you will also
 	  learn where to get the supporting user space utilities raidtools.
 
-	  If you want to use such a RAID-4/RAID-5 set, say Y.  To compile
-	  this code as a module, choose M here: the module will be called raid5.
+	  If you want to use such a RAID-4/RAID-5 set, say Y.  To
+	  compile this code as a module, choose M here: the module
+	  will be called raid5.
 
 	  If unsure, say Y.
 
 config MD_RAID6
-	tristate "RAID-6 mode (EXPERIMENTAL)"
-	depends on BLK_DEV_MD && EXPERIMENTAL
+	tristate "RAID-6 mode"
+	depends on BLK_DEV_MD
 	---help---
-	  WARNING: RAID-6 is currently highly experimental.  If you
-	  use it, there is no guarantee whatsoever that it won't
-	  destroy your data, eat your disk drives, insult your mother,
-	  or re-appoint George W. Bush.
-
 	  A RAID-6 set of N drives with a capacity of C MB per drive
 	  provides the capacity of C * (N - 2) MB, and protects
 	  against a failure of any two drives. For a given sector
@@ -150,7 +147,7 @@
 	  this code as a module, choose M here: the module will be
 	  called raid6.
 
-	  If unsure, say N.
+	  If unsure, say Y.
 
 config MD_MULTIPATH
 	tristate "Multipath I/O support"


Re: [PATCH md 2 of 4] Fix raid6 problem

2005-02-03 Thread H. Peter Anvin
Lars Marowsky-Bree wrote:

On 2005-02-03T08:39:41, H. Peter Anvin [EMAIL PROTECTED] wrote:

Yes, right now there is no RAID5 to RAID6 conversion tool that I know of.

Hm. One of the checksums is identical, as is the disk layout of the
data, no?

No, the layout is different.
-hpa
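
For readers wondering why the layouts differ: the sketch below is
illustrative only (it is not the md driver's actual index arithmetic), but
it shows how a rotating P/Q placement works.  With the extra Q column in
every stripe, data chunks land on different disks and offsets than in the
RAID-5 layout, so the members cannot simply be reinterpreted in place.

#include <stdio.h>

/* Illustrative only: one way of rotating P and Q across the members of
 * an array, left-symmetric style. */
static void parity_disks(unsigned stripe, unsigned ndisks,
			 unsigned *p, unsigned *q)
{
	*p = ndisks - 1 - (stripe % ndisks);	/* P rotates backwards */
	*q = (*p + 1) % ndisks;			/* Q sits next to P, wrapping */
}

int main(void)
{
	unsigned p, q;

	for (unsigned stripe = 0; stripe < 6; stripe++) {
		parity_disks(stripe, 6, &p, &q);
		printf("stripe %u: P on disk %u, Q on disk %u\n", stripe, p, q);
	}
	return 0;
}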


Re: [PATCH md 2 of 4] Fix raid6 problem

2005-02-03 Thread H. Peter Anvin
Guy wrote:

Would you say that the 2.6 Kernel is suitable for storing mission-critical
data, then?

Sure.  I'd trust 2.6 over 2.4 at this point.

I ask because I have read about a lot of problems with data corruption and
oops on this list and the SCSI list.  But in most or all cases the 2.4
Kernel does not have the same problem.

I haven't seen any problems like that, including on kernel.org, which is
definitely a high demand site.

Who out there has a RAID6 array that they believe is stable and safe?
And please give some details about the array.  Number of disks, sizes, LVM,
FS, SCSI, ATA and anything else you can think of?  Also, details about any
disk failures and how well recovery went?

The one I have is a 6-disk ATA array (6x250 GB), ext3.  Had one disk
failure which hasn't been replaced yet; it's successfully running in
1-disk degraded mode.

I'll let other people speak for themselves.
-hpa


Re: [PATCH md 2 of 4] Fix raid6 problem

2005-02-02 Thread H. Peter Anvin
Followup to:  [EMAIL PROTECTED]
By author:A. James Lewis [EMAIL PROTECTED]
In newsgroup: linux.dev.raid

 
 Sorry for the delay in replying.  I've been using RAID6 in a real-life
 situation with 2.6.9 + patch for 2 months now, with 1.15 TB of storage,
 and I have had more than 1 drive failure... as well as some rather
 embarrassing hardware corruption which I traced to a faulty IDE controller.
 
 Despite some random DMA corruption, and losing a total of 3 disks, I have
 not had any problems with RAID6 itself, and really it has literally
 saved my data from being lost.
 
 I ran a diff against the 2.6.9 patch and what is in 2.6.10... and they are
 not the same; presumably a more elegant fix has been implemented for the
 production kernel??
 

I think there are some other (generic) fixes in there too.

Anyway... I'm thinking of sending in a patch to take out the
experimental status of RAID-6.  I have been running a 1 TB
production server in 1-disk degraded mode for about a month now
without incident.

-hpa
