Re: On the subject of RAID-6 corruption recovery
Mattias Wadenstein wrote:
> On Mon, 7 Jan 2008, Thiemo Nagel wrote:
>>> What you call pathologic cases when it comes to real-world data are very common. It is not at all unusual to find sectors filled with only a constant (usually zero, but not always), in which case your **512 becomes **1.
>> Of course it would be easy to check how many of the 512 bytes are really different on a case-by-case basis and correct the exponent accordingly, and only perform the recovery when the corrected probability of introducing an error is sufficiently low.
> What is the alternative to recovery, really? Just erroring out and letting the admin deal with it, or blindly assuming that the parity is wrong?

Erroring out. Only thing to do at that point.

	-hpa
Re: On the subject of RAID-6 corruption recovery
Thiemo Nagel wrote:
> Inverting your argumentation, that means that when we don't see z = n or inconsistent z numbers, multidisc corruption can be excluded statistically. For errors occurring on the level of hard disk blocks (signature: most bytes of the block have D errors, all with the same z), the probability for multidisc corruption to go undetected is ((n-1)/256)**512. This might pose a problem in the limiting case of n=255, however for practical applications this probability is negligible, as it drops off exponentially with decreasing n.

That assumes fully random data distribution, which is almost certainly a false assumption.

	-hpa
Re: On the subject of RAID-6 corruption recovery
Thiemo Nagel wrote:
>>> For errors occurring on the level of hard disk blocks (signature: most bytes of the block have D errors, all with the same z), the probability for multidisc corruption to go undetected is ((n-1)/256)**512. This might pose a problem in the limiting case of n=255, however for practical applications this probability is negligible, as it drops off exponentially with decreasing n.
>> That assumes fully random data distribution, which is almost certainly a false assumption.
> Agreed. This means that the formula only serves to specify a lower limit on the probability. However, is there an argument why a pathologic case would be probable, i.e. why the probability would be likely to *vastly* deviate from the theoretical limit? And if there is, would that argument not apply to other raid 6 operations (like check) as well? And would it help to use different Galois field generators at different positions in a sector instead of using a uniform generator?

What you call pathologic cases when it comes to real-world data are very common. It is not at all unusual to find sectors filled with only a constant (usually zero, but not always), in which case your **512 becomes **1.

It doesn't mean it's not worthwhile, but don't try to claim it is anything other than opportunistic.

	-hpa
Re: On the subject of RAID-6 corruption recovery
Thiemo Nagel wrote:
> That's why I was asking about the generator. Theoretically, this situation might be countered by using a (pseudo-)random pattern of generators for the different bytes of a sector, though I'm not sure whether it is worth the effort.

Changing the generator is mathematically equivalent to changing the order of the drives, so no, that wouldn't help (and it would make the common computations a lot more expensive.)

	-hpa
Re: On the subject of RAID-6 corruption recovery
Bill Davidsen wrote:
> H. Peter Anvin wrote:
>> I got a private email a while ago from Thiemo Nagel claiming that some of the conclusions in my RAID-6 paper were incorrect. This was combined with a proof which was plainly wrong, and could easily be disproven using basic entropy accounting (i.e. how much information is around to play with.) However, it did cause me to clarify the text portion a little bit. In particular, *in practice* it may be possible to *probabilistically* detect multidisk corruption. Probabilistic detection means that the detection is not guaranteed, but it can be taken advantage of opportunistically.
> If this means that there can be no false positives for multidisk corruption but there may be false negatives, fine. If it means something else, please restate one more time.

Pretty much. False negatives are quite serious, since they will imply a course of action which will introduce further corruption.

	-hpa
Re: [md-raid6-accel PATCH 01/12] async_tx: PQXOR implementation
Yuri Tikhonov wrote:
> This patch implements support for the asynchronous computation of RAID-6 syndromes. It provides an API to compute RAID-6 syndromes asynchronously in a format conforming to the async_tx interfaces. The async_pqxor and async_pqxor_zero_sum functions are very similar to the async_xor functions but make use of the additional tx_set_src_mult method for setting the coefficients of the RAID-6 Q syndrome. The Galois polynomial which is used in the s/w case is 0x11d (the corresponding coefficients are hard-coded in raid6_call.gen_syndrome). Because even with the h/w acceleration enabled some pqxor operations may be processed on the CPU (e.g. in case no DMA descriptors are available), it's highly recommended to configure the DMA engine which your system uses to use exactly the same Galois polynomial.

It should probably be noted here, too, that if you use a different basis polynomial for the Galois field you will end up with a different on-disk format.

> + * You should have received a copy of the GNU General Public License along with
> + * this program; if not, write to the Free Software Foundation, Inc., 59
> + * Temple Place - Suite 330, Boston, MA 02111-1307, USA.

This address, I believe, is obsolete.

> +	if (!(tx=async_pqxor(NULL, ptrs[failb],
> +		ptrs[disks - 2], bc, 0, 2, bytes,
> +		ASYNC_TX_DEP_ACK | ASYNC_TX_XOR_ZERO_DST,
> +		tx, NULL, NULL))) {
> +		/* It's bad if we failed here; try to repeat this
> +		 * using another failed disk as a spare; this wouldn't
> +		 * failed since now we'll be able to compute synchronously
> +		 * (there is no support for synchronous Q-only)
> +		 */
> +		async_pqxor(ptrs[faila], ptrs[failb],
> +			ptrs[disks - 2], bc, 0, 2, bytes,
> +			ASYNC_TX_DEP_ACK | ASYNC_TX_XOR_ZERO_DST,
> +			NULL, NULL, NULL);
> +	}

I don't really understand this logic, or the comment that goes along with it. Could you please elucidate?

	-hpa
On the subject of RAID-6 corruption recovery
I got a private email a while ago from Thiemo Nagel claiming that some of the conclusions in my RAID-6 paper were incorrect. This was combined with a proof which was plainly wrong, and could easily be disproven using basic entropy accounting (i.e. how much information is around to play with.) However, it did cause me to clarify the text portion a little bit.

In particular, *in practice* it may be possible to *probabilistically* detect multidisk corruption. Probabilistic detection means that the detection is not guaranteed, but it can be taken advantage of opportunistically.

In particular, if you follow the algorithm of section 4 of my paper, you end up with a corrupt disk number, but the result is a vector, not a scalar. This is because the algorithm is executed on the P* and Q* error vectors on a byte-by-byte basis. In the common case of a single disk corruption, what you will typically see is an error pattern that has a consistent value interrupted by apparently correct bytes (P* = Q* = {00}); these are bytes which by chance still carry the correct value. For the z values which can be computed (recall, z is only well-defined if P* and Q* are both != {00}), they should match.

There are two patterns which are likely to indicate multi-disk corruption and where recovery software should trip out and raise hell:

* z = n: the computed error disk doesn't exist. Obviously, if the corrupt disk is a disk that can't exist, we have a bigger problem. This is probabilistic, since as n approaches 255, the probability of detection goes to zero.

* Inconsistent z numbers (or spurious P and Q references). If the calculation for which disk is corrupt jumps around within a single sector, there is likely a problem.

It's worth noting in all of this that there are 258 possible outcomes of the complete error analysis algorithm: 255 possible D errors (z values), P error, Q error, and no error. If these are to be analyzed as an array, it can't be solely a byte array. That this set is complete is shown by the fact that, out of the 65536 possible (P, Q) states, this corresponds to:

	      1 state	no error
	    255 states	P error (the 256th state is a no-error state!)
	    255 states	Q error
	255*255 states	D error (n = 255 is the maximum for byte-oriented RAID-6)

... for a total of 65536 states.

	-hpa
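To make the byte-by-byte analysis above concrete, here is a minimal, self-contained sketch in C of how a checker could classify one sector. The GF(2^8) table setup mirrors the approach of the kernel's mktables.c, but the function and constant names are invented for this illustration and are not the md driver's API:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define VERDICT_CLEAN	(-1)	/* no corruption visible */
#define VERDICT_P	(-2)	/* P drive corrupt */
#define VERDICT_Q	(-3)	/* Q drive corrupt */
#define VERDICT_BAD	(-4)	/* inconsistent: suspect multi-disk corruption */

static uint8_t gflog[256], gfexp[256];

/* GF(2^8) multiply, polynomial 0x11d, as in the RAID-6 paper. */
static uint8_t gfmul(uint8_t a, uint8_t b)
{
	uint8_t v = 0;
	while (b) {
		if (b & 1)
			v ^= a;
		a = (a << 1) ^ ((a & 0x80) ? 0x1d : 0);
		b >>= 1;
	}
	return v;
}

static void gf_init(void)
{
	uint8_t v = 1;
	for (int i = 0; i < 255; i++) {
		gfexp[i] = v;
		gflog[v] = i;
		v = gfmul(v, 2);
	}
}

/* pstar[i] = P[i] ^ P'[i] and qstar[i] = Q[i] ^ Q'[i] for one sector, where
 * P'/Q' are recomputed from the data drives; ndata is the number of data
 * (non-parity) drives. */
static int analyse_sector(const uint8_t *pstar, const uint8_t *qstar,
			  size_t len, int ndata)
{
	int verdict = VERDICT_CLEAN;

	for (size_t i = 0; i < len; i++) {
		int byte_verdict;

		if (!pstar[i] && !qstar[i])
			continue;			/* byte looks clean */
		if (!qstar[i])
			byte_verdict = VERDICT_P;	/* only P disagrees */
		else if (!pstar[i])
			byte_verdict = VERDICT_Q;	/* only Q disagrees */
		else {
			/* z = log_g((Q+Q')/(P+P')) */
			int z = (gflog[qstar[i]] + 255 - gflog[pstar[i]]) % 255;
			if (z >= ndata)
				return VERDICT_BAD;	/* "disk" that can't exist */
			byte_verdict = z;
		}
		if (verdict == VERDICT_CLEAN)
			verdict = byte_verdict;
		else if (verdict != byte_verdict)
			return VERDICT_BAD;		/* z jumps around: raise hell */
	}
	return verdict;
}

int main(void)
{
	uint8_t pstar[512], qstar[512];

	gf_init();
	memset(pstar, 0, sizeof pstar);
	memset(qstar, 0, sizeof qstar);

	/* Simulate data drive 2 of 8 being corrupted in a few byte positions. */
	for (int i = 100; i < 110; i++) {
		uint8_t e = (uint8_t)(0x40 + i);	/* arbitrary error value */
		pstar[i] = e;
		qstar[i] = gfmul(e, gfexp[2]);
	}
	printf("verdict: %d (expected 2)\n", analyse_sector(pstar, qstar, 512, 8));
	return 0;
}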
Re: switching root fs '/' to boot from RAID1 with grub
Bill Davidsen wrote:
> I don't understand your point, unless there's a Linux bootloader in the BIOS it will boot whatever 512 bytes are in sector 0. So if that's crap it doesn't matter what it would do if it was valid, some other bytes came off the drive instead.
>
> Maybe Windows, since there seems to be an option in Windows to check the boot sector on boot and rewrite it if it isn't the WinXP one. One of my offspring has that problem, dual boot system, every time he boots Windows he has to boot from rescue and reinstall grub. I think he could install grub in the partition, make that the active partition, and the boot would work, but he tried and only type FAT or VFAT seem to boot, active or not.

The Grub-promoted practice of stuffing the Linux bootloader in the MBR is a bad idea, but that's not the issue here. The issue here is that the bootloader itself is capable of making the decision to reject a corrupt image and boot the next device. The Linux kernel, unfortunately, doesn't have a sane way to do that.

	-hpa
Re: switching root fs '/' to boot from RAID1 with grub
Bill Davidsen wrote:
>> Depends how bad the drive is. Just to align the thread on this - if the boot sector is bad, the BIOS on newer boxes will skip to the next one. But if it is good, and you boot into garbage - could be Windows.. does it crash?
> Right, if the drive is dead almost every BIOS will fail over, and if the read gets a CRC or similar error most recent BIOSes will fail over, but if an error-free read returns bad data, how can the BIOS know?

Unfortunately the Linux boot format doesn't contain any sort of integrity check. Otherwise the bootloader could catch this kind of error and throw a failure, letting the next disk boot (or another kernel.)

	-hpa
Re: switching root fs '/' to boot from RAID1 with grub
Doug Ledford wrote:
> device /dev/sda (hd0)
> root (hd0,0)
> install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) /boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst
>
> device /dev/hdc (hd0)
> root (hd0,0)
> install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) /boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst
>
> That will install grub on the master boot record of hdc and sda, and in both cases grub will look to whatever drive it is running on for the files to boot instead of going to a specific drive.

No, it won't... it'll look for the first drive in the system (BIOS drive 80h). This means that if the BIOS can see the bad drive, but it doesn't work, you're still screwed.

	-hpa
Re: switching root fs '/' to boot from RAID1 with grub
Doug Ledford wrote:
> Correct, and that's what you want. The alternative is that if the BIOS can see the first disk but it's broken and can't be used, and if you have the boot sector on the second disk set to read from BIOS disk 0x81 because you ASSuMEd the first disk would be broken but still present in the BIOS tables, then your machine won't boot unless that first dead but present disk is there. If you remove the disk entirely, thereby bumping disk 0x81 to 0x80, then you are screwed. If you have any drive failure that prevents the first disk from being recognized (blown fuse, blown electronics, etc), you are screwed until you get a new disk to replace it.

What you want is for it to use the drive number that the BIOS passes into it (register DL), not a hard-coded number. That was my (only) point -- you're obviously right that hard-coding a number to 0x81 would be worse than useless.

	-hpa
[PATCH] raid6: clean up the style of mktables.c and its output
Make both mktables.c and its output CodingStyle compliant. Update the copyright notice. Signed-off-by: H. Peter Anvin [EMAIL PROTECTED] --- drivers/md/mktables.c | 166 +++-- 1 files changed, 79 insertions(+), 87 deletions(-) diff --git a/drivers/md/mktables.c b/drivers/md/mktables.c index adef299..f690649 100644 --- a/drivers/md/mktables.c +++ b/drivers/md/mktables.c @@ -1,13 +1,10 @@ -#ident $Id: mktables.c,v 1.2 2002/12/12 22:41:27 hpa Exp $ -/* --- * +/* -*- linux-c -*- --- * * - * Copyright 2002 H. Peter Anvin - All Rights Reserved + * Copyright 2002-2007 H. Peter Anvin - All Rights Reserved * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation, Inc., 53 Temple Place Ste 330, - * Bostom MA 02111-1307, USA; either version 2 of the License, or - * (at your option) any later version; incorporated herein by reference. + * This file is part of the Linux kernel, and is made available under + * the terms of the GNU General Public License version 2 or (at your + * option) any later version; incorporated herein by reference. * * --- */ @@ -26,100 +23,95 @@ static uint8_t gfmul(uint8_t a, uint8_t b) { - uint8_t v = 0; + uint8_t v = 0; - while ( b ) { -if ( b 1 ) v ^= a; -a = (a 1) ^ (a 0x80 ? 0x1d : 0); -b = 1; - } - return v; + while (b) { + if (b 1) + v ^= a; + a = (a 1) ^ (a 0x80 ? 0x1d : 0); + b = 1; + } + return v; } static uint8_t gfpow(uint8_t a, int b) { - uint8_t v = 1; + uint8_t v = 1; - b %= 255; - if ( b 0 ) -b += 255; + b %= 255; + if (b 0) + b += 255; - while ( b ) { -if ( b 1 ) v = gfmul(v,a); -a = gfmul(a,a); -b = 1; - } - return v; + while (b) { + if (b 1) + v = gfmul(v, a); + a = gfmul(a, a); + b = 1; + } + return v; } int main(int argc, char *argv[]) { - int i, j, k; - uint8_t v; - uint8_t exptbl[256], invtbl[256]; + int i, j, k; + uint8_t v; + uint8_t exptbl[256], invtbl[256]; - printf(#include \raid6.h\\n); + printf(#include \raid6.h\\n); - /* Compute multiplication table */ - printf(\nconst u8 __attribute__((aligned(256)))\n -raid6_gfmul[256][256] =\n -{\n); - for ( i = 0 ; i 256 ; i++ ) { -printf(\t{\n); -for ( j = 0 ; j 256 ; j += 8 ) { - printf(\t\t); - for ( k = 0 ; k 8 ; k++ ) { - printf(0x%02x, , gfmul(i,j+k)); - } - printf(\n); -} -printf(\t},\n); - } - printf(};\n); + /* Compute multiplication table */ + printf(\nconst u8 __attribute__((aligned(256)))\n + raid6_gfmul[256][256] =\n {\n); + for (i = 0; i 256; i++) { + printf(\t{\n); + for (j = 0; j 256; j += 8) { + printf(\t\t); + for (k = 0; k 8; k++) + printf(0x%02x,%c, gfmul(i, j + k), + (k == 7) ? '\n' : ' '); + } + printf(\t},\n); + } + printf(};\n); - /* Compute power-of-2 table (exponent) */ - v = 1; - printf(\nconst u8 __attribute__((aligned(256)))\n -raid6_gfexp[256] =\n -{\n); - for ( i = 0 ; i 256 ; i += 8 ) { -printf(\t); -for ( j = 0 ; j 8 ; j++ ) { - exptbl[i+j] = v; - printf(0x%02x, , v); - v = gfmul(v,2); - if ( v == 1 ) v = 0; /* For entry 255, not a real entry */ -} -printf(\n); - } - printf(};\n); + /* Compute power-of-2 table (exponent) */ + v = 1; + printf(\nconst u8 __attribute__((aligned(256)))\n + raid6_gfexp[256] =\n {\n); + for (i = 0; i 256; i += 8) { + printf(\t); + for (j = 0; j 8; j++) { + exptbl[i + j] = v; + printf(0x%02x,%c, v, (j == 7) ? 
'\n' : ' '); + v = gfmul(v, 2); + if (v == 1) + v = 0; /* For entry 255, not a real entry */ + } + } + printf(};\n); - /* Compute inverse table x^-1 == x^254 */ - printf(\nconst u8 __attribute__((aligned(256)))\n -raid6_gfinv[256] =\n -{\n); - for ( i = 0 ; i 256 ; i += 8 ) { -printf(\t); -for ( j = 0 ; j 8 ; j++ ) { - invtbl[i+j] = v = gfpow(i+j,254); - printf(0x%02x, , v); -} -printf(\n); - } - printf(};\n); + /* Compute inverse table x^-1 == x^254 */ + printf(\nconst u8
[PATCH] raid6: clean up the style of raid6test/test.c
Clean up the coding style in raid6test/test.c. Break it apart into subfunctions to make the code more readable. Signed-off-by: H. Peter Anvin [EMAIL PROTECTED] --- drivers/md/raid6test/test.c | 117 +-- 1 files changed, 69 insertions(+), 48 deletions(-) diff --git a/drivers/md/raid6test/test.c b/drivers/md/raid6test/test.c index 0d5cd57..559cc41 100644 --- a/drivers/md/raid6test/test.c +++ b/drivers/md/raid6test/test.c @@ -1,12 +1,10 @@ /* -*- linux-c -*- --- * * - * Copyright 2002 H. Peter Anvin - All Rights Reserved + * Copyright 2002-2007 H. Peter Anvin - All Rights Reserved * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation, Inc., 53 Temple Place Ste 330, - * Bostom MA 02111-1307, USA; either version 2 of the License, or - * (at your option) any later version; incorporated herein by reference. + * This file is part of the Linux kernel, and is made available under + * the terms of the GNU General Public License version 2 or (at your + * option) any later version; incorporated herein by reference. * * --- */ @@ -30,67 +28,87 @@ char *dataptrs[NDISKS]; char data[NDISKS][PAGE_SIZE]; char recovi[PAGE_SIZE], recovj[PAGE_SIZE]; -void makedata(void) +static void makedata(void) { int i, j; - for ( i = 0 ; i NDISKS ; i++ ) { - for ( j = 0 ; j PAGE_SIZE ; j++ ) { + for (i = 0; i NDISKS; i++) { + for (j = 0; j PAGE_SIZE; j++) data[i][j] = rand(); - } + dataptrs[i] = data[i]; } } +static char disk_type(int d) +{ + switch (d) { + case NDISKS-2: + return 'P'; + case NDISKS-1: + return 'Q'; + default: + return 'D'; + } +} + +static int test_disks(int i, int j) +{ + int erra, errb; + + memset(recovi, 0xf0, PAGE_SIZE); + memset(recovj, 0xba, PAGE_SIZE); + + dataptrs[i] = recovi; + dataptrs[j] = recovj; + + raid6_dual_recov(NDISKS, PAGE_SIZE, i, j, (void **)dataptrs); + + erra = memcmp(data[i], recovi, PAGE_SIZE); + errb = memcmp(data[j], recovj, PAGE_SIZE); + + if (i NDISKS-2 j == NDISKS-1) { + /* We don't implement the DQ failure scenario, since it's + equivalent to a RAID-5 failure (XOR, then recompute Q) */ + erra = errb = 0; + } else { + printf(algo=%-8s faila=%3d(%c) failb=%3d(%c) %s\n, + raid6_call.name, + i, disk_type(i), + j, disk_type(j), + (!erra !errb) ? OK : + !erra ? ERRB : + !errb ? 
ERRA : ERRAB); + } + + dataptrs[i] = data[i]; + dataptrs[j] = data[j]; + + return erra || errb; +} + int main(int argc, char *argv[]) { - const struct raid6_calls * const * algo; + const struct raid6_calls *const *algo; int i, j; - int erra, errb; + int err = 0; makedata(); - for ( algo = raid6_algos ; *algo ; algo++ ) { - if ( !(*algo)-valid || (*algo)-valid() ) { + for (algo = raid6_algos; *algo; algo++) { + if (!(*algo)-valid || (*algo)-valid()) { raid6_call = **algo; /* Nuke syndromes */ memset(data[NDISKS-2], 0xee, 2*PAGE_SIZE); /* Generate assumed good syndrome */ - raid6_call.gen_syndrome(NDISKS, PAGE_SIZE, (void **)dataptrs); - - for ( i = 0 ; i NDISKS-1 ; i++ ) { - for ( j = i+1 ; j NDISKS ; j++ ) { - memset(recovi, 0xf0, PAGE_SIZE); - memset(recovj, 0xba, PAGE_SIZE); - - dataptrs[i] = recovi; - dataptrs[j] = recovj; - - raid6_dual_recov(NDISKS, PAGE_SIZE, i, j, (void **)dataptrs); - - erra = memcmp(data[i], recovi, PAGE_SIZE); - errb = memcmp(data[j], recovj, PAGE_SIZE); - - if ( i NDISKS-2 j == NDISKS-1 ) { - /* We don't implement the DQ failure scenario, since it's - equivalent to a RAID-5 failure (XOR, then recompute Q) */ - } else { - printf(algo=%-8s faila=%3d(%c) failb=%3d(%c) %s\n
Re: [PATCH] [mdadm] Add klibc support to mdadm.h
maximilian attems wrote:
> klibc still misses a lot of functionality to let mdadm link against it; this small step helps to get to the real trouble.. :)
>
> Signed-off-by: maximilian attems [EMAIL PROTECTED]
> ---
>  mdadm.h | 9 ++++++++-
>  1 files changed, 8 insertions(+), 1 deletions(-)
>
> diff --git a/mdadm.h b/mdadm.h
> index ac7d4b4..dba09f0 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -29,7 +29,7 @@
>  #define _GNU_SOURCE
>  #include <unistd.h>
> -#ifndef __dietlibc__
> +#if !defined(__dietlibc__) && !defined(__KLIBC__)
>  extern __off64_t lseek64 __P ((int __fd, __off64_t __offset, int __whence));
>  #else

Wouldn't it be better to just compile with -D_FILE_OFFSET_BITS=64 on all libraries instead of using the LFS cruft?

	-hpa
Re: Please revert 5b479c91da90eef605f851508744bfe8269591a0 (md partition rescan)
Satyam Sharma wrote:
> On 5/10/07, Xavier Bestel [EMAIL PROTECTED] wrote:
>> On Thu, 2007-05-10 at 16:51 +0200, Jan Engelhardt wrote:
>>> (But Andrew never saw your email, I suspect: [EMAIL PROTECTED] is probably some strange mixup of Andrew Morton and Andi Kleen in your mind ;)
>> What do the letters kp stand for?
> Heh ... I've always wanted to know that myself. It's funny, no one seems to have asked that on lkml during all these years (at least none that a Google search would throw up). Keep Patching?

Unlikely. akpm seems to be a pre-Linux-kernel nick.

http://en.wikipedia.org/wiki/Andrew_Morton_%28computer_programmer%29

	-hpa
Re: mkinitrd and RAID6 on FC5
Guy Watkins wrote:
> Is this a REDHAT only problem/bug? If so, since bugzilla.redhat.com gets ignored, where do I complain?

Yes, this is Redhat only, and as far as I know, it was fixed a long time ago. I suspect you need to make sure you upgrade your entire system, especially mkinitrd, not just the kernel.

	-hpa
Re: mkinitrd and RAID6 on FC5
Guy Watkins wrote:
> I tried to update/upgrade and no updates are available for mkinitrd. Do you know what version has the fix? The bugzilla was never closed, so it seems it has not been fixed.
>
> My version: mkinitrd.i386 5.0.32-2 installed
>
> I guess Red Hat decided not to fix this in FC5.

It does work in FC6; for FC5 I guess you're stuck passing --with=raid456 to mkinitrd :-/

	-hpa
Re: mismatch_cnt questions
Andre Noll wrote:
> On 00:21, H. Peter Anvin wrote:
>> I have just updated the paper at:
>> http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
>> ... with this information (in slightly different notation and with a bit more detail.)
> There's a typo in the new section: s/By assumption, X_z != D_n/By assumption, X_z != D_z/

Thanks, fixed.

	-hpa
Re: Reshaping raid0/10
Neil Brown wrote:
> If I wanted to reshape a raid0, I would just morph it into a raid4 with a missing parity drive, then use the raid5 code to restripe it. Then morph it back to regular raid0.

Wow, that made my brain hurt.

Given the fact that we're going to have to do this on kernel.org soon, what would be the concrete steps involved (we're going to have to change a 3-member raid0 into a 4-member raid0)...

	-hpa
Re: mismatch_cnt questions
Bill Davidsen wrote:
> When last I looked at Hamming codes, and that would be 1989 or 1990, I believe I learned that the number of Hamming bits needed to cover N data bits was 1+log2(N), which for 512 bytes would be 1+12, and would fit into a 16-bit field nicely. I don't know that I would go that way (fix any one-bit error, detect any two-bit error) rather than a CRC, which gives me only one chance in 64k of an undetected data error, but I find it interesting.

A Hamming code across the bytes of a sector is pretty darn pointless, since that's not a typical failure pattern.

	-hpa
Re: RAID1, hot-swap and boot integrity
Mike Accetta wrote:
> I gathered the impression somewhere, perhaps incorrectly, that the active flag was a function of the boot block, not the BIOS. We use Grub in the MBR and don't even have an active flag set in the partition table. The system still boots.

The active flag is indeed an MBR issue.

	-hpa
Re: mismatch_cnt questions
H. Peter Anvin wrote:
> Eyal Lebedinsky wrote:
>> Neil Brown wrote:
>>> [trim Q re how resync fixes data]
>>> For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy and writing it over all other copies. For raid5 we assume the data is correct and update the parity.
>> Can raid6 identify the bad block (two parity blocks could allow this if only one block has bad data in a stripe)? If so, does it? This will surely mean more value for raid6 than just the two-disk-failure protection.
> No. It's not mathematically possible.

Okay, I've thought about it, and I got it wrong the first time (an off-the-cuff misapplication of the pigeonhole principle.) It apparently *is* possible (for notation and algebra rules, see my paper):

Let's assume we know exactly one of the data (Dn) drives is corrupt (ignoring the case of P or Q corruption for now.) That means instead of Dn we have a corrupt value, Xn. Note that which data drive is corrupt (n) is not known. We compute P' and Q' as the computed values over the corrupt set:

	P + P' = Dn + Xn
	Q + Q' = g^n Dn + g^n Xn = g^n (Dn + Xn)	(g = {02})

By assumption, Dn != Xn, so P + P' = Dn + Xn != {00}. g^n is *never* {00}, so Q + Q' = g^n (Dn + Xn) != {00}. Therefore:

	(Q+Q')/(P+P') = [g^n (Dn+Xn)]/(Dn+Xn) = g^n

Since n is known to be in the range [0,255), we thus have:

	n = log_g((Q+Q')/(P+P'))

... which is a well-defined relation.

For the case where either the P or the Q drive is corrupt (and the data drives are all good), this is easily detected by the fact that if P is the corrupt drive, Q + Q' = {00}; similarly, if Q is the corrupt drive, P + P' = {00}. Obviously, if P + P' = Q + Q' = {00}, then as far as RAID-6 can discover, there is no corruption in the drive set.

So, yes, RAID-6 *can* detect single drive corruption, and even tell you which drive it is, if you're willing to compute a full syndrome set (P', Q') on every read (as well as on every write.)

Note: RAID-6 cannot detect 2-drive corruption, unless of course the corruption is in different byte positions. If multiple corresponding byte positions are corrupt, then the algorithm above will generally point you to a completely innocent drive.

	-hpa
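To see the arithmetic in action, here is a worked example with made-up numbers. Suppose drive n = 3 holds D3 = {4e} but returns X3 = {27}. Then

	P + P' = {4e} + {27} = {69}
	Q + Q' = g^3 ({4e} + {27}) = g^3 {69}

Multiplying {69} by g = {02} three times (XORing with {1d} after each shift that carries out of the top bit) gives {69} -> {d2} -> {b9} -> {6f}, so Q + Q' = {6f}. Dividing, (Q+Q')/(P+P') = {6f}/{69} = g^3, hence n = log_g(g^3) = 3, which points back at the drive that was corrupted.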
Re: RAID1, hot-swap and boot integrity
Mike Accetta wrote:
> I wonder if having the MBR typically outside of the array and the relative newness of partitioned arrays are related? When I was considering how to architect the RAID1 layout, it seemed like a partitioned array on the entire disk worked most naturally.

It's one way to do it, for sure. The main problem with that, of course, is that it's not compatible with other operating systems.

	-hpa
Re: RAID1, hot-swap and boot integrity
Mike Accetta wrote:
> I've been considering trying something like having the re-sync algorithm on a whole-disk array defer the copy of sector 0 to the very end of the re-sync operation. Assuming the BIOS makes at least a minimal consistency check on sector 0 before electing to boot from the drive, this would keep it from selecting a partially re-sync'd drive that was not previously bootable.

The only check that it will make is to look for 55 AA at the end of the MBR. Note that typically the MBR is not part of any of your MD volumes.

	-hpa
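For illustration, a user-space sketch of that one check: the only validity test a typical BIOS applies to sector 0 is the 0x55 0xAA signature in the last two bytes. The device path below is just an example:

#include <stdio.h>
#include <stdint.h>

static int mbr_has_boot_signature(const uint8_t sector[512])
{
	return sector[510] == 0x55 && sector[511] == 0xaa;
}

int main(int argc, char *argv[])
{
	uint8_t buf[512];
	FILE *f = fopen(argc > 1 ? argv[1] : "/dev/sda", "rb");

	if (!f) {
		perror("open");
		return 1;
	}
	if (fread(buf, 1, sizeof buf, f) != sizeof buf) {
		fprintf(stderr, "short read on sector 0\n");
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("55 AA boot signature %s\n",
	       mbr_has_boot_signature(buf) ? "present" : "missing");
	return 0;
}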
Re: end to end error recovery musings
James Bottomley wrote:
> On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
>> 4104. It's 8 bytes per hardware sector. At least for T10...
> Er ... that won't look good to the 512 ATA compatibility remapping ...

Well, in that case you'd only see 8x512 data bytes, no metadata...

	-hpa
Re: end to end error recovery musings
Theodore Tso wrote:
> In any case, the reason why I bring this up is that it would be really nice if there was a way with a single laptop drive to be able to do snapshots and background fsck's without having to use initrd's with device mapper.

This is a major part of why I've been trying to push integrated klibc, to have all that stuff as a unified kernel deliverable. Unfortunately, as you know, Linus apparently rejected the concept, at least for now, at LKS last year.

With klibc this stuff could still be in one single wrapper without funny dependencies, but wouldn't have to be ported to kernel space.

	-hpa
Re: end to end error recovery musings
Ric Wheeler wrote:
> We still have the following challenges:
>
> (1) read-ahead often means that we will retry every bad sector at least twice from the file system level. The first time, the fs read-ahead request triggers a speculative read that includes the bad sector (triggering the error handling mechanisms), right before the real application read does the same thing. Not sure what the answer is here since read-ahead is obviously a huge win in the normal case.

Probably the only sane thing to do is to remember the bad sectors and avoid attempting to read them; that would mean marking automatic versus explicitly requested requests to determine whether or not to filter them against a list of discovered bad blocks.

	-hpa
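A rough sketch of that idea follows. The structures and names are invented for illustration (this is not existing kernel API), and a real implementation would live in the block layer or the filesystem:

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t sector_t;

struct badlist {
	sector_t *sectors;	/* sorted list of sectors that already failed */
	size_t count;
};

static int cmp_sector(const void *a, const void *b)
{
	const sector_t x = *(const sector_t *)a, y = *(const sector_t *)b;
	return (x > y) - (x < y);
}

static bool badlist_contains(const struct badlist *bl, sector_t s)
{
	return bsearch(&s, bl->sectors, bl->count, sizeof s, cmp_sector) != NULL;
}

/* Decide whether to issue a read covering [start, start+len) sectors.
 * Explicit reads always go to the device so the application sees the error;
 * speculative readahead skips known-bad sectors instead of retrying them. */
static bool should_issue_read(const struct badlist *bl, sector_t start,
			      unsigned int len, bool is_readahead)
{
	if (!is_readahead)
		return true;
	for (unsigned int i = 0; i < len; i++)
		if (badlist_contains(bl, start + i))
			return false;
	return true;
}

int main(void)
{
	sector_t bad[] = { 1000, 1001, 5000 };
	struct badlist bl = { bad, 3 };

	/* A readahead touching sector 1000 is dropped; an explicit read is not. */
	printf("readahead 996..1003: %d\n", should_issue_read(&bl, 996, 8, true));
	printf("explicit  996..1003: %d\n", should_issue_read(&bl, 996, 8, false));
	printf("readahead 2000..2007: %d\n", should_issue_read(&bl, 2000, 8, true));
	return 0;
}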
Re: end to end error recovery musings
Andreas Dilger wrote:
> And clearing this list when the sector is overwritten, as it will almost certainly be relocated at the disk level.

Certainly if the overwrite is successful.

	-hpa
Re: PATA/SATA Disk Reliability paper
Richard Scobie wrote:
> Thought this paper may be of interest. A study done by Google on over 100,000 drives they have/had in service.
>
> http://labs.google.com/papers/disk_failures.pdf

Bastards:

	"Failure rates are known to be highly correlated with drive models, manufacturers and vintages [18]. Our results do not contradict this fact. For example, Figure 2 changes significantly when we normalize failure rates per each drive model. Most age-related results are impacted by drive vintages. However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data."

	-hpa
Re: [md] RAID6: clean up CPUID and FPU enter/exit code
My apologies for the screwed-up 'To:' line in the previous email... I did -s `head -1 file` instead of -s `head -1 file` by mistake [:^O

	-hpa (who is going to bed now...)
Re: strange raid6 assembly problem
Mickael Marchand wrote:
> so basically I don't really know what to do with my sdf3 at the moment and fear to reboot again :o)
> maybe a --re-add /dev/sdf3 could work here ? but will it survive a reboot ?

At this point, for whatever reason, your kernel doesn't see /dev/sdf3 as part of the array. You could mdadm --add it, and yes, it should survive a reboot. Unless something is seriously goofy, of course, but that's impossible to determine from your trouble report.

A RAID-6 in two-disk degraded mode often ends up needing two recovery passes (one to go from 2 missing drives to 1, and one from 1 to 0). This isn't a technical need, but is a result of the fact that unless you happen to have two hot spares standing by, the 2->1 recovery typically will have started by the time the second disk is added. This may be the source of your strangeness.

	-hpa
Re: Linux: Why software RAID?
Chris Friesen wrote:
> Jeff Garzik wrote:
>> But anyway, to help answer the question of hardware vs. software RAID, I wrote up a page:
>> http://linux.yyz.us/why-software-raid.html
> Just curious... with these guys (http://www.bigfootnetworks.com/KillerOverview.aspx) putting linux on a PCI NIC to allow them to bypass Windows' network stack, has anyone ever considered doing hardware raid by using an embedded cpu running linux software RAID, with battery-backed memory? It would theoretically allow you to remain feature-compatible by downloading new kernels to your RAID card.

Yes. In fact, I have been told by several RAID chip vendors that their customers are *strongly* demanding that their chips be able to run Linux md (and still use whatever hardware offload features.) So it's happening.

	-hpa
Re: Multiple raids on one machine?
Chris Allen wrote:
> 2. Partition the raw disks into four partitions and make /dev/md0,md1,md2,md3.
>
> But am I heading for problems here? Is there going to be a big performance hit with four raid5 arrays on the same machine? Am I likely to have dataloss problems if my machine crashes?

2 should work just fine.

	-hpa
Re: Ok to go ahead with this setup?
Molle Bestefich wrote:
> Christian Pernegger wrote:
>> Intel SE7230NH1-E mainboard
>> Pentium D 930
> HPA recently said that x86_64 CPUs have better RAID5 performance.

Actually, anything with SSE2 should be OK.

	-hpa
Re: Ok to go ahead with this setup?
Molle Bestefich wrote:
> Christian Pernegger wrote:
>> Anything specific wrong with the Maxtors?
> No. I've used Maxtor for a long time and I'm generally happy with them. They break now and then, but their online warranty system is great. I've also been treated kindly by their help desk; talked to a cute gal from Maxtor in Ireland over the phone just yesterday ;-). Then again, they've just been acquired by Seagate, or so, so things may change for the worse, who knows.
>
> I'd watch out regarding the Western Digital disks; apparently they have a bad habit of turning themselves off when used in RAID mode, for some reason:
> http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/1980/

I have exactly the opposite experience. More than 50% of Maxtor drives fail inside 18 months; WDs seem to be really solid.

	-hpa
Re: Curious code in autostart_array
Pete Zaitcev wrote:
> Hi, guys:
>
> My copy of 2.6.17-rc5 has the following code in autostart_array():
>
> 	mdp_disk_t *desc = sb->disks + i;
> 	dev_t dev = MKDEV(desc->major, desc->minor);
>
> 	if (!dev)
> 		continue;
> 	if (dev == startdev)
> 		continue;
> 	if (MAJOR(dev) != desc->major || MINOR(dev) != desc->minor)
> 		continue;
>
> Under what conditions do you think the last if() statement can fire? What is its purpose? This looks like an attempt to detect bit clipping. But what exactly?

It can fire if either desc->major or desc->minor overflows the respective field in dev_t. Unfortunately, it's not guaranteed to do so.

	-hpa
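To illustrate the point, here is a small stand-alone example of the round-trip test firing when a field overflows. The 12/20 bit split mimics the modern MKDEV()/MAJOR()/MINOR() macros, but real kernels have varied, so treat the exact widths as an assumption of this sketch:

#include <stdio.h>
#include <stdint.h>

#define MINORBITS	20
#define MKDEV(ma, mi)	(((ma) << MINORBITS) | (mi))
#define MAJOR(dev)	((uint32_t)(dev) >> MINORBITS)
#define MINOR(dev)	((uint32_t)(dev) & ((1u << MINORBITS) - 1))

int main(void)
{
	/* A minor number too wide for its field, as could come from a
	 * damaged or foreign superblock. */
	uint32_t desc_major = 8;
	uint32_t desc_minor = (1u << MINORBITS) | 5;
	uint32_t dev = MKDEV(desc_major, desc_minor);

	if (MAJOR(dev) != desc_major || MINOR(dev) != desc_minor)
		printf("clipping detected: %u:%u came back as %u:%u\n",
		       desc_major, desc_minor, MAJOR(dev), MINOR(dev));
	else
		printf("round-trip OK\n");
	return 0;
}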
Re: raid6
Followup to: [EMAIL PROTECTED]
By author: =?GB2312?B?uPDQ29fK?= [EMAIL PROTECTED]
In newsgroup: linux.dev.raid
> I am confronted with a big problem with the raid6 algorithm, as I have recently been studying the raid6 code of Linux 2.6 that you contributed. Unfortunately I cannot understand the algorithm of the P+Q parity in this program. Is there some formula for this raid6 algorithm? I would really appreciate your help; could you show me some details about this algorithm?

http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

	-hpa
Re: which CPU for XOR?
Followup to: [EMAIL PROTECTED]
By author: Dexter Filmore [EMAIL PROTECTED]
In newsgroup: linux.dev.raid
> What type of operation is XOR anyway? Should be ALU, right? So - what CPU is best for software raid? One with high integer processing power?

Something with massive wide vector registers. PowerPC with Altivec totally kicks ass; x86-64 isn't too bad either.

There are also some processors with builtin RAID accelerators; at least Intel, Broadcom and AMCC make them.

	-hpa
Re: And then there was Bryce...
Followup to: [EMAIL PROTECTED]
By author: John Stoffel [EMAIL PROTECTED]
In newsgroup: linux.dev.raid
> The problem is more likely that your /etc/mdadm/mdadm.conf file is specifying exactly which partitions to use, instead of just doing something like the following:
>
> DEVICE partitions
> ARRAY /dev/md0 level=raid1 auto=yes num-devices=2 UUID=2e078443:42b63ef5:cc179492:aecf0094
>
> Which should do the trick for you. Can you post your mdadm.conf file so we can look it over?

Hey guys, look at the syslog output again. He's using kernel autorun.

	-hpa
Re: And then there was Bryce...
Followup to: [EMAIL PROTECTED]
By author: Henrik Holst [EMAIL PROTECTED]
In newsgroup: linux.dev.raid
> The same happened to me with eth0-2. I _could_ not for my life understand why I didn't get the internet connection to work. But then I realized that eth0 and eth1 had been swapped after I upgraded to udev.
>
> Please consult your distribution's udev documentation on how to lock down scsi and network cards to specific kernel names.

This doesn't explain how come it bound drives without superblocks. It should only bind drives with the correct superblock UUID, EVER. Udev doesn't actually matter here, since the kernel, not udev, assigns the numbers to the drives.

	-hpa
Re: Problem with large devices 2TB
Followup to: [EMAIL PROTECTED]
By author: Jim Klimov [EMAIL PROTECTED]
In newsgroup: linux.dev.raid
> Since the new parted worked ok (the older one didn't), we were happy until we tried a reboot. During the device initialization and after it, the system only recognises the 6 or 7 partitions which start before the 2000Gb limit.

For a DOS partition table, there is no such thing as a partition starting beyond 2 TB: the partition start and length are 32-bit sector counts, so with 512-byte sectors nothing past 2^32 * 512 bytes = 2 TiB can even be described. You need to use a GPT or other more sophisticated partition table.

	-hpa
Re: [PATCH 005 of 11] md: Merge raid5 and raid6 code
NeilBrown wrote:
> There is a lot of commonality between raid5.c and raid6main.c. This patch merges both into one module called raid456. This saves a lot of code, and paves the way for online raid5->raid6 migrations.
>
> There is still duplication, e.g. between handle_stripe5 and handle_stripe6. This will probably be cleaned up later.
>
> Cc: H. Peter Anvin [EMAIL PROTECTED]
> Signed-off-by: Neil Brown [EMAIL PROTECTED]

Wonderful! Thank you for doing this :)

	-hpa
Re: [klibc] Re: Exporting which partitions to md-configure
Luca Berra wrote:
>> This, in fact, is *EXACTLY* what we're talking about; it does require autoassemble. Why do we care about the partition types at all? The reason is that since the md superblock is at the end, it doesn't get automatically wiped if the partition is used as a raw filesystem, and so it's important that there is a qualifier for it.
> I don't like using the partition type as a qualifier; there are people who do not wish to partition their drives, there are systems not supporting msdos-like partitions, heck, even m$ is migrating away from those.

That's why we're talking about non-msdos partitioning schemes.

> In any case, if that has to be done it should be done in mdadm, not in a different script that is going to call mdadm (behaviour should be consistent between mdadm invoked by initramfs and mdadm invoked on a running system).

Agreed. mdadm is the best place for it.

> If the user wants to reutilize a device that was previously a member of an md array he/she should use mdadm --zero-superblock to remove the superblock. I see no point in having a system that tries to compensate for users not following correct procedures. sorry.

You don't? That surprises me... making it harder for the user to have accidental data loss sounds like a very good thing to me.

	-hpa
Re: [klibc] Re: Exporting which partitions to md-configure
Luca Berra wrote:
> making it harder for the user is a good thing, but please not at the expense of usability

What's the usability problem?

	-hpa
Re: [klibc] Re: Exporting which partitions to md-configure
Neil Brown wrote:
> What constitutes 'a piece of data'? A bit? A byte? I would say that msdos:fd is one piece of data. The 'fd' is useless without the 'msdos'. The 'msdos' is, I guess, not completely useless without the 'fd'. I would lean towards the composite, but I wouldn't fight a separation.

Well, the two pieces come from different sources.

> Just as there is a direct, unambiguous causal path from something present at early boot to the root filesystem that is mounted (and the root filesystem specifies all other filesystems through fstab), similarly there should be an unambiguous causal path from something present at early boot to the array which holds the root filesystem - and the root filesystem should describe all other arrays via mdadm.conf. Does that make sense?

It makes sense, but I disagree. I believe you are correct in that the current preferred-minor bit causes an invalid assumption that, e.g., /dev/md3 is always a certain thing, but since each array has a UUID, and one should be able to mount by either filesystem UUID or array UUID, there should be no need for such a conflict if one allows for dynamic md numbers.

Requiring that mdadm.conf describe the actual state of all volumes would be an enormous step in the wrong direction. Right now, the Linux md system can handle some very oddball hardware changes (such as on hera.kernel.org, when the disks not just completely changed names due to a controller change, but changed from hd* to sd*!) Dynamicity is a good thing, although it needs to be harnessed.

> kernel parameter md_root_uuid=xxyy:zzyy:aabb:ccdd...
> This could be interpreted by an initramfs script to run mdadm to find and assemble the array with that uuid. The uuid of each array is reasonably unique.

This, in fact, is *EXACTLY* what we're talking about; it does require autoassemble.

Why do we care about the partition types at all? The reason is that since the md superblock is at the end, it doesn't get automatically wiped if the partition is used as a raw filesystem, and so it's important that there is a qualifier for it.

	-hpa
Re: Exporting which partitions to md-configure
Neil Brown wrote:
> On Monday January 30, [EMAIL PROTECTED] wrote:
>> Any feeling how best to do that? My current thinking is to export a flags entry in addition to the current ones, presumably based on struct parsed_partitions->parts[].flags (fs/partitions/check.h), which seems to be what causes md_autodetect_dev() to be called.
> I think I would prefer a 'type' attribute in each partition that records the 'type' from the partition table. This might be more generally useful than just for md. Then your userspace code would have to look for '253' and use just those partitions.

What about non-DOS partitions?

	-hpa
Re: Exporting which partitions to md-configure
Kyle Moffett wrote:
> Well, for an MSDOS partition table, you would look for '253', for a Mac partition table you could look for something like 'Linux_RAID' or similar (just arbitrarily define some name beginning with the Linux_ prefix), etc. This means that the partition table type would need to be exposed as well (I don't know if it is already).

It's not, but perhaps exporting format and type as distinct attributes is the way to go. The policy for which partitions to consider would live entirely in kinit that way. 'type' would be format-specific; in EFI it's a UUID.

This, of course, is a bigger change, but it just might be worth it.

	-hpa
Re: Exporting which partitions to md-configure
Neil Brown wrote:
>>> Well, grepping through fs/partitions/*.c, the 'flags' thing is set by efi.c, msdos.c, sgi.c and sun.c. Of these, efi compares something against PARTITION_LINUX_RAID_GUID, and msdos.c, sgi.c and sun.c compare something against LINUX_RAID_PARTITION. The former would look like e6d6d379-f507-44c2-a23c-238f2a3df928 in sysfs (I think); the latter would look like fd (I suspect). These are both easily recognisable with no real room for confusion.
>> Well, if we're going to have a generic facility it should make sense across the board. If all we're doing is supporting legacy usage we might as well export a flag. I guess we could have a single entry with a string of the form efi:e6d6d379-f507-44c2-a23c-238f2a3df928 or msdos:fd etc -- it really doesn't make any difference to me, but it seems cleaner to have two pieces of data in two different sysfs entries.
> And if other partition styles wanted to add support for raid auto detect, tell them no. It is perfectly possible and even preferable to live without autodetect. We should support legacy usage (those above) but should discourage any new usage.

Why is that, keeping in mind this will all be done in userspace?

	-hpa
Re: Exporting which partitions to md-configure
Neil Brown wrote:
>> Mac partition tables don't currently support autodetect (as far as I can tell).
> Let's keep it that way.

For now I guess I'll just take the code from init/do_mounts_md.c; we can worry about ripping the RAID_AUTORUN code out of the kernel later.

	-hpa
Re: Adding Reed-Solomon Personality to MD, need help/advice
Jeff Breidenbach wrote:
>> The fundamental problem is that generic RS requires table lookups even in the common case, whereas RAID-6 uses shortcuts to substantially speed up the computation in the common case. If one wanted to support a typical 8-bit RS code (which supports a max of 256 drives, including ECC drives) it is already way too big to use a table.
> RS is typically done with finite field math calculations which are - relatively - fast but they are much heavier than a parity calculation. Here is one commercial benchmark; note the throughput numbers at the bottom of the page.

Well, most of them are implemented via tables (GF log table, etc.) They tend to perform poorly on modern hardware.

	-hpa
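For what it's worth, here is a small self-contained illustration of the shortcut in question: generating Q only ever multiplies by the generator {02}, which needs no tables at all and vectorizes trivially, unlike a general log/exp-table multiply. The packed-byte trick is essentially the one the kernel's int.uc code spells as SHLBYTE/MASK, shown here in plain C for four bytes at a time:

#include <stdint.h>
#include <stdio.h>

/* Reference: multiply one GF(2^8) element by {02} (polynomial 0x11d). */
static uint8_t xtime(uint8_t a)
{
	return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
}

/* Fast path: the same multiply applied to four packed bytes at once using
 * only shifts, masks and XOR; the same idea scales to SSE2/AltiVec widths. */
static uint32_t xtime4(uint32_t v)
{
	uint32_t mask = v & 0x80808080u;	/* bytes with the top bit set */
	mask = (mask << 1) - (mask >> 7);	/* 0x80 -> 0xff, 0x00 -> 0x00 */
	return ((v << 1) & 0xfefefefeu) ^ (mask & 0x1d1d1d1du);
}

int main(void)
{
	for (unsigned int a = 0; a < 256; a++) {
		uint32_t packed = 0x01010101u * a;		/* a in all four lanes */
		uint32_t expect = 0x01010101u * xtime((uint8_t)a);
		if (xtime4(packed) != expect) {
			printf("mismatch at 0x%02x\n", a);
			return 1;
		}
	}
	printf("packed multiply by {02} matches the byte-wise reference\n");
	return 0;
}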
Re: EVMS or md?
Followup to: [EMAIL PROTECTED]
By author: David Kewley [EMAIL PROTECTED]
In newsgroup: linux.dev.raid
> Mike Tran wrote on Monday 04 April 2005 12:28:
>> We (EVMS team) intended to support RAID6 last year. But as we all remember, RAID6 was not stable then. I may write a plugin to support RAID6 soon.
> Hi Mike,
>
> In your view, is RAID6 now considered stable? How soon might you have an evms plugin for it? ;) I'd love to use evms on my new fileserver if it supported RAID6.

I can't speak for the EVMS people, but I got to stress-test my RAID6 test system some this weekend; after having run in 1-disk degraded mode for several months (thus showing that the big bad degraded write bug has been thoroughly fixed) I changed the motherboard, and the kernel didn't support one of the controllers. And now there were 2 missing drives.

Due to some bootloader problems, I ended up yo-yoing between the two kernels a bit more than I intended to, and went through quite a few RAID disk losses and rebuilds as a result. No hiccups, data losses, or missing functionality. At the end of the whole ordeal, the filesystem (1 TB, 50% full) was still quite pristine, and fsck confirmed this. I was quite pleased :)

Oh, and doing the N-2 -> N-1 rebuild is slow (obviously), but not outrageously so. It rebuilt the 1 TB array in a matter of single-digit hours. CPU utilization was quite high, obviously, but it didn't cripple the system by any means.

	-hpa
Re: Forcing a more random uuid (random seed bug)
Followup to: [EMAIL PROTECTED]
By author: Niccolo Rigacci [EMAIL PROTECTED]
In newsgroup: linux.dev.raid
> I get /dev/md5, /dev/md6, /dev/md7 and /dev/md8 all with the same UUID!
>
> It seems that there is a bug in mdadm: when generating the UUID for a volume, the random() function is called, but the random sequence is never initialized. The result is that every volume created with mdadm has a UUID of:
>
> 6b8b4567:327b23c6:643c9869:66334873
>
> See also Debian bug 292784 at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=292784
>
> I fixed the problem adding the following patch to mdadm.c, but please bear in mind that I'm totally unaware of mdadm code and quite naive in C programming:

Please don't use (s)random at all, except as a possible fallback to /dev/(u)random.

	-hpa
Re: Forcing a more random uuid (random seed bug)
Followup to: [EMAIL PROTECTED]
By author: [EMAIL PROTECTED]
In newsgroup: linux.dev.raid
>> +    if ((my_fd = open("/dev/random", O_RDONLY)) != -1) {
> Please use /dev/urandom for such applications. /dev/random is the highest-quality generator, but will block if entropy isn't available. /dev/urandom provides the best available, immediately, which is what this application wants.

Not 100% clear; the best would be to make it configurable. Either way, you must not use read() in the way described. Short reads happen, even with /dev/urandom.

> Also, this will only produce 2^32 possible UUIDs, since that's the size of the seed. Meaning that after you've generated 2^16 of them, the chances are excellent that they're not UU any more. You might just want to get all 128 (minus epsilon) bits from /dev/urandom directly.

You *do* want to get all bits from /dev/urandom directly.

	-hpa
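A minimal sketch of what that looks like in practice: reading the full 128 bits straight from /dev/urandom while coping with short reads. Error handling is kept to a minimum, and the output format merely mimics mdadm's UUID notation:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

static int get_random_bytes_from(const char *path, void *buf, size_t len)
{
	unsigned char *p = buf;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	while (len) {
		ssize_t n = read(fd, p, len);
		if (n <= 0) {
			if (n < 0 && errno == EINTR)
				continue;	/* interrupted: just retry */
			close(fd);
			return -1;
		}
		p += n;		/* short reads happen: keep reading */
		len -= n;
	}
	close(fd);
	return 0;
}

int main(void)
{
	uint32_t uuid[4];	/* 128-bit md UUID as four 32-bit words */

	if (get_random_bytes_from("/dev/urandom", uuid, sizeof uuid)) {
		perror("/dev/urandom");
		return 1;
	}
	printf("%08x:%08x:%08x:%08x\n", uuid[0], uuid[1], uuid[2], uuid[3]);
	return 0;
}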
[PATCH] RAID Kconfig cleanups, remove experimental tag from RAID-6
This patch removes the experimental tag from RAID-6 (unfortunately the damage is already done...:-|) and cleans up a few more things in the Kconfig file. Signed-Off-By: H. Peter Anvin [EMAIL PROTECTED] Index: linux-2.5/drivers/md/Kconfig === RCS file: /home/hpa/kernel/bkcvs/linux-2.5/drivers/md/Kconfig,v retrieving revision 1.17 diff -u -r1.17 Kconfig --- linux-2.5/drivers/md/Kconfig 15 Jan 2005 23:46:55 - 1.17 +++ linux-2.5/drivers/md/Kconfig 8 Feb 2005 22:02:42 - @@ -93,7 +93,7 @@ mirroring (RAID-1) with easier configuration and more flexable layout. Unlike RAID-0, but like RAID-1, RAID-10 requires all devices to - be the same size (or atleast, only as much as the smallest device + be the same size (or at least, only as much as the smallest device will be used). RAID-10 provides a variety of layouts that provide different levels of redundancy and performance. @@ -102,6 +102,7 @@ ftp://ftp.kernel.org/pub/linux/utils/raid/mdadm/ + If unsure, say Y. config MD_RAID5 tristate RAID-4/RAID-5 mode @@ -120,20 +121,16 @@ http://www.tldp.org/docs.html#howto. There you will also learn where to get the supporting user space utilities raidtools. - If you want to use such a RAID-4/RAID-5 set, say Y. To compile - this code as a module, choose M here: the module will be called raid5. + If you want to use such a RAID-4/RAID-5 set, say Y. To + compile this code as a module, choose M here: the module + will be called raid5. If unsure, say Y. config MD_RAID6 - tristate RAID-6 mode (EXPERIMENTAL) - depends on BLK_DEV_MD EXPERIMENTAL + tristate RAID-6 mode + depends on BLK_DEV_MD ---help--- - WARNING: RAID-6 is currently highly experimental. If you - use it, there is no guarantee whatsoever that it won't - destroy your data, eat your disk drives, insult your mother, - or re-appoint George W. Bush. - A RAID-6 set of N drives with a capacity of C MB per drive provides the capacity of C * (N - 2) MB, and protects against a failure of any two drives. For a given sector @@ -150,7 +147,7 @@ this code as a module, choose M here: the module will be called raid6. - If unsure, say N. + If unsure, say Y. config MD_MULTIPATH tristate Multipath I/O support
Re: [PATCH md 2 of 4] Fix raid6 problem
Lars Marowsky-Bree wrote:
> On 2005-02-03T08:39:41, H. Peter Anvin [EMAIL PROTECTED] wrote:
>> Yes, right now there is no RAID5->RAID6 conversion tool that I know of.
> Hm. One of the checksums is identical, as is the disk layout of the data, no?

No, the layout is different.

	-hpa
Re: [PATCH md 2 of 4] Fix raid6 problem
Guy wrote:
> Would you say that the 2.6 kernel is suitable for storing mission-critical data, then?

Sure. I'd trust 2.6 over 2.4 at this point.

> I ask because I have read about a lot of problems with data corruption and oopses on this list and the SCSI list. But in most or all cases the 2.4 kernel does not have the same problem.

I haven't seen any problems like that, including on kernel.org, which is definitely a high-demand site.

> Who out there has a RAID6 array that they believe is stable and safe? And please give some details about the array. Number of disks, sizes, LVM, FS, SCSI, ATA and anything else you can think of? Also, details about any disk failures and how well recovery went?

The one I have is a 6-disk ATA array (6x250 GB), ext3. Had one disk failure which hasn't been replaced yet; it's successfully running in 1-disk degraded mode. I'll let other people speak for themselves.

	-hpa
Re: [PATCH md 2 of 4] Fix raid6 problem
Followup to: [EMAIL PROTECTED]
By author: A. James Lewis [EMAIL PROTECTED]
In newsgroup: linux.dev.raid
> Sorry for the delay in replying. I've been using RAID6 in a real-life situation with 2.6.9 + patch for 2 months now, with 1.15Tb of storage, and I have had more than 1 drive failure... as well as some rather embarrassing hardware corruption which I traced to a faulty IDE controller.
>
> Despite some random DMA corruption, and losing a total of 3 disks, I have not had any problems with RAID6 itself, and it has literally saved my data from being lost.
>
> I ran a diff against the 2.6.9 patch and what is in 2.6.10... and they are not the same; presumably a more elegant fix has been implemented for the production kernel??

I think there are some other (generic) fixes in there too.

Anyway... I'm thinking of sending in a patch to take out the experimental status of RAID-6. I have been running a 1 TB production server in 1-disk degraded mode for about a month now without incident.

	-hpa