Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
On Mon, 14 Jan 2008, NeilBrown wrote:

> raid5's 'make_request' function calls generic_make_request on
> underlying devices and if we run out of stripe heads, it could end up
> waiting for one of those requests to complete.  This is bad as
> recursive calls to generic_make_request go on a queue and are not
> even attempted until make_request completes.
>
> So: don't make any generic_make_request calls in raid5 make_request
> until all waiting has been done.  We do this by simply setting
> STRIPE_HANDLE instead of calling handle_stripe().
>
> If we need more stripe_heads, raid5d will get called to process the
> pending stripe_heads which will call generic_make_request from a
> different thread where no deadlock will happen.
>
> This change by itself causes a performance hit.  So add a change so
> that raid5_activate_delayed is only called at unplug time, never in
> raid5.  This seems to bring back the performance numbers.  Calling it
> in raid5d was sometimes too soon...
>
> Cc: Dan Williams [EMAIL PROTECTED]
> Signed-off-by: Neil Brown [EMAIL PROTECTED]

probably doesn't matter, but for the record:

Tested-by: dean gaudet [EMAIL PROTECTED]

this time i tested with internal and external bitmaps and it survived 8h
and 14h resp. under the parallel tar workload i used to reproduce the
hang.

btw this should probably be a candidate for 2.6.22 and .23 stable.

thanks
-dean
Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5
On Tue, 15 Jan 2008, Andrew Morton wrote:

> On Tue, 15 Jan 2008 21:01:17 -0800 (PST) dean gaudet [EMAIL PROTECTED] wrote:
>
> > On Mon, 14 Jan 2008, NeilBrown wrote:
> >
> > > raid5's 'make_request' function calls generic_make_request on
> > > underlying devices and if we run out of stripe heads, it could end
> > > up waiting for one of those requests to complete.  This is bad as
> > > recursive calls to generic_make_request go on a queue and are not
> > > even attempted until make_request completes.
> > >
> > > So: don't make any generic_make_request calls in raid5
> > > make_request until all waiting has been done.  We do this by
> > > simply setting STRIPE_HANDLE instead of calling handle_stripe().
> > >
> > > If we need more stripe_heads, raid5d will get called to process
> > > the pending stripe_heads which will call generic_make_request from
> > > a different thread where no deadlock will happen.
> > >
> > > This change by itself causes a performance hit.  So add a change
> > > so that raid5_activate_delayed is only called at unplug time,
> > > never in raid5.  This seems to bring back the performance numbers.
> > > Calling it in raid5d was sometimes too soon...
> > >
> > > Cc: Dan Williams [EMAIL PROTECTED]
> > > Signed-off-by: Neil Brown [EMAIL PROTECTED]
> >
> > probably doesn't matter, but for the record:
> >
> > Tested-by: dean gaudet [EMAIL PROTECTED]
> >
> > this time i tested with internal and external bitmaps and it
> > survived 8h and 14h resp. under the parallel tar workload i used to
> > reproduce the hang.
> >
> > btw this should probably be a candidate for 2.6.22 and .23 stable.
>
> hm, Neil said
>
> > The first fixes a bug which could make it a candidate for 24-final.
> > However it is a deadlock that seems to occur very rarely, and has
> > been in mainline since 2.6.22.  So letting it into one more release
> > shouldn't be a big problem.  While the fix is fairly simple, it
> > could have some unexpected consequences, so I'd rather go for the
> > next cycle.
>
> food fight!

heheh.

it's really easy to reproduce the hang without the patch -- i could hang
the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.  i'll try
with ext3... Dan's experiences suggest it won't happen with ext3 (or is
even more rare), which would explain why this is overall a rare problem.

but it doesn't result in dataloss or permanent system hangups as long as
you can become root and raise the size of the stripe cache...

so OK i agree with Neil, let's test more...

food fight over! :)

-dean
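[For reference, the stripe cache workaround mentioned above amounts to a
one-liner; a minimal sketch, assuming the hung array is /dev/md2 (the
device name is hypothetical):

    # bump the stripe cache (number of entries, not bytes); needs root
    echo 1024 > /sys/block/md2/md/stripe_cache_size

    # how many entries are currently in use
    cat /sys/block/md2/md/stripe_cache_active
]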
Re: 2.6.24-rc6 reproducible raid5 hang
On Thu, 10 Jan 2008, Neil Brown wrote:

> On Wednesday January 9, [EMAIL PROTECTED] wrote:
> > On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
> > > i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
> > >
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
> > >
> > > which was Neil's change in 2.6.22 for deferring
> > > generic_make_request until there's enough stack space for it.
> >
> > Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack
> > utilization by preventing recursive calls to generic_make_request.
> > However the following conditions can cause raid5 to hang until
> > 'stripe_cache_size' is increased:
>
> Thanks for pursuing this guys.  That explanation certainly sounds
> very credible.
>
> The generic_make_request_immed is a good way to confirm that we have
> found the bug, but I don't like it as a long term solution, as it
> just reintroduced the problem that we were trying to solve with the
> problematic commit.
>
> As you say, we could arrange that all request submission happens in
> raid5d and I think this is the right way to proceed.  However we can
> still take some of the work into the thread that is submitting the
> IO by calling raid5d() at the end of make_request, like this.
>
> Can you test it please?  Does it seem reasonable?
>
> Thanks,
> NeilBrown
>
> Signed-off-by: Neil Brown [EMAIL PROTECTED]

it has passed 11h of the untar/diff/rm linux.tar.gz workload... that's
pretty good evidence it works for me.  thanks!

Tested-by: dean gaudet [EMAIL PROTECTED]

> ### Diffstat output
>  ./drivers/md/md.c    |    2 +-
>  ./drivers/md/raid5.c |    4 +++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff .prev/drivers/md/md.c ./drivers/md/md.c
> --- .prev/drivers/md/md.c	2008-01-07 13:32:10.000000000 +1100
> +++ ./drivers/md/md.c	2008-01-10 11:08:02.000000000 +1100
> @@ -5774,7 +5774,7 @@ void md_check_recovery(mddev_t *mddev)
>
>  	if (mddev->ro)
>  		return;
> -	if (signal_pending(current)) {
> +	if (current == mddev->thread->tsk && signal_pending(current)) {
>  		if (mddev->pers->sync_request) {
>  			printk(KERN_INFO "md: %s in immediate safe mode\n",
>  			       mdname(mddev));
>
> diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
> --- .prev/drivers/md/raid5.c	2008-01-07 13:32:10.000000000 +1100
> +++ ./drivers/md/raid5.c	2008-01-10 11:06:54.000000000 +1100
> @@ -3432,6 +3432,7 @@ static int chunk_aligned_read(struct req
>  	}
>  }
>
> +static void raid5d (mddev_t *mddev);
>
>  static int make_request(struct request_queue *q, struct bio * bi)
>  {
> @@ -3547,7 +3548,7 @@ static int make_request(struct request_q
>  				goto retry;
>  			}
>  			finish_wait(&conf->wait_for_overlap, &w);
> -			handle_stripe(sh, NULL);
> +			set_bit(STRIPE_HANDLE, &sh->state);
>  			release_stripe(sh);
>  		} else {
>  			/* cannot get stripe for read-ahead, just give-up */
> @@ -3569,6 +3570,7 @@ static int make_request(struct request_q
>  			      test_bit(BIO_UPTODATE, &bi->bi_flags)
>  			        ? 0 : -EIO);
>  	}
> +	raid5d(mddev);
>  	return 0;
>  }
Re: 2.6.24-rc6 reproducible raid5 hang
On Fri, 11 Jan 2008, Neil Brown wrote:

> Thanks.
> But I suspect you didn't test it with a bitmap :-)
> I ran the mdadm test suite and it hit a problem - easy enough to fix.

damn -- i lost my bitmap 'cause it was external and i didn't have things
set up properly to pick it up after a reboot :)

if you send an updated patch i'll give it another spin...

-dean
Re: Raid 1, can't get the second disk added back in.
On Tue, 8 Jan 2008, Bill Davidsen wrote:

> Neil Brown wrote:
> > On Monday January 7, [EMAIL PROTECTED] wrote:
> >
> > Problem is not raid, or at least not obviously raid related.  The
> > problem is that the whole disk, /dev/hdb is unavailable.
> >
> > Maybe check /sys/block/hdb/holders ?
> >
> > lsof /dev/hdb ?
> >
> > good luck :-)
>
> losetup -a may help, lsof doesn't seem to show files used in loop
> mounts.  Yes, long shot...

and don't forget dmsetup ls... (followed immediately by apt-get remove
evms if you're on an unfortunate version of ubuntu which helpfully
installed that partition-stealing service for you.)

-dean
Re: 2.6.24-rc6 reproducible raid5 hang
On Sat, 29 Dec 2007, Dan Williams wrote:

> On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote:
> > On Sat, 29 Dec 2007, Dan Williams wrote:
> > > On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:
> > > > hmm bummer, i'm doing another test (rsync 3.5M inodes from
> > > > another box) on the same 64k chunk array and had raised the
> > > > stripe_cache_size to 1024... and got a hang.  this time i
> > > > grabbed stripe_cache_active before bumping the size again -- it
> > > > was only 905 active.  as i recall the bug we were debugging a
> > > > year+ ago the active was at the size when it would hang.  so
> > > > this is probably something new.
> > >
> > > I believe I am seeing the same issue and am trying to track down
> > > whether XFS is doing something unexpected, i.e. I have not been
> > > able to reproduce the problem with EXT3.  MD tries to increase
> > > throughput by letting some stripe work build up in batches.  It
> > > looks like every time your system has hung it has been in the
> > > 'inactive_blocked' state i.e. 3/4 of stripes active.  This state
> > > should automatically clear...
> >
> > cool, glad you can reproduce it :)
> >
> > i have a bit more data... i'm seeing the same problem on debian's
> > 2.6.22-3-amd64 kernel, so it's not new in 2.6.24.
>
> This is just brainstorming at this point, but it looks like xfs can
> submit more requests in the bi_end_io path such that it can lock
> itself out of the RAID array.  The sequence that concerns me is:
>
> return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->
>   xfs_buf_iorequest->make_request->hang
>
> I need to verify whether this path is actually triggering, but if we
> are in an inactive_blocked condition this new request will be put on
> a wait queue and we'll never get to the release_stripe() call after
> return_io().  It would be interesting to see if this is new XFS
> behavior in recent kernels.

i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1

which was Neil's change in 2.6.22 for deferring generic_make_request
until there's enough stack space for it.

with my git tree sync'd to that commit my test cases fail in under 20
minutes uptime (i rebooted and tested 3x).  sync'd to the commit previous
to it i've got 8h of run-time now without the problem.

this isn't definitive of course since it does seem to be timing
dependent, but since all failures have occurred much earlier than that
for me so far i think this indicates this change is either the cause of
the problem or exacerbates an existing raid5 problem.

given that this problem looks like a very rare problem i saw with 2.6.18
(raid5+xfs there too) i'm thinking Neil's commit may just exacerbate an
existing problem... not that i have evidence either way.

i've attached a new kernel log with a hang at d89d87965d... and the
reduced config file i was using for the bisect.  hopefully the hang
looks the same as what we were seeing at 2.6.24-rc6.  let me know.

-dean

[attachment: kern.log.d89d87965d.bz2]
[attachment: config-2.6.21-b1.bz2]
Re: [patch] improve stripe_cache_size documentation
On Sun, 30 Dec 2007, Thiemo Nagel wrote:

> >   stripe_cache_size  (currently raid5 only)
>
> As far as I have understood, it applies to raid6, too.

good point... and raid4.  here's an updated patch.

-dean

Signed-off-by: dean gaudet [EMAIL PROTECTED]

Index: linux/Documentation/md.txt
===================================================================
--- linux.orig/Documentation/md.txt	2007-12-29 13:01:25.000000000 -0800
+++ linux/Documentation/md.txt	2007-12-30 10:16:58.000000000 -0800
@@ -435,8 +435,14 @@
 
 These currently include
 
-  stripe_cache_size  (currently raid5 only)
+  stripe_cache_size  (raid4, raid5 and raid6)
       number of entries in the stripe cache.  This is writable, but
       there are upper and lower limits (32768, 16).  Default is 128.
-  strip_cache_active (currently raid5 only)
+
+      The stripe cache memory is locked down and not available for other uses.
+      The total size of the stripe cache is determined by this formula:
+
+           PAGE_SIZE * raid_disks * stripe_cache_size
+
+  strip_cache_active (raid4, raid5 and raid6)
       number of active entries in the stripe cache
Re: [patch] improve stripe_cache_size documentation
On Sun, 30 Dec 2007, dean gaudet wrote:

> On Sun, 30 Dec 2007, Thiemo Nagel wrote:
>
> > >   stripe_cache_size  (currently raid5 only)
> >
> > As far as I have understood, it applies to raid6, too.
>
> good point... and raid4.  here's an updated patch.

and once again with a typo fix.  oops.

-dean

Signed-off-by: dean gaudet [EMAIL PROTECTED]

Index: linux/Documentation/md.txt
===================================================================
--- linux.orig/Documentation/md.txt	2007-12-29 13:01:25.000000000 -0800
+++ linux/Documentation/md.txt	2007-12-30 14:30:40.000000000 -0800
@@ -435,8 +435,14 @@
 
 These currently include
 
-  stripe_cache_size  (currently raid5 only)
+  stripe_cache_size  (raid4, raid5 and raid6)
       number of entries in the stripe cache.  This is writable, but
       there are upper and lower limits (32768, 16).  Default is 128.
-  strip_cache_active (currently raid5 only)
+
+      The stripe cache memory is locked down and not available for other uses.
+      The total size of the stripe cache is determined by this formula:
+
+           PAGE_SIZE * raid_disks * stripe_cache_size
+
+  stripe_cache_active (raid4, raid5 and raid6)
       number of active entries in the stripe cache
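[For a rough sense of scale, a worked example of the formula above; the
page size and disk count here are assumptions, not measurements:

    # 4 KiB pages, 8-member array, default stripe_cache_size of 128:
    #   4096 * 8 * 128  = 4 MiB locked down
    # the same array raised to 1024 entries:
    #   4096 * 8 * 1024 = 32 MiB locked down
    echo 1024 > /sys/block/mdX/md/stripe_cache_size
]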
Re: 2.6.24-rc6 reproducible raid5 hang
hmm bummer, i'm doing another test (rsync 3.5M inodes from another box)
on the same 64k chunk array and had raised the stripe_cache_size to
1024... and got a hang.  this time i grabbed stripe_cache_active before
bumping the size again -- it was only 905 active.  as i recall the bug
we were debugging a year+ ago the active was at the size when it would
hang.  so this is probably something new.

anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able
to hit that limit too if i try harder :)

btw what units are stripe_cache_size/active in?  is the memory consumed
equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size *
raid_disks * stripe_cache_active)?

-dean

On Thu, 27 Dec 2007, dean gaudet wrote:

> hmm this seems more serious... i just ran into it with chunksize 64KiB
> and while just untarring a bunch of linux kernels in parallel...
> increasing stripe_cache_size did the trick again.
>
> -dean
>
> On Thu, 27 Dec 2007, dean gaudet wrote:
>
> > hey neil -- remember that raid5 hang which me and only one or two
> > others ever experienced and which was hard to reproduce?  we were
> > debugging it well over a year ago (that box has 400+ day uptime now
> > so at least that long ago :)  the workaround was to increase
> > stripe_cache_size...
> >
> > i seem to have a way to reproduce something which looks much the
> > same.
> >
> > setup:
> >
> > - 2.6.24-rc6
> > - system has 8GiB RAM but no swap
> > - 8x750GB in a raid5 with one spare, chunksize 1024KiB.
> > - mkfs.xfs default options
> > - mount -o noatime
> > - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
> >
> > that sequence hangs for me within 10 seconds... and i can unhang /
> > rehang it by toggling between stripe_cache_size 256 and 1024.  i
> > detect the hang by watching iostat -kx /dev/sd? 5.
> >
> > i've attached the kernel log where i dumped task and timer state
> > while it was hung... note that you'll see at some point i did an xfs
> > mount with external journal but it happens with internal journal as
> > well.
> >
> > looks like it's using the raid456 module and async api.
> >
> > anyhow let me know if you need more info / have any suggestions.
> >
> > -dean
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Tue, 25 Dec 2007, Bill Davidsen wrote:

> The issue I'm thinking about is hardware sector size, which on modern
> drives may be larger than 512b and therefore entail a
> read-alter-rewrite (RAR) cycle when writing a 512b block.

i'm not sure any shipping SATA disks have larger than 512B sectors
yet... do you know of any?  (or is this thread about SCSI which i don't
pay attention to...)

on a brand new WDC WD7500AAKS-00RBA0 with this partition layout:

    255 heads, 63 sectors/track, 91201 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes

so sda1 starts at a non-multiple of 4096 into the disk.

i ran some random seek+write experiments using
http://arctic.org/~dean/randomio/, here are the results using 512 byte
and 4096 byte writes (fsync after each write), 8 threads, on sda1:

# ./randomio /dev/sda1 8 1 1 512 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  148.5 |    0.0   inf    nan    0.0    nan |  148.5   0.2   53.7   89.3   19.5
  129.2 |    0.0   inf    nan    0.0    nan |  129.2  37.2   61.9   96.7    9.3
  131.2 |    0.0   inf    nan    0.0    nan |  131.2  40.3   61.0   90.4    9.3
  132.0 |    0.0   inf    nan    0.0    nan |  132.0  39.6   60.6   89.3    9.1
  130.7 |    0.0   inf    nan    0.0    nan |  130.7  39.8   61.3   98.1    8.9
  131.4 |    0.0   inf    nan    0.0    nan |  131.4  40.0   60.8  101.0    9.6

# ./randomio /dev/sda1 8 1 1 4096 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  141.7 |    0.0   inf    nan    0.0    nan |  141.7   0.3   56.3   99.3   21.1
  132.4 |    0.0   inf    nan    0.0    nan |  132.4  43.3   60.4   91.8    8.5
  131.6 |    0.0   inf    nan    0.0    nan |  131.6  41.4   60.9  111.0    9.6
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  41.4   60.7   85.3    8.6
  130.6 |    0.0   inf    nan    0.0    nan |  130.6  41.7   61.3   95.0    9.4
  131.4 |    0.0   inf    nan    0.0    nan |  131.4  42.2   60.8   90.5    8.4

i think the anomalous results in the first 10s samples are perhaps the
drive coming out of a standby state.

and here are the results aligned using the sda raw device itself:

# ./randomio /dev/sda 8 1 1 512 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  147.3 |    0.0   inf    nan    0.0    nan |  147.3   0.3   54.1   93.7   20.1
  132.4 |    0.0   inf    nan    0.0    nan |  132.4  37.4   60.6   91.8    9.2
  132.5 |    0.0   inf    nan    0.0    nan |  132.5  37.7   60.3   93.7    9.3
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  39.4   60.7   92.7    9.0
  133.9 |    0.0   inf    nan    0.0    nan |  133.9  41.7   59.8   90.7    8.5
  130.2 |    0.0   inf    nan    0.0    nan |  130.2  40.8   61.5   88.6    8.9

# ./randomio /dev/sda 8 1 1 4096 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  145.4 |    0.0   inf    nan    0.0    nan |  145.4   0.3   54.9   94.0   20.1
  130.3 |    0.0   inf    nan    0.0    nan |  130.3  36.0   61.4   92.7    9.6
  130.6 |    0.0   inf    nan    0.0    nan |  130.6  38.2   61.2   96.7    9.2
  132.1 |    0.0   inf    nan    0.0    nan |  132.1  39.0   60.5   93.5    9.2
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  43.1   60.8   93.8    9.1
  129.0 |    0.0   inf    nan    0.0    nan |  129.0  40.2   62.0   96.4    8.8

it looks pretty much the same to me...

-dean
Re: 2.6.24-rc6 reproducible raid5 hang
On Sat, 29 Dec 2007, Dan Williams wrote:

> On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:
> > hmm bummer, i'm doing another test (rsync 3.5M inodes from another
> > box) on the same 64k chunk array and had raised the
> > stripe_cache_size to 1024... and got a hang.  this time i grabbed
> > stripe_cache_active before bumping the size again -- it was only 905
> > active.  as i recall the bug we were debugging a year+ ago the
> > active was at the size when it would hang.  so this is probably
> > something new.
>
> I believe I am seeing the same issue and am trying to track down
> whether XFS is doing something unexpected, i.e. I have not been able
> to reproduce the problem with EXT3.  MD tries to increase throughput
> by letting some stripe work build up in batches.  It looks like every
> time your system has hung it has been in the 'inactive_blocked' state
> i.e. 3/4 of stripes active.  This state should automatically clear...

cool, glad you can reproduce it :)

i have a bit more data... i'm seeing the same problem on debian's
2.6.22-3-amd64 kernel, so it's not new in 2.6.24.

i'm doing some more isolation but just grabbing kernels i have
precompiled so far -- a 2.6.19.7 kernel doesn't show the problem, and
early indications are a 2.6.21.7 kernel also doesn't have the problem
but i'm giving it longer to show its head.  i'll try a stock 2.6.22 next
depending on how the 2.6.21 test goes, just so we get the debian patches
out of the way.

i was tempted to blame async api because it's newish :) but according to
the dmesg output it doesn't appear the 2.6.22-3-amd64 kernel used async
API, and it still hung, so async is probably not to blame.

anyhow the test case i'm using is the dma_thrasher script i attached...
it takes about an hour to give me confidence there's no problems so this
will take a while.

-dean
[patch] improve stripe_cache_size documentation
Document the amount of memory used by the stripe cache and the fact that
it's tied down and unavailable for other purposes (right?).  thanks to
Dan Williams for the formula.

-dean

Signed-off-by: dean gaudet [EMAIL PROTECTED]

Index: linux/Documentation/md.txt
===================================================================
--- linux.orig/Documentation/md.txt	2007-12-29 13:01:25.000000000 -0800
+++ linux/Documentation/md.txt	2007-12-29 13:04:17.000000000 -0800
@@ -438,5 +438,11 @@
   stripe_cache_size  (currently raid5 only)
       number of entries in the stripe cache.  This is writable, but
       there are upper and lower limits (32768, 16).  Default is 128.
+
+      The stripe cache memory is locked down and not available for other uses.
+      The total size of the stripe cache is determined by this formula:
+
+           PAGE_SIZE * raid_disks * stripe_cache_size
+
   strip_cache_active (currently raid5 only)
       number of active entries in the stripe cache
Re: 2.6.24-rc6 reproducible raid5 hang
On Sat, 29 Dec 2007, Justin Piszcz wrote:

> Curious btw what kind of filesystem size/raid type (5, but defaults I
> assume, nothing special right? (right-symmetric vs. left-symmetric,
> etc?)/cache size/chunk size(s) are you using/testing with?

mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
mkfs.xfs -f /dev/md2

otherwise defaults

> The script you sent out earlier, you are able to reproduce it easily
> with 31 or so kernel tar decompressions?

not sure, the point of the script is to untar more than there is RAM.

it happened with a single rsync running though -- 3.5M inodes from a
remote box.

it also happens with the single 10GB dd write... although i've been
using the tar method for testing different kernel revs.

-dean
Re: 2.6.24-rc6 reproducible raid5 hang
On Sat, 29 Dec 2007, dean gaudet wrote:

> On Sat, 29 Dec 2007, Justin Piszcz wrote:
>
> > Curious btw what kind of filesystem size/raid type (5, but defaults
> > I assume, nothing special right? (right-symmetric vs.
> > left-symmetric, etc?)/cache size/chunk size(s) are you using/testing
> > with?
>
> mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
> mkfs.xfs -f /dev/md2
>
> otherwise defaults

hmm i missed a few things, here's exactly how i created the array:

mdadm --create --level=5 --chunk=64 -n7 -x1 --assume-clean /dev/md2 /dev/sd[a-h]1

it's reassembled automagically each reboot, but i do this each reboot:

mkfs.xfs -f /dev/md2
mount -o noatime /dev/md2 /mnt/new
./dma_thrasher linux.tar.gz /mnt/new

the --assume-clean and noatime probably make no difference though...

on the bisection front it looks like it's new behaviour between 2.6.21.7
and 2.6.22.15 (stock kernels now, not debian).  i've got to step out for
a while, but i'll go at it again later, probably with git bisect unless
someone has some cherry picked changes to suggest.

-dean
Re: 2.6.24-rc6 reproducible raid5 hang
hmm this seems more serious... i just ran into it with chunksize 64KiB
and while just untarring a bunch of linux kernels in parallel...
increasing stripe_cache_size did the trick again.

-dean

On Thu, 27 Dec 2007, dean gaudet wrote:

> hey neil -- remember that raid5 hang which me and only one or two
> others ever experienced and which was hard to reproduce?  we were
> debugging it well over a year ago (that box has 400+ day uptime now so
> at least that long ago :)  the workaround was to increase
> stripe_cache_size...
>
> i seem to have a way to reproduce something which looks much the same.
>
> setup:
>
> - 2.6.24-rc6
> - system has 8GiB RAM but no swap
> - 8x750GB in a raid5 with one spare, chunksize 1024KiB.
> - mkfs.xfs default options
> - mount -o noatime
> - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
>
> that sequence hangs for me within 10 seconds... and i can unhang /
> rehang it by toggling between stripe_cache_size 256 and 1024.  i
> detect the hang by watching iostat -kx /dev/sd? 5.
>
> i've attached the kernel log where i dumped task and timer state while
> it was hung... note that you'll see at some point i did an xfs mount
> with external journal but it happens with internal journal as well.
>
> looks like it's using the raid456 module and async api.
>
> anyhow let me know if you need more info / have any suggestions.
>
> -dean
Re: 2.6.24-rc6 reproducible raid5 hang
On Thu, 27 Dec 2007, Justin Piszcz wrote:

> With that high of a stripe size the stripe_cache_size needs to be
> greater than the default to handle it.

i'd argue that any deadlock is a bug... regardless i'm still seeing
deadlocks with the default chunk_size of 64k and stripe_cache_size of
256... in this case it's with a workload which is untarring 34 copies of
the linux kernel at the same time.  it's a variant of doug ledford's
memtest, and i've attached it.

-dean

#!/usr/bin/perl

# Copyright (c) 2007 dean gaudet [EMAIL PROTECTED]
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
# OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
# ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
# OTHER DEALINGS IN THE SOFTWARE.

# this idea shamelessly stolen from doug ledford

use warnings;
use strict;

# ensure stdout is not buffered
select(STDOUT);
$| = 1;

my $usage = "usage: $0 linux.tar.gz /path1 [/path2 ...]\n";

defined(my $tarball = shift) or die $usage;
-f $tarball or die "$tarball does not exist or is not a file\n";
my @paths = @ARGV;
$#paths >= 0 or die $usage;

# determine size of uncompressed tarball
open(GZIP, "-|") || exec "gzip", "--quiet", "--list", $tarball;
my $line = <GZIP>;
my ($tarball_size) = $line =~ m#^\s*\d+\s*(\d+)#;
defined($tarball_size)
    or die "unexpected result from gzip --quiet --list $tarball\n";
close(GZIP);

# determine amount of memory
open(MEMINFO, "/proc/meminfo")
    or die "unable to open /proc/meminfo for read: $!\n";
my $total_mem;
while (<MEMINFO>) {
    if (/^MemTotal:\s*(\d+)\s*kB/) {
        $total_mem = $1;
        last;
    }
}
defined($total_mem) or die "did not find MemTotal line in /proc/meminfo\n";
close(MEMINFO);
$total_mem *= 1024;

print "total memory: $total_mem\n";
print "uncompressed tarball: $tarball_size\n";

my $nr_simultaneous = int(1.2 * $total_mem / $tarball_size);
print "nr simultaneous processes: $nr_simultaneous\n";

sub system_or_die {
    my @args = @_;
    system(@args);
    if ($? == -1) {
        my $msg = sprintf("%s failed to exec %s: $!\n",
            scalar(localtime), $args[0]);
        die $msg;
    }
    elsif ($? & 127) {
        my $msg = sprintf("%s %s died with signal %d, %s coredump\n",
            scalar(localtime), $args[0], ($? & 127),
            ($? & 128) ? "with" : "without");
        die $msg;
    }
    elsif (($? >> 8) != 0) {
        my $msg = sprintf("%s %s exited with non-zero exit code %d\n",
            scalar(localtime), $args[0], $? >> 8);
        die $msg;
    }
}

sub untar($) {
    mkdir($_[0]) or die localtime()." unable to mkdir($_[0]): $!\n";
    system_or_die("tar", "-xzf", $tarball, "-C", $_[0]);
}

print localtime()." untarring golden copy\n";
my $golden = $paths[0]."/dma_tmp.$$.gold";
untar($golden);

my $pass_no = 0;
while (1) {
    print localtime()." pass $pass_no: extracting\n";
    my @outputs;
    foreach my $n (1..$nr_simultaneous) {
        # treat paths in a round-robin manner
        my $dir = shift(@paths);
        push(@paths, $dir);
        $dir .= "/dma_tmp.$$.$n";
        push(@outputs, $dir);
        my $pid = fork;
        defined($pid) or die localtime()." unable to fork: $!\n";
        if ($pid == 0) {
            untar($dir);
            exit(0);
        }
    }
    # wait for the children
    while (wait != -1) {}

    print localtime()." pass $pass_no: diffing\n";
    foreach my $dir (@outputs) {
        my $pid = fork;
        defined($pid) or die localtime()." unable to fork: $!\n";
        if ($pid == 0) {
            system_or_die("diff", "-U", "3", "-rN", $golden, $dir);
            system_or_die("rm", "-fr", $dir);
            exit(0);
        }
    }
    # wait for the children
    while (wait != -1) {}

    ++$pass_no;
}
Re: external bitmaps.. and more
On Thu, 6 Dec 2007, Michael Tokarev wrote:

> I come across a situation where external MD bitmaps aren't usable on
> any standard linux distribution unless special (non-trivial) actions
> are taken.
>
> First is a small buglet in mdadm, or two.  It's not possible to
> specify --bitmap= in assemble command line - the option seems to be
> ignored.  But it's honored when specified in config file.

i think neil fixed this at some point -- i ran into it / reported
essentially the same problems here a while ago.

> The thing is that when a external bitmap is being used for an array,
> and that bitmap resides on another filesystem, all common
> distributions fails to start/mount and to shutdown/umount
> arrays/filesystems properly, because all starts/stops is done in one
> script, and all mounts/umounts in another, but for bitmaps to work the
> two should be intermixed with each other.

so i've got a debian unstable box which has uptime 402 days (to give you
an idea how long ago i last tested the reboot sequence).  it has raid1
root and raid5 /home.  /home has an external bitmap on the root
partition.

i have /etc/default/mdadm set with INITRDSTART to start only the root
raid1 during initrd... this manages to work out later when the external
bitmap is required.

but it is fragile... and i think it's only possible to get things to
work with an initrd and the external bitmap on the root fs or by having
custom initrd and/or rc.d scripts.

-dean
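[For what it's worth, a minimal sketch of the config-file approach that
does get honored; the device names, UUIDs and bitmap path are
hypothetical, and the bitmap file must live on a filesystem that is
mounted before the array is assembled:

    # /etc/mdadm/mdadm.conf
    DEVICE partitions
    ARRAY /dev/md0 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
    ARRAY /dev/md1 UUID=yyyyyyyy:yyyyyyyy:yyyyyyyy:yyyyyyyy bitmap=/md1-bitmap
]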
Re: Raid array is not automatically detected.
On Mon, 16 Jul 2007, David Greaves wrote:

> Bryan Christ wrote:
> > I do have the type set to 0xfd.  Others have said that auto-assemble
> > only works on RAID 0 and 1, but just as Justin mentioned, I too have
> > another box with RAID5 that gets auto assembled by the kernel (also
> > no initrd).  I expected the same behavior when I built this
> > array--again using mdadm instead of raidtools.
>
> Any md arrays with partition type 0xfd using a 0.9 superblock should
> be auto-assembled by a standard kernel.

no... debian (and probably ubuntu) do not build md into the kernel, they
build it as a module, and the module does not auto-detect 0xfd.

i don't know anything about slackware, but i just felt it worth
commenting that "a standard kernel" is not really descriptive enough.

-dean
Re: XFS sunit/swidth for raid10
On Fri, 23 Mar 2007, Peter Rabbitson wrote:

> dean gaudet wrote:
> > On Thu, 22 Mar 2007, Peter Rabbitson wrote:
> > > dean gaudet wrote:
> > > > On Thu, 22 Mar 2007, Peter Rabbitson wrote:
> > > > > Hi,
> > > > > How does one determine the XFS sunit and swidth sizes for a
> > > > > software raid10 with 3 copies?
> > > >
> > > > mkfs.xfs uses the GET_ARRAY_INFO ioctl to get the data it needs
> > > > from software raid and select an appropriate sunit/swidth...
> > > > although i'm not sure i agree entirely with its choice for
> > > > raid10:
> > >
> > > So do I, especially as it makes no checks for the amount of copies
> > > (3 in my case, not 2).
> >
> > it probably doesn't matter.
>
> This was essentially my question.  For an array -pf3 -c1024 I get
> swidth = 4 * sunit = 4MiB.  Is it about right and does it matter at
> all?
>
> > how many drives?
>
> Sorry.  4 drives, 3 far copies (so any 2 drives can fail), 1M chunk.

my mind continues to be blown by linux raid10.  so that's like raid1 on
4 disks except the copies are offset by 1/4th of the disk?

i think swidth = 4*sunit is the right config then -- 'cause a read of
4MiB will stride all 4 disks...

-dean
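[For the record, sunit/swidth can also be forced by hand at mkfs time
rather than trusting the ioctl; a minimal sketch for the layout
discussed above (device name hypothetical):

    # 1024KiB chunk, stripe width spanning all 4 drives
    mkfs.xfs -d su=1024k,sw=4 /dev/md0
]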
Re: XFS sunit/swidth for raid10
On Thu, 22 Mar 2007, Peter Rabbitson wrote:

> dean gaudet wrote:
> > On Thu, 22 Mar 2007, Peter Rabbitson wrote:
> > > Hi,
> > > How does one determine the XFS sunit and swidth sizes for a
> > > software raid10 with 3 copies?
> >
> > mkfs.xfs uses the GET_ARRAY_INFO ioctl to get the data it needs from
> > software raid and select an appropriate sunit/swidth... although i'm
> > not sure i agree entirely with its choice for raid10:
>
> So do I, especially as it makes no checks for the amount of copies (3
> in my case, not 2).
>
> > it probably doesn't matter.
>
> This was essentially my question.  For an array -pf3 -c1024 I get
> swidth = 4 * sunit = 4MiB.  Is it about right and does it matter at
> all?

how many drives?

-dean
Re: mdadm: raid1 with ext3 - filesystem size differs?
it looks like you created the filesystem on the component device before
creating the raid.  (see the sketch after the quoted report below.)

-dean

On Fri, 16 Mar 2007, Hanno Meyer-Thurow wrote:

> Hi all!
> Please CC me on answers since I am not subscribed to this list,
> thanks.
>
> When I try to build a raid1 system with mdadm 2.6.1 the filesystem
> size recorded in superblock differs from physical size of device.
>
> System:
>
> ana ~ # uname -a
> Linux ana 2.6.20-gentoo-r2 #4 SMP PREEMPT Sat Mar 10 16:25:46 CET 2007
> x86_64 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux
>
> ana ~ # mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
> mdadm: /dev/sda1 appears to contain an ext2fs file system
>     size=48152K  mtime=Thu Mar 15 17:27:07 2007
> mdadm: /dev/sda1 appears to be part of a raid array:
>     level=raid1 devices=2 ctime=Thu Mar 15 17:25:52 2007
> mdadm: /dev/sdb1 appears to contain an ext2fs file system
>     size=48152K  mtime=Thu Mar 15 17:27:07 2007
> mdadm: /dev/sdb1 appears to be part of a raid array:
>     level=raid1 devices=2 ctime=Thu Mar 15 17:25:52 2007
> Continue creating array? y
> mdadm: array /dev/md1 started.
>
> ana ~ # cat /proc/mdstat
> md1 : active raid1 sdb1[1] sda1[0]
>       48064 blocks [2/2] [UU]
>
> ana ~ # mdadm --misc --detail /dev/md1
> /dev/md1:
>         Version : 00.90.03
>   Creation Time : Thu Mar 15 17:37:35 2007
>      Raid Level : raid1
>      Array Size : 48064 (46.95 MiB 49.22 MB)
>   Used Dev Size : 48064 (46.95 MiB 49.22 MB)
>    Raid Devices : 2
>   Total Devices : 2
> Preferred Minor : 1
>     Persistence : Superblock is persistent
>
>     Update Time : Thu Mar 15 17:38:27 2007
>           State : clean
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 0
>
>            UUID : cf0478ee:7e60a40e:20a5e204:cc7bc2c9
>          Events : 0.4
>
>     Number   Major   Minor   RaidDevice State
>        0       8        1        0      active sync   /dev/sda1
>        1       8       17        1      active sync   /dev/sdb1
>
> ana ~ # LC_ALL=C fsck.ext3 /dev/md1
> e2fsck 1.39 (29-May-2006)
> The filesystem size (according to the superblock) is 48152 blocks
> The physical size of the device is 48064 blocks
> Either the superblock or the partition table is likely to be corrupt!
> Abort<y>? yes
>
> Any ideas what could be wrong?
> Thank you in advance for help!
>
> Regards,
> Hanno
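[A minimal sketch of the two ways out, both assuming /dev/md1 as above;
the mke2fs route destroys any data, and the resize2fs route is an
untested suggestion:

    # simplest, since the array is brand new -- make the fs on the md device:
    mke2fs -j /dev/md1

    # or, to keep existing data, shrink the fs to the md device size
    # (48064 1KiB blocks, taken from the fsck output above):
    e2fsck -f /dev/md1
    resize2fs /dev/md1 48064
]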
Re: Replace drive in RAID5 without losing redundancy?
On Tue, 6 Mar 2007, Neil Brown wrote:

> On Monday March 5, [EMAIL PROTECTED] wrote:
> > Is it possible to mark a disk as to be replaced by an existing
> > spare, then migrate to the spare disk and kick the old disk _after_
> > migration has been done?  Or not even kick - but mark as new spare.
>
> No, this is not possible yet.
>
> You can get nearly all the way there by:
>  - add an internal bitmap.
>  - fail one drive
>  - --build a raid1 with that drive (and the other missing)
>  - re-add the raid1 into the raid5
>  - add the new drive to the raid1
>  - wait for resync

i have an example at
http://arctic.org/~dean/proactive-raid5-disk-replacement.txt ... plus
discussion as to why this isn't the best solution.

-dean
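[For concreteness, a minimal sketch of Neil's sequence; the device names
are hypothetical (sdc1 is the disk being replaced, sdd1 the
replacement), and whether --build accepts a "missing" slot depends on
the mdadm version:

    mdadm --grow /dev/md0 --bitmap=internal
    mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
    mdadm --build /dev/md9 --level=1 --raid-devices=2 /dev/sdc1 missing
    mdadm /dev/md0 --re-add /dev/md9      # bitmap keeps this resync short
    mdadm /dev/md9 --add /dev/sdd1        # copies old -> new inside the raid1
]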
Re: Linux Software RAID Bitmap Question
On Mon, 26 Feb 2007, Neil Brown wrote:

> On Sunday February 25, [EMAIL PROTECTED] wrote:
> > I believe Neil stated that using bitmaps does incur a 10%
> > performance penalty.  If one's box never (or rarely) crashes, is a
> > bitmap needed?
>
> I think I said it can incur such a penalty.  The actual cost is very
> dependent on work-load.

i did a crude benchmark recently... to get some data for a common setup
i use (external journals and bitmaps on raid1, xfs fs on raid5).
emphasis on crude:

time sh -c 'tar xf /var/tmp/linux-2.6.20.tar; sync'

xfs journal    raid5 bitmap    times
internal       none            0.18s user 2.14s system 2% cpu 1:27.95 total
internal       internal        0.16s user 2.16s system 1% cpu 2:01.12 total
raid1          none            0.07s user 2.02s system 2% cpu 1:20.62 total
raid1          internal        0.14s user 2.01s system 1% cpu 1:55.18 total
raid1          raid1           0.14s user 2.03s system 2% cpu 1:20.61 total

raid5:
- 4x seagate 7200.10 400GB on marvell MV88SX6081
- mdadm --create --level=5 --raid-devices=4 /dev/md4 /dev/sd[abcd]1

raid1:
- 2x maxtor 6Y200P0 on 3ware 7504
- two 128MiB partitions starting at cyl 1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md1 /dev/sd[fg]1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md2 /dev/sd[fg]2
- md1 is used for external xfs journal
- md2 has an ext3 filesystem for the external md4 bitmap

xfs:
- mkfs.xfs issued before each run using the defaults (aside from -l logdev=/dev/md1)
- mount -o noatime,nodiratime[,logdev=/dev/md1]

system:
- dual opteron 848 (2.2ghz), 8GiB ddr 266
- tyan s2882
- 2.6.20

-dean
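[For anyone reproducing this, the bitmap variants can be toggled with
--grow between runs; a sketch, with hypothetical paths:

    mdadm --grow /dev/md4 --bitmap=internal              # internal bitmap run
    mdadm --grow /dev/md4 --bitmap=none                  # back to no bitmap
    mdadm --grow /dev/md4 --bitmap=/mnt/md2/md4-bitmap   # external, on the raid1
]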
Re: Reshaping raid0/10
On Thu, 22 Feb 2007, Neil Brown wrote:

> On Wednesday February 21, [EMAIL PROTECTED] wrote:
> > Hello, are there any plans to support reshaping on raid0 and raid10?
>
> No concrete plans.  It largely depends on time and motivation.  I
> expect that the various flavours of raid5/raid6 reshape will come
> first.  Then probably converting raid0->raid5.  I really haven't given
> any thought to how you might reshape a raid10...

i've got a 4x250 near2 i want to turn into a 4x750 near2.  i was
considering doing straight dd from each of the 250 to the respective 750
then doing an mdadm --create on the 750s (in the same ordering as the
original array)... so i'd end up with a new array with more stripes.  it
seems like this should work.

the same thing should work for all nearN with a multiple of N disks...
and offsetN should work as well right?  but farN sounds like a
nightmare.

if we had a generic proactive disk replacement method it could handle
the 4x250->4x750 step.  (i haven't decided yet if i want to try my hacky
bitmap method of doing proactive replacement... i'm not sure what'll
happen if i add a 750GB disk back into an array with 250s... i suppose
it'll work... i'll have to experiment.)

-dean
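[A minimal sketch of the dd-then-recreate idea; all device names are
hypothetical, --chunk is a placeholder, and the layout, chunk size and
disk ordering must match the original array exactly:

    # copy each 250GB member to its 750GB replacement
    dd if=/dev/sda of=/dev/sde bs=1M
    dd if=/dev/sdb of=/dev/sdf bs=1M
    dd if=/dev/sdc of=/dev/sdg bs=1M
    dd if=/dev/sdd of=/dev/sdh bs=1M

    # recreate with the same layout/chunk/order; the bigger members give
    # more stripes, then grow the filesystem into them
    mdadm --create /dev/md0 --assume-clean --level=10 --layout=n2 \
        --chunk=64 --raid-devices=4 /dev/sd[efgh]
]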
Re: md autodetect only detects one disk in raid1
take a look at your mdadm.conf ... both on your root fs and in your
initrd... look for a DEVICE line and make sure it says "DEVICE
partitions"... anything else is likely to cause problems like below.
also make sure each array is specified by UUID rather than device.  and
then rebuild your initrd (dpkg-reconfigure linux-image-`uname -r` on
debuntu).  (see the example after the quoted report below.)

that "something else in the system claim use of the device" problem
makes me guess you're on ubuntu pre-edgy... where for whatever reason
they included evms in the default install and for whatever inane reason
evms steals every damn device in the system when it starts up.
uninstall/deactivate evms if you're not using it.

-dean

On Sat, 27 Jan 2007, kenneth johansson wrote:

> I run raid1 on my root partition /dev/md0.  Now I had a bad disk so I
> had to replace it but did not notice until I got home that I got a
> SATA instead of a PATA.  Since I had a free sata interface I just put
> it in that.
>
> I had no problem adding the disk to the raid1 device, that is until I
> rebooted the computer.  both the PATA disk and the SATA disk are
> detected before md start up the raid but only the PATA disk is
> activated.  So the raid device is always booting in degraded mode.
> since this is the root disk I use the autodetect feature with
> partition type fd.
>
> Also Something else in the system claim use of the device since I can
> not add the SATA disk after the system has done a complete boot.  I
> guess it has something to do with device mapper and LVM that I also
> run on the data disks but I'm not sure.
>
> any tip on what it can be??  If I add the SATA disk to md0 early
> enough in the boot it works but why is it not autodetected ?
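[A quick way to sanity-check the advice above; the device name is
hypothetical, and --detail --brief prints a ready-made ARRAY line with
the UUID:

    grep -E '^(DEVICE|ARRAY)' /etc/mdadm/mdadm.conf
    mdadm --detail --brief /dev/md0          # paste this into mdadm.conf
    dpkg-reconfigure linux-image-`uname -r`  # rebuild the initrd (debian/ubuntu)
]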
Re: bad performance on RAID 5
On Wed, 17 Jan 2007, Sevrin Robstad wrote:

> I'm suffering from bad performance on my RAID5.  a
>
> echo check > /sys/block/md0/md/sync_action
>
> gives a speed at only about 5000K/sec, and HIGH load average:
>
> # uptime
>  20:03:55 up 8 days, 19:55,  1 user,  load average: 11.70, 4.04, 1.52

iostat -kx /dev/sd? 10

... and sum up the total IO... also try increasing sync_speed_min/max.

a loadavg jump like that suggests to me you have other things competing
for the disk at the same time as the check.

-dean
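[A sketch of the knobs mentioned; md0 is hypothetical and the values
(in KB/sec) are illustrative, not recommendations:

    iostat -kx /dev/sd? 10                            # sum IO across members
    echo  50000 > /sys/block/md0/md/sync_speed_min
    echo 200000 > /sys/block/md0/md/sync_speed_max
]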
Re: raid5 software vs hardware: parity calculations?
On Mon, 15 Jan 2007, Robin Bowes wrote:

> I'm running RAID6 instead of RAID5+1 - I've had a couple of instances
> where a drive has failed in a RAID5+1 array and a second has failed
> during the rebuild after the hot-spare had kicked in.

if the failures were read errors without losing the entire disk (the
typical case) then new kernels are much better -- on read error md will
reconstruct the sectors from the other disks and attempt to write it
back.

you can also run monthly checks...

echo check > /sys/block/mdX/md/sync_action

it'll read the entire array (parity included) and correct read errors as
they're discovered.

-dean
Re: raid5 software vs hardware: parity calculations?
On Mon, 15 Jan 2007, berk walker wrote:

> dean gaudet wrote:
> > echo check > /sys/block/mdX/md/sync_action
> >
> > it'll read the entire array (parity included) and correct read
> > errors as they're discovered.
>
> Could I get a pointer as to how I can do this "check" in my FC5 [BLAG]
> system?  I can find no appropriate "check", nor "md" available to me.
>
> It would be a good thing if I were able to find potentially weak
> spots, rewrite them to good, and know that it might be time for a new
> drive.  All of my arrays have drives of approx the same mfg date, so
> the possibility of more than one showing bad at the same time can not
> be ignored.

it should just be:

echo check > /sys/block/mdX/md/sync_action

if you don't have a /sys/block/mdX/md/sync_action file then your kernel
is too old... or you don't have /sys mounted... (or you didn't replace X
with the raid number :)

iirc there were kernel versions which had the sync_action file but
didn't yet support the check action (i think possibly even as recent as
2.6.17 had a small bug initiating one of the sync_actions but i forget
which one).  if you can upgrade to 2.6.18.x it should work.

debian unstable (and i presume etch) will do this for all your arrays
automatically once a month.

-dean
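[A minimal sketch of what such a monthly job boils down to; replace mdX
with the real array name, and note debian's actual script differs in
the details:

    #!/bin/sh
    echo check > /sys/block/mdX/md/sync_action
    # wait for the check to finish, then see how much would be fixed up
    while [ "`cat /sys/block/mdX/md/sync_action`" != "idle" ]; do
        sleep 60
    done
    cat /sys/block/mdX/md/mismatch_cnt
]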
Re: raid5 software vs hardware: parity calculations?
On Mon, 15 Jan 2007, Mr. James W. Laferriere wrote:

> Hello Dean,
>
> On Mon, 15 Jan 2007, dean gaudet wrote:
> ...snip...
> > it should just be:
> >
> > echo check > /sys/block/mdX/md/sync_action
> >
> > if you don't have a /sys/block/mdX/md/sync_action file then your
> > kernel is too old... or you don't have /sys mounted... (or you
> > didn't replace X with the raid number :)
> >
> > iirc there were kernel versions which had the sync_action file but
> > didn't yet support the check action (i think possibly even as recent
> > as 2.6.17 had a small bug initiating one of the sync_actions but i
> > forget which one).  if you can upgrade to 2.6.18.x it should work.
> >
> > debian unstable (and i presume etch) will do this for all your
> > arrays automatically once a month.
> >
> > -dean
>
> Being able to run a 'check' is a good thing (tm).  But without a
> method to acquire statii data back from the check, Seems rather bland.
> Is there a tool/file to poll/... where data statii can be acquired?

i'm not 100% certain what you mean, but i generally just monitor dmesg
for the md read error message (mind you the message pre-2.6.19 or .20
isn't very informative but it's obvious enough).

there is also a file mismatch_cnt in the same directory as sync_action
... the Documentation/md.txt (in 2.6.18) refers to it incorrectly as
mismatch_count... but anyhow why don't i just repaste the relevant
portion of md.txt.

-dean

...

   Active md devices for levels that support data redundancy (1,4,5,6)
   also have

     sync_action
       a text file that can be used to monitor and control the rebuild
       process.  It contains one word which can be one of:
         resync  - redundancy is being recalculated after unclean
                   shutdown or creation
         recover - a hot spare is being built to replace a
                   failed/missing device
         idle    - nothing is happening
         check   - A full check of redundancy was requested and is
                   happening.  This reads all block and checks them.  A
                   repair may also happen for some raid levels.
         repair  - A full check and repair is happening.  This is
                   similar to 'resync', but was requested by the user,
                   and the write-intent bitmap is NOT used to optimise
                   the process.

       This file is writable, and each of the strings that could be
       read are meaningful for writing.

        'idle' will stop an active resync/recovery etc.  There is no
        guarantee that another resync/recovery may not be automatically
        started again, though some event will be needed to trigger
        this.
        'resync' or 'recovery' can be used to restart the corresponding
        operation if it was stopped with 'idle'.
        'check' and 'repair' will start the appropriate process
        providing the current state is 'idle'.

     mismatch_count
       When performing 'check' and 'repair', and possibly when
       performing 'resync', md will count the number of errors that are
       found.  The count in 'mismatch_cnt' is the number of sectors
       that were re-written, or (for 'check') would have been
       re-written.  As most raid levels work in units of pages rather
       than sectors, this my be larger than the number of actual errors
       by a factor of the number of sectors in a page.
Re: raid5 software vs hardware: parity calculations?
On Sat, 13 Jan 2007, Robin Bowes wrote:

> Bill Davidsen wrote:
> > There have been several recent threads on the list regarding
> > software RAID-5 performance.  The reference might be updated to
> > reflect the poor write performance of RAID-5 until/unless
> > significant tuning is done.  Read that as "tuning obscure parameters
> > and throwing a lot of memory into stripe cache".  The reasons for
> > hardware RAID should include "performance of RAID-5 writes is
> > usually much better than software RAID-5 with default tuning".
>
> Could you point me at a source of documentation describing how to
> perform such tuning?
>
> Specifically, I have 8x500GB WD SATA drives on a Supermicro PCI-X
> 8-port SATA card configured as a single RAID6 array (~3TB available
> space)

linux sw raid6 small write performance is bad because it reads the
entire stripe, merges the small write, and writes back the changed
disks.  unlike raid5 where a small write can get away with a partial
stripe read (i.e. the smallest raid5 write will read the target disk,
read the parity, write the target, and write the updated parity)...
afaik this optimization hasn't been implemented in raid6 yet.

depending on your use model you might want to go with raid5+spare.
benchmark if you're not sure.

for raid5/6 i always recommend experimenting with moving your fs journal
to a raid1 device instead (on separate spindles -- such as your root
disks).

if this is for a database or fs requiring lots of small writes then
raid5/6 are generally a mistake... raid10 is the only way to get
performance.  (hw raid5/6 with nvram support can help a bit in this
area, but you just can't beat raid10 if you need lots of writes/s.)

beyond those config choices you'll want to become friendly with
/sys/block and all the myriad of subdirectories and options under there.
in particular:

/sys/block/*/queue/scheduler
/sys/block/*/queue/read_ahead_kb
/sys/block/*/queue/nr_requests
/sys/block/mdX/md/stripe_cache_size

for * = any of the component disks or the mdX itself... some systems
have an /etc/sysfs.conf you can place these settings in to have them
take effect on reboot.  (sysfsutils package on debuntu)

-dean
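[To make those settings survive a reboot via sysfsutils, a sketch; the
values are illustrative starting points, not recommendations, and paths
are relative to /sys:

    # /etc/sysfs.conf
    block/md0/md/stripe_cache_size = 1024
    block/sda/queue/scheduler = deadline
    block/sda/queue/nr_requests = 256
    block/sda/queue/read_ahead_kb = 512
]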
Re: raid5 software vs hardware: parity calculations?
On Thu, 11 Jan 2007, James Ralston wrote:

> I'm having a discussion with a coworker concerning the cost of md's
> raid5 implementation versus hardware raid5 implementations.
>
> Specifically, he states:
>
> > The performance [of raid5 in hardware] is so much better with the
> > write-back caching on the card and the offload of the parity, it
> > seems to me that the minor increase in work of having to upgrade the
> > firmware if there's a buggy one is a highly acceptable trade-off to
> > the increased performance.
> >
> > The md driver still commits you to longer run queues since IO calls
> > to disk, parity calculator and the subsequent kflushd operations are
> > non-interruptible in the CPU.  A RAID card with write-back cache
> > releases the IO operation virtually instantaneously.
>
> It would seem that his comments have merit, as there appears to be
> work underway to move stripe operations outside of the spinlock:
>
> http://lwn.net/Articles/184102/
>
> What I'm curious about is this: for real-world situations, how much
> does this matter?  In other words, how hard do you have to push md
> raid5 before doing dedicated hardware raid5 becomes a real win?

hardware with battery backed write cache is going to beat the software
at small write traffic latency essentially all the time but it's got
nothing to do with the parity computation.

-dean
Re: Shrinking a RAID1--superblock problems
On Tue, 12 Dec 2006, Jonathan Terhorst wrote:

> I need to shrink a RAID1 array and am having trouble with the
> persistent superblock; namely, mdadm --grow doesn't seem to relocate
> it.  If I downsize the array and then shrink the corresponding
> partitions, the array fails since the superblock (which is normally
> located near the end of the device) now lays outside of the
> partitions.  Is there any easier way to deal with this than digging
> into the mdadm source, manually calculating the superblock offset and
> dd'ing it to the right spot?

i'd think it'd be easier to recreate the array using --assume-clean
after the shrink.  for raid1 it's extra easy because you don't need to
get the disk ordering correct.  in fact with raid1 you don't even need
to use mdadm --grow... you could do something like the following
(assuming you've already shrunk the filesystem):

mdadm --stop /dev/md0
mdadm --zero-superblock /dev/sda1
mdadm --zero-superblock /dev/sdb1
fdisk /dev/sda   ... shrink partition
fdisk /dev/sdb   ... shrink partition
mdadm --create --assume-clean --level=1 -n2 /dev/md0 /dev/sd[ab]1

heck that same technique works for raid0/4/5/6 and raid10 near and
offset layouts as well, doesn't it?  raid10 far layout definitely needs
blocks rearranged to shrink.  in these other modes you'd need to be
careful about recreating the array with the correct ordering of disks.

the zero-superblock step is an important defense against future problems
with "assemble every array i find"-types of initrds that are
unfortunately becoming common (i.e. debian and ubuntu).

-dean
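[For completeness, the "already shrunk the filesystem" precondition
might look like this for ext3; the size is hypothetical, and should end
up comfortably below the planned partition size:

    umount /dev/md0
    e2fsck -f /dev/md0
    resize2fs /dev/md0 40G
]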
Re: Observations of a failing disk
On Tue, 28 Nov 2006, Richard Scobie wrote:

> Anyway, my biggest concern is why
>
> echo repair > /sys/block/md5/md/sync_action
>
> appeared to have no effect at all, when I understand that it should
> re-write unreadable sectors?

i've had the same thing happen on a seagate 7200.8 pata 400GB... and
went through the same sequence of operations you described, and the dd
fixed it.

one theory was that i lucked out and the pending sectors were in the
unused portion of the disk near the md superblock... but since that's in
general only about 90KB of disk i was kind of skeptical.  it's certainly
possible, but seems unlikely.

another theory is that a pending sector doesn't always result in a read
error -- i.e. depending on temperature?  but the question is, why
wouldn't the disk try rewriting it if it does get a successful read.

i wish hard drives were a little less voodoo.

-dean
Re: Raid 1 (non) performance
On Wed, 15 Nov 2006, Magnus Naeslund(k) wrote:

# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[1]
      236725696 blocks [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[1]
      4192896 blocks [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
      4192832 blocks [2/2] [UU]

i see you have split /var and / on the same spindle... if your /home is on / then you're causing extra seek action by having two active filesystems on the same spindles. another option to consider is to make / small and mostly read-only and move /home to /var/home (and use a symlink or mount --bind to place it at /home). or just put everything in one big / filesystem. hopefully your swap isn't being used much anyhow.

try iostat -kx /dev/sd* 5 and see if the split is causing you troubles -- i/o activity on more than one partition at once.

I've tried to modify the queuing by doing this, to disable the write cache and enable CFQ. The CFQ choice is rather random.

for disk in sda sdb; do
    blktool /dev/$disk wcache off
    hdparm -q -W 0 /dev/$disk
done

turning off write caching is a recipe for disastrous performance on most ata disks... unfortunately. better to buy a UPS and set up nut or apcupsd or something to handle shutdown. or just take your chances.

-dean

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: safest way to swap in a new physical disk
On Tue, 14 Nov 2006, Will Sheffler wrote: Hi. What is the safest way to switch out a disk in a software raid array created with mdadm? I'm not talking about replacing a failed disk, I want to take a healthy disk in the array and swap it for another physical disk. Specifically, I have an array made up of 10 250gb software-raid partitions on 8 300gb disks and 2 250gb disks, plus a hot spare. I want to switch the 250s to new 300gb disks so everything matches. Is there a way to do this without risking a rebuild? I can't back everything up, so I want to be as risk-free as possible. I guess what I want is to do something like this: (1) Unmount the array (2) Un-create the array (3) Somehow exactly duplicate partition X to a partition Y on a new disk (4) Re-create array with X gone and Y in it's place (5) Check if the array is OK without changing/activating it (6) If there is a problem, switch from Y back to X and have it as though nothing changed The part I'm worried about is (3), as I've tried duplicating partition images before and it never works right. Is there a way to do this with mdadm? if you have a recent enough kernel (2.6.15 i think) and recent enough mdadm (2.2.x i think) you can do this all online without losing redundancy for more than a few seconds... i placed a copy of instructions and further discussions of what types of problems this method has here: http://arctic.org/~dean/proactive-raid5-disk-replacement.txt it's actually perfect for your situation. -dean - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid5 hang on get_active_stripe
and i haven't seen it either... neil do you think your latest patch was hiding the bug? 'cause there was an iteration of an earlier patch which didn't produce much spam in dmesg but the bug was still there, then there is the version below which spams dmesg a fair amount but i didn't see the bug in ~30 days.

btw i've upgraded that box to 2.6.18.2 without the patch (it had some conflicts)... haven't seen the bug yet though (~10 days so far).

hmm i wonder if i could reproduce it more rapidly if i lowered /sys/block/mdX/md/stripe_cache_size. i'll give that a go.

-dean

On Tue, 14 Nov 2006, Chris Allen wrote: You probably guessed that no matter what I did, I never, ever saw the problem when your trace was installed. I'd guess at some obscure timing-related problem. I can still trigger it consistently with a vanilla 2.6.17_SMP though, but again only when bitmaps are turned on.

Neil Brown wrote: On Tuesday October 10, [EMAIL PROTECTED] wrote: Very happy to. Let me know what you'd like me to do.

Cool thanks. At the end is a patch against 2.6.17.11, though it should apply against any later 2.6.17 kernel. Apply this and reboot. Then run

while true
do
    cat /sys/block/mdX/md/stripe_cache_active
    sleep 10
done > /dev/null

(maybe write a little script or whatever). Leave this running. It affects the check for "has raid5 hung". Make sure to change mdX to whatever is appropriate.

Occasionally look in the kernel logs for "plug problem:". if you find that, send me the surrounding text - there should be about a dozen lines following this one. Hopefully this will let me know which is the last thing to happen: a plug or an unplug. If the last is a plug, then the timer really should still be pending, but isn't (this is impossible). So I'll look more closely at that option. If the last is an unplug, then the 'Plugged' flag should really be clear but it isn't (this is impossible). So I'll look more closely at that option.

Dean is running this, but he only gets the hang every couple of weeks. If you get it more often, that would help me a lot.

Thanks, NeilBrown

diff ./.patches/orig/block/ll_rw_blk.c ./block/ll_rw_blk.c
--- ./.patches/orig/block/ll_rw_blk.c	2006-08-21 09:52:46.000000000 +1000
+++ ./block/ll_rw_blk.c	2006-10-05 11:33:32.000000000 +1000
@@ -1546,6 +1546,7 @@ static int ll_merge_requests_fn(request_
  * This is called with interrupts off and no requests on the queue and
  * with the queue lock held.
  */
+static atomic_t seq = ATOMIC_INIT(0);
 void blk_plug_device(request_queue_t *q)
 {
 	WARN_ON(!irqs_disabled());
@@ -1558,9 +1559,16 @@ void blk_plug_device(request_queue_t *q)
 		return;
 
 	if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
+		q->last_plug = jiffies;
+		q->plug_seq = atomic_read(&seq);
+		atomic_inc(&seq);
 		mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
 		blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
-	}
+	} else
+		q->last_plug_skip = jiffies;
+	if (!timer_pending(&q->unplug_timer) && !q->unplug_work.pending)
+		printk("Neither Timer or work are pending\n");
 }
 
 EXPORT_SYMBOL(blk_plug_device);
 
@@ -1573,10 +1581,17 @@ int blk_remove_plug(request_queue_t *q)
 {
 	WARN_ON(!irqs_disabled());
 
-	if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
+	if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
+		q->last_unplug_skip = jiffies;
 		return 0;
+	}
 
 	del_timer(&q->unplug_timer);
+	q->last_unplug = jiffies;
+	q->unplug_seq = atomic_read(&seq);
+	atomic_inc(&seq);
+	if (test_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags))
+		printk("queue still (or again) plugged\n");
 	return 1;
 }
 
@@ -1635,7 +1650,7 @@ static void blk_backing_dev_unplug(struc
 static void blk_unplug_work(void *data)
 {
 	request_queue_t *q = data;
-
+	q->last_unplug_work = jiffies;
 	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
 			q->rq.count[READ] + q->rq.count[WRITE]);
 
@@ -1649,6 +1664,7 @@ static void blk_unplug_timeout(unsigned
 	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
 			q->rq.count[READ] + q->rq.count[WRITE]);
 
+	q->last_unplug_timeout = jiffies;
 	kblockd_schedule_work(&q->unplug_work);
 }
 
diff ./.patches/orig/drivers/md/raid1.c ./drivers/md/raid1.c
--- ./.patches/orig/drivers/md/raid1.c	2006-08-10 17:28:01.000000000 +1000
+++ ./drivers/md/raid1.c	2006-09-04 21:58:31.000000000 +1000
@@ -1486,7 +1486,6 @@ static void raid1d(mddev_t *mddev)
 			d = conf->raid_disks;
 			d--;
 			rdev =
Re: RAID5 array showing as degraded after motherboard replacement
On Wed, 8 Nov 2006, James Lee wrote: However I'm still seeing the error messages in my dmesg (the ones I posted earlier), and they suggest that there is some kind of hardware fault (based on a quick Google of the error codes). So I'm a little confused. the fact that the error is in a geometry command really makes me wonder... did you compare the number of blocks on the device vs. what seems to be available when it's on the weird raid card? -dean - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is my RAID broken?
On Mon, 6 Nov 2006, Mikael Abrahamsson wrote: On Mon, 6 Nov 2006, Neil Brown wrote: So it looks like your machine recently crashed (power failure?) and it is restarting. Or you upgraded some part of the OS and now it'll do a resync every week or so (I think this is the debian default nowadays, don't know the interval though).

it should be only once a month... and it's just a check -- it reads everything and corrects errors. i think it's a great thing actually... way more useful than smart long self-tests because md can reconstruct read errors immediately -- before you lose redundancy in that stripe.

-dean

% cat /etc/cron.d/mdadm
#
# cron.d/mdadm -- schedules periodic redundancy checks of MD devices
#
# Copyright © martin f. krafft [EMAIL PROTECTED]
# distributed under the terms of the Artistic Licence 2.0
#
# $Id: mdadm.cron.d 147 2006-08-30 09:26:11Z madduck $
#
# By default, run at 01:06 on every Sunday, but do nothing unless the day of
# the month is less than or equal to 7. Thus, only run on the first Sunday of
# each month. crontab(5) sucks, unfortunately, in this regard; therefore this
# hack (see #380425).

6 1 * * 0 root [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ] && /usr/share/mdadm/checkarray --cron --all --quiet
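A minimal sketch of running the same check by hand and reading the result, assuming the array is /dev/md0 and a kernel new enough to have the sync_action interface:

echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat                       # shows check progress like a resync
cat /sys/block/md0/md/mismatch_cnt     # non-zero means inconsistencies were found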
Re: mdadm 2.5.5 external bitmap assemble problem
On Mon, 6 Nov 2006, Neil Brown wrote: hey i have another related question... external bitmaps seem to pose a bit of a chicken-and-egg problem. all of my filesystems are md devices. with an external bitmap i need at least one of the arrays to start, then have filesystems mounted, then have more arrays start... it just happens to work OK if i let debian unstale initramfs try to start all my arrays, it'll fail for the ones needing bitmap. then later /etc/init.d/mdadm-raid should start the array. (well it would if the bitmap= in mdadm.conf worked :) is it possible to put bitmaps on devices instead of files? mdadm seems to want a --force for that (because the device node exists already) and i haven't tried forcing it. although i suppose a 200KB partition would be kind of tiny but i could place the bitmap right beside the external transaction log for the filesystem on the raid5.

Create the root filesystem with --bitmap=internal, and store all the other bitmaps on that filesystem maybe?

yeah i only have the one external bitmap (it's for a large raid5)... so things will work fine once i apply your patch. thanks.

I don't know if it would work to have a bitmap on a device, but you can always mkfs the device, mount it, and put a bitmap on a file there??

yeah this was the first thing i tried after i found mdadm -b /dev/foo wasn't accepted... without modifying startup scripts there's no way to use any filesystem other than root... it's just due to ordering of init scripts:

# ls /etc/rcS.d | grep -i 'mount\|raid'
S02mountkernfs.sh
S04mountdevsubfs.sh
S25mdadm-raid
S35mountall.sh
S36mountall-bootclean.sh
S45mountnfs.sh
S46mountnfs-bootclean.sh

i'd need to run another mdadm-raid after the S35mountall, and then another mountall. anyhow, i don't think you need to change anything (except maybe a note in the docs somewhere), i'm just bringing it up as part of the experience of trying external bitmap. i suspect that in the wild and crazy direction debian and ubuntu are heading (ditching sysvinit for event-driven systems) it'll be easy to express the boot dependencies.

-dean

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID5 array showing as degraded after motherboard replacement
On Mon, 6 Nov 2006, James Lee wrote: Thanks for the reply Dean. I looked through dmesg output from the boot up, to check whether this was just an ordering issue during the system start up (since both evms and mdadm attempt to activate the array, which could cause things to go wrong...). Looking through the dmesg output though, it looks like the 'missing' disk is being detected before the array is assembled, but that the disk is throwing up errors. I've attached the full output of dmesg; grepping it for hde gives the following: [17179574.084000] ide2: BM-DMA at 0xd400-0xd407, BIOS settings: hde:DMA, hdf:DMA [17179574.38] hde: NetCell SyncRAID(TM) SR5000 JBOD, ATA DISK drive [17179575.312000] hde: max request size: 512KiB [17179575.312000] hde: 625134827 sectors (320069 MB), CHS=38912/255/63, (U)DMA [17179575.312000] hde: set_geometry_intr: status=0x51 { DriveReady SeekComplete Error } [17179575.312000] hde: set_geometry_intr: error=0x04 { DriveStatusError } [17179575.312000] hde: cache flushes supported is it possible that the NetCell SyncRAID implementation is stealing some of the sectors (even though it's marked JBOD)? anyhow it could be the disk is bad, but i'd still be tempted to see if the problem stays with the controller if you swap the disk with another in the array. -dean - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
On Mon, 6 Nov 2006, Neil Brown wrote: This creates a deep disconnect between udev and md. udev expects a device to appear first, then it created the device-special-file in /dev. md expect the device-special-file to exist first, and then created the device on the first open. could you create a special /dev/mdx device which is used to assemble/create arrays only? i mean literally mdx not mdX where X is a number. mdx would always be there if md module is loaded... so udev would see the driver appear and then create the /dev/mdx. then mdadm would use /dev/mdx to do assemble/creates/whatever and cause other devices to appear/disappear in a manner which udev is happy with. (much like how /dev/ptmx is used to create /dev/pts/N entries.) doesn't help legacy mdadm binaries... but seems like it fits the New World Order. or hm i suppose the New World Order is to eschew binary interfaces and suggest a /sys/class/md/ hierarchy with a bunch of files you have to splat ascii data into to cause an array to be created/assembled. -dean - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Checking individual drive state
On Sun, 5 Nov 2006, Bradshaw wrote: I've recently built a smallish RAID5 box as a storage area for my home network, using mdadm. However, one of the drives will not remain in the array for longer that around two days before it is removed. Readding it to the array does not throw any errors, leading me to believe that it's probably a problem with the controller, which is an add-in SATA card, as well as the other drive connected to it failing once. I don't know how to scan the one disk for bad sectors, stopping the array and doing an fsck or similar throws errors, so I need help in determining whether the disc itself is faulty. try swapping the cable first. after that swap ports with another disk and see if the problem follows the port or the disk. you can see if smartctl -a (from smartmontools) tells you anything interesting. (it can be quite difficult, to impossible, to understand smartctl -a output though. but if you've got errors in the SMART error log that's a good place to start.) If the controller is to be replaced, how would I go about migrating the two discs to the new controller whilst maintaining the array? it depends on which method you're using to assemble the array at boot time. in most cases if these aren't your root disks then a swap of two disks won't result in any troubles reassembling the array. other device renames may cause problems depending on your distribution though -- but generally when two devices swap names within an array you should be fine. you'll want to do the disk swap with the array offline (either shutdown the box or mdadm --stop the array). -dean - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
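A minimal sketch of the smartctl poking suggested above, assuming the suspect disk is /dev/sdc (device name illustrative):

smartctl -a /dev/sdc             # attributes plus the SMART error log
smartctl -t long /dev/sdc        # start a long self-test in the background
smartctl -l selftest /dev/sdc    # read the self-test log once it finishes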
Re: RAID5 array showing as degraded after motherboard replacement
On Sun, 5 Nov 2006, James Lee wrote: Hi there, I'm running a 5-drive software RAID5 array across two controllers. The motherboard in that PC recently died - I sent the board back for RMA. When I refitted the motherboard, connected up all the drives, and booted up I found that the array was being reported as degraded (though all the data on it is intact). I have 4 drives on the on-board controller and 1 drive on an XFX Revo 64 SATA controller card. The drive which is being reported as not being in the array is the one connected to the XFX controller. The OS can see that drive fine, and mdadm --examine on that drive shows that it is part of the array and that there are 5 active devices in the array. Doing mdadm --examine on one of the other four drives shows that the array has 4 active drives and one failed. mdadm --detail for the array also shows 4 active and one failed.

that means the array was assembled without the 5th disk and is currently degraded.

Now I haven't lost any data here and I know I can just force a resync of the array which is fine. However I'm concerned about how this has happened. One worry is that the XFX SATA controller is doing something funny to the drive. I've noticed that its BIOS has defaulted to RAID0 mode (even though there's only one drive on it) - I can't see how this would cause any particular problems here though. I guess it's possible that some data on the drive got corrupted when the motherboard failed...

no it's more likely the devices were renamed or the 5th device didn't come up before the array was assembled. it's possible that a different bios setting led to the device using a different driver than is in your initrd... but i'm just guessing.

Any ideas what could cause mdadm to report as I've described above (I've attached the output of these three commands)? I'm running Ubuntu Edgy, which is a 2.6.17.x kernel, and mdadm 2.4.1. In case it's relevant here, I created the array using EVMS...

i've never created an array with evms... but my guess is that it may have used mapped device names instead of the normal device names. take a look at /proc/mdstat and see what devices are in the array and use those as a template to find the name of the missing device. below i'll use /dev/sde1 as the example missing device and /dev/md0 as the example array.

first thing i'd try is something like this:

mdadm /dev/md0 -a /dev/sde1

which hot-adds the device into the array... which will start a resync. when the resync is done (cat /proc/mdstat) do this:

mdadm -Gb internal /dev/md0

which will add write-intent bitmaps to your device... which will avoid another long wait for a resync after the next reboot if the fix below doesn't help. then do this:

dpkg-reconfigure linux-image-`uname -r`

which will rebuild the initrd for your kernel... and if it was a driver change this should include the new driver into the initrd. then reboot and see if it comes up fine. if it doesn't, you can repeat the -a /dev/sde1 command above... the resync will be quick this time due to the bitmap... and we'll have to investigate further.

-dean

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
mdadm 2.5.5 external bitmap assemble problem
i think i've got my mdadm.conf set properly for an external bitmap -- but it doesn't seem to work. i can assemble from the command-line fine though:

# grep md4 /etc/mdadm/mdadm.conf
ARRAY /dev/md4 bitmap=/bitmap.md4 UUID=dbc3be0b:b5853930:a02e038c:13ba8cdc
# mdadm -A /dev/md4
mdadm: Could not open bitmap file
# mdadm -A --uuid=dbc3be0b:b5853930:a02e038c:13ba8cdc --bitmap=/bitmap.md4 /dev/md4
mdadm: /dev/md4 has been started with 5 drives and 1 spare.
# mdadm --version
mdadm - v2.5.5 - 23 October 2006

(this is on debian unstale)

btw -- mdadm seems to create the bitmap file with world readable perms. i doubt it matters, but 600 would seem like a better mode.

hey i have another related question... external bitmaps seem to pose a bit of a chicken-and-egg problem. all of my filesystems are md devices. with an external bitmap i need at least one of the arrays to start, then have filesystems mounted, then have more arrays start... it just happens to work OK if i let debian unstale initramfs try to start all my arrays, it'll fail for the ones needing bitmap. then later /etc/init.d/mdadm-raid should start the array. (well it would if the bitmap= in mdadm.conf worked :)

is it possible to put bitmaps on devices instead of files? mdadm seems to want a --force for that (because the device node exists already) and i haven't tried forcing it. although i suppose a 200KB partition would be kind of tiny but i could place the bitmap right beside the external transaction log for the filesystem on the raid5.

-dean

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID5/10 chunk size and ext2/3 stride parameter
On Sat, 4 Nov 2006, martin f krafft wrote: also sprach dean gaudet [EMAIL PROTECTED] [2006.11.03.2019 +0100]: I cannot find authoritative information about the relation between the RAID chunk size and the correct stride parameter to use when creating an ext2/3 filesystem. you know, it's interesting -- mkfs.xfs somehow gets the right sunit/swidth automatically from the underlying md device.

i don't know enough about xfs to be able to agree or disagree with you on that.

# mdadm --create --level=5 --raid-devices=4 --assume-clean --auto=yes /dev/md0 /dev/sd[abcd]1
mdadm: array /dev/md0 started.

with 64k chunks i assume...

yup.

# mkfs.xfs /dev/md0
meta-data=/dev/md0       isize=256    agcount=32, agsize=9157232 blks
         =               sectsz=4096  attr=0
data     =               bsize=4096   blocks=293031424, imaxpct=25
         =               sunit=16     swidth=48 blks, unwritten=1

sunit seems like the stride width i determined (64k chunks / 4k bsize), but what is swidth? Is it 64 * 3/4 because of the four device RAID5?

yup. and for a raid6 mkfs.xfs correctly gets sunit=16 swidth=32.

# mdadm --create --level=10 --layout=f2 --raid-devices=4 --assume-clean --auto=yes /dev/md0 /dev/sd[abcd]1
mdadm: array /dev/md0 started.
# mkfs.xfs -f /dev/md0
meta-data=/dev/md0       isize=256    agcount=32, agsize=6104816 blks
         =               sectsz=512   attr=0
data     =               bsize=4096   blocks=195354112, imaxpct=25
         =               sunit=16     swidth=64 blks, unwritten=1

okay, so as before, 16 stride size and 64 stripe width, because we're now dealing with mirrors.

# mdadm --create --level=10 --layout=n2 --raid-devices=4 --assume-clean --auto=yes /dev/md0 /dev/sd[abcd]1
mdadm: array /dev/md0 started.
# mkfs.xfs -f /dev/md0
meta-data=/dev/md0       isize=256    agcount=32, agsize=6104816 blks
         =               sectsz=512   attr=0
data     =               bsize=4096   blocks=195354112, imaxpct=25
         =               sunit=16     swidth=64 blks, unwritten=1

why not? in this case, -n2 and -f2 aren't any different, are they?

they're different in that with f2 you get essentially 4 disk raid0 read performance because the copies of each byte are half a disk away... so it looks like a raid0 on the first half of the disks, and another raid0 on the second half. in n2 the two copies are at the same offset... so it looks more like a 2 disk raid0 for reading and writing.

i'm not 100% certain what xfs uses them for -- you can actually change the values at mount time. so it probably uses them for either read scheduling or write layout or both.

-dean

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID5/10 chunk size and ext2/3 stride parameter
On Tue, 24 Oct 2006, martin f krafft wrote: Hi, I cannot find authoritative information about the relation between the RAID chunk size and the correct stride parameter to use when creating an ext2/3 filesystem.

you know, it's interesting -- mkfs.xfs somehow gets the right sunit/swidth automatically from the underlying md device. for example, on a box i'm testing:

# mdadm --create --level=5 --raid-devices=4 --assume-clean --auto=yes /dev/md0 /dev/sd[abcd]1
mdadm: array /dev/md0 started.
# mkfs.xfs /dev/md0
meta-data=/dev/md0       isize=256    agcount=32, agsize=9157232 blks
         =               sectsz=4096  attr=0
data     =               bsize=4096   blocks=293031424, imaxpct=25
         =               sunit=16     swidth=48 blks, unwritten=1
naming   =version 2      bsize=4096
log      =internal log   bsize=4096   blocks=32768, version=2
         =               sectsz=4096  sunit=1 blks
realtime =none           extsz=196608 blocks=0, rtextents=0
# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
# mdadm --zero-superblock /dev/sd[abcd]1
# mdadm --create --level=10 --layout=f2 --raid-devices=4 --assume-clean --auto=yes /dev/md0 /dev/sd[abcd]1
mdadm: array /dev/md0 started.
# mkfs.xfs -f /dev/md0
meta-data=/dev/md0       isize=256    agcount=32, agsize=6104816 blks
         =               sectsz=512   attr=0
data     =               bsize=4096   blocks=195354112, imaxpct=25
         =               sunit=16     swidth=64 blks, unwritten=1
naming   =version 2      bsize=4096
log      =internal log   bsize=4096   blocks=32768, version=1
         =               sectsz=512   sunit=0 blks
realtime =none           extsz=262144 blocks=0, rtextents=0

i wonder if the code could be copied into mkfs.ext3? although hmm, i don't think it gets raid10 n2 correct:

# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
# mdadm --zero-superblock /dev/sd[abcd]1
# mdadm --create --level=10 --layout=n2 --raid-devices=4 --assume-clean --auto=yes /dev/md0 /dev/sd[abcd]1
mdadm: array /dev/md0 started.
# mkfs.xfs -f /dev/md0
meta-data=/dev/md0       isize=256    agcount=32, agsize=6104816 blks
         =               sectsz=512   attr=0
data     =               bsize=4096   blocks=195354112, imaxpct=25
         =               sunit=16     swidth=64 blks, unwritten=1
naming   =version 2      bsize=4096
log      =internal log   bsize=4096   blocks=32768, version=1
         =               sectsz=512   sunit=0 blks
realtime =none           extsz=262144 blocks=0, rtextents=0

in a near 2 layout i would expect sunit=16, swidth=32 ... but swidth=64 probably doesn't hurt.

My understanding is that (block * stride) == (chunk). So if I create a default RAID5/10 with 64k chunks, and create a filesystem with 4k blocks on it, I should choose stride 64k/4k = 16.

that's how i think it works -- i don't think ext[23] have a concept of stripe width like xfs does. they just want to know how to avoid putting all the critical data on one disk (which needs only the chunk size). but you should probably ask on the linux-ext4 mailing list.

Is the chunk size of an array equal to the stripe size? Or is it (n-1)*chunk size for RAID5 and (n/2)*chunk size for a plain near=2 RAID10? Also, I understand that it makes no sense to use stride for RAID1 as there are no stripes in that sense. But for RAID10 it makes sense, right?

yep.

-dean

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
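A minimal sketch of applying the stride math from this thread when creating an ext3 filesystem, assuming 4k blocks on a 64k-chunk array (mke2fs's -E stride option; the exact invocation is illustrative):

# stride = chunk size / block size = 64k / 4k = 16
mke2fs -j -b 4096 -E stride=16 /dev/md0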
Re: md array numbering is messed up
On Mon, 30 Oct 2006, Brad Campbell wrote: Michael Tokarev wrote: My guess is that it's using the mdrun shell script - the same as on Debian. It's a long story, the thing is quite ugly and messy and does messy things too, but they say it's compatibility stuff and continue shipping it. ... I'd suggest you are probably correct. By default on Ubuntu 6.06:

[EMAIL PROTECTED]:~$ cat /etc/init.d/mdadm-raid
#!/bin/sh
#
# Start any arrays which are described in /etc/mdadm/mdadm.conf and which are
# not running already.
#
# Copyright (c) 2001-2004 Mario Joußen [EMAIL PROTECTED]
# Distributable under the terms of the GNU GPL version 2.

MDADM=/sbin/mdadm
MDRUN=/sbin/mdrun

fwiw mdrun is finally on its way out. the debian unstable mdadm package is full of new goodness (initramfs goodness, 2.5.x mdadm featurefulness, monthly full array check goodness). ubuntu folks should copy it again before they finalize edgy.

-dean

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: why partition arrays?
On Tue, 24 Oct 2006, Bill Davidsen wrote: My read on LVM is that (a) it's one more thing for the admin to learn, (b) because it's seldom used the admin will be working from documentation if it has a problem, and (c) there is no bug-free software, therefore the use of LVM on top of RAID will be less reliable than a RAID-only solution. I can't quantify that, the net effect may be too small to measure. However, the cost and chance of a finger check from (a) and (b) are significant.

this is essentially why i gave up on LVM as well. add in the following tidbits:

- snapshots stopped working in 2.6. may be fixed by now, but i gave up hope and this was the biggest feature i desired from LVM.

- it's way better for performance to have only one active filesystem on a group of spindles

- you can emulate pvmove with md superblockless raid1 sufficiently well for most purposes (although as we've discussed here it would be nice if md directly supported proactive replacement)

and more i'm forgetting.

-dean

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
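A minimal sketch of the pvmove emulation mentioned above, assuming the data on /dev/sdc1 should move to /dev/sdd1 while staying in use (device names are illustrative; --build creates a superblockless raid1, so the block layout is untouched):

mdadm --build /dev/md9 -ayes --level=1 --raid-devices=2 /dev/sdc1 missing
# point the consumer (filesystem, outer array, ...) at /dev/md9 instead of /dev/sdc1
mdadm /dev/md9 -a /dev/sdd1      # kicks off a full copy onto the new device
cat /proc/mdstat                 # wait here until the resync completes
mdadm /dev/md9 -f /dev/sdc1 -r /dev/sdc1
mdadm --stop /dev/md9            # then switch the consumer to /dev/sdd1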
Re: USB and raid... Device names change
On Tue, 19 Sep 2006, Eduardo Jacob wrote: DEVICE /dev/raid111 /dev/raid121 ARRAY /dev/md0 level=raid1 num-devices=2 UUID=1369e13f:eb4fa45c:6d4b9c2a:8196aa1b try using DEVICE partitions... then mdadm -As /dev/md0 will scan all available partitions for raid components with UUID=1369e13f:eb4fa45c:6d4b9c2a:8196aa1b. so it won't matter which sdX they are. -dean - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
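A minimal mdadm.conf sketch for the above, reusing the UUID from the original post (the rest is boilerplate):

DEVICE partitions
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=1369e13f:eb4fa45c:6d4b9c2a:8196aa1b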
Re: access *existing* array from knoppix
On Tue, 12 Sep 2006, Dexter Filmore wrote: On Tuesday, 12 September 2006 16:08, Justin Piszcz wrote: /dev/MAKEDEV /dev/md0 also make sure the SW raid modules etc are loaded if necessary. Won't work, MAKEDEV doesn't know how to create [/dev/]md0.

echo 'DEVICE partitions' > /tmp/mdadm.conf
mdadm --detail --scan --config=/tmp/mdadm.conf >> /tmp/mdadm.conf

take a look in /tmp/mdadm.conf ... your root array should be listed.

mdadm --assemble --config=/tmp/mdadm.conf --auto=yes /dev/md0

-dean

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: UUID's
On Sat, 9 Sep 2006, Richard Scobie wrote: To remove all doubt about what is assembled where, I though going to: DEVICE partitions MAILADDR root ARRAY /dev/md3 UUID=xyz etc. would be more secure. Is this correct thinking on my part? yup. mdadm can generate it all for you... there's an example on the man page. basically you just want to paste the output of mdadm --detail --scan --config=partitions into your mdadm.conf. -dean - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Care and feeding of RAID?
On Tue, 5 Sep 2006, Paul Waldo wrote: What about bitmaps? Nobody has mentioned them. It is my understanding that you just turn them on with mdadm /dev/mdX -b internal. Any caveats for this? bitmaps have been working great for me on a raid5 and raid1. it makes it that much more tolerable when i accidentally crash the box and don't have to wait forever for a resync. i don't notice the extra write traffic all that much... under heavy traffic i see about 3 writes/s to the spare disk in the raid5 -- i assume those are all due to the bitmap in the superblock on the spare. i've considered using an external bitmap, i forget why i didn't do that initially. the filesystem on the raid5 already has an external journal on raid1. -dean - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
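A minimal sketch of adding and removing an internal bitmap, assuming /dev/md0 (the --bitmap-chunk value is illustrative):

mdadm -G -b internal --bitmap-chunk=1024 /dev/md0   # add a write-intent bitmap
cat /proc/mdstat                                    # array now shows a "bitmap:" line
mdadm -G -b none /dev/md0                           # drop it again if the extra writes hurt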
Re: proactive-raid-disk-replacement
On Fri, 8 Sep 2006, Michael Tokarev wrote: Recently Dean Gaudet, in thread titled 'Feature Request/Suggestion - Drive Linking', mentioned his document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt I've read it, and have some umm.. concerns. Here's why:

mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
mdadm /dev/md4 -r /dev/sdh1
mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
mdadm /dev/md4 --re-add /dev/md5
mdadm /dev/md5 -a /dev/sdh1
... wait a few hours for md5 resync...

And here's the problem. While the new disk, sdh1, is resynced from the old, probably failing disk sde1, chances are high that there will be an unreadable block on sde1. And this means the whole thing will not work -- md5 initially contained one working drive (sde1) and one spare (sdh1) which is being converted (resynced) to a working disk. But after a read error on sde1, md5 will contain one failed drive and one spare -- for raid1 that's a fatal combination. While at the same time, it's perfectly easy to reconstruct this failing block from the other component devices of md4.

this statement is an argument for native support for this type of activity in md itself.

That is to say: this way of replacing a disk in a software raid array isn't much better than just removing the old drive and adding a new one.

hmm... i'm not sure i agree. in your proposal you're guaranteed to have no redundancy while you wait for the new disk to sync in the raid5. in my proposal the probability that you'll retain redundancy through the entire process is non-zero. we can debate how non-zero it is, but non-zero is greater than zero.

i'll admit it depends a heck of a lot on how long you wait to replace your disks, but i prefer to replace mine well before they get to the point where just reading the entire disk is guaranteed to result in problems.

And if the drive you're replacing is failing (according to SMART for example), this method is more likely to fail.

my practice is to run regular SMART long self tests, which tend to find Current_Pending_Sectors (which are generally read errors waiting to happen) and then launch a repair sync action... that generally drops the Current_Pending_Sector back to zero, either through a realloc or just simply rewriting the block. if it's a realloc then i consider if there's enough of them to warrant replacing the disk...

so for me the chances of a read error while doing the raid1 thing aren't as high as they could be... but yeah you've convinced me this solution isn't good enough.

-dean

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: UUID's
On Sat, 9 Sep 2006, Richard Scobie wrote: If I have specified an array in mdadm.conf using UUID's: ARRAY /dev/md0 UUID=3aaa0122:29827cfa:5331ad66:ca767371 and I replace a failed drive in the array, will the new drive be given the previous UUID, or do I need to upate the mdadm.conf entry? once you do the mdadm /dev/mdX -a /dev/newdrive the new drive will have the UUID. no need to update the mdadm.conf for the UUID... however if you're using DEVICE foo where foo is not partitions then you should make sure foo includes the new drive. (DEVICE partitions is recommended.) -dean - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Feature Request/Suggestion - Drive Linking
On Mon, 4 Sep 2006, Bill Davidsen wrote: But I think most of the logic exists, the hardest part would be deciding what to do. The existing code looks as if it could be hooked to do this far more easily than writing new. In fact, several suggested recovery schemes involve stopping the RAID5, replacing the failing drive with a created RAID1, etc. So the method is valid, it would just be nice to have it happen without human intervention.

you don't actually have to stop the raid5 if you're using bitmaps... you can just remove the disk, create a (superblockless) raid1 and put the raid1 back in place. the whole process could be handled a lot like mdadm handles spare groups already... there isn't a lot more kernel support required.

the largest problem is if a power failure occurs before the process finishes. i'm 95% certain that even during a reconstruction, raid1 writes go to all copies even if the write is beyond the current sync position[1] -- so the raid5 superblock would definitely have been written to the partial disk... so that means on a reboot there'll be two disks which look like they're both the same (valid) component of the raid5, and one of them definitely isn't.

maybe there's some trick to handle this situation -- aside from ensuring the array won't come up automatically on reboot until after the process has finished. one way to handle it would be to have an option for raid1 resync which suppresses writes which are beyond the resync position... then you could zero the new disk's superblock to start with, and then start up the resync -- then it won't have a valid superblock until the entire disk is copied.

-dean

[1] there's normally a really good reason for raid1 to mirror all writes even if they're beyond the resync point... consider the case where you have a system crash and have 2 essentially identical mirrors which then need a resync... and the source disk dies during the resync. if all writes have been mirrored then the other disk is already usable (in fact it's essentially arbitrary which of the mirrors was used for the resync source after the crash -- they're all equally (un)likely to have the most current data)... without bitmaps this sort of thing is a common scenario and certainly saved my data more than once.

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID-5 recovery
On Sun, 3 Sep 2006, Clive Messer wrote: This leads me to a question. I understand from reading the linux-raid archives that the current behaviour when rebuilding with a single badblock on another disk is for that disk to also be kicked from the array.

that's not quite the current behaviour. since 2.6.14 or .15 or so md will reconstruct bad blocks from other disks and try writing them. it's only when this fails repeatedly that it knocks the disk out of the array.

-dean

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Resize on dirty array?
On Sun, 13 Aug 2006, dean gaudet wrote: On Fri, 11 Aug 2006, David Rees wrote: On 8/11/06, dean gaudet [EMAIL PROTECTED] wrote: On Fri, 11 Aug 2006, David Rees wrote: On 8/10/06, dean gaudet [EMAIL PROTECTED] wrote: - set up smartd to run long self tests once a month. (stagger it every few days so that your disks aren't doing self-tests at the same time) I personally prefer to do a long self-test once a week, a month seems like a lot of time for something to go wrong. unfortunately i found some drives (seagate 400 pata) had a rather negative effect on performance while doing self-test. Interesting that you noted negative performance, but I typically schedule the tests for off-hours anyway where performance isn't critical. How much of a performance hit did you notice? i never benchmarked it explicitly. iirc the problem was generally metadata performance... and became less of an issue when i moved the filesystem log off the raid5 onto a raid1. unfortunately there aren't really any off hours for this system.

the problem reappeared... so i can provide some data. one of the 400GB seagates has been stuck at 20% of a SMART long self test for over 2 days now, and the self-test itself has been going for about 4.5 days total. a typical iostat -x /dev/sd[cdfgh] 30 sample looks like this:

Device:  rrqm/s  wrqm/s    r/s    w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc       90.94  137.52  14.70  25.76  841.32  1360.35     54.43      0.94  23.30  10.30  41.68
sdd       93.67  140.52  14.96  22.06  863.98  1354.75     59.93      0.91  24.50  12.17  45.05
sdf       92.84  136.85  15.36  26.39  857.85  1360.35     53.13      0.88  21.04  10.59  44.21
sdg       87.74  137.82  14.23  24.86  807.73  1355.55     55.35      0.85  21.86  11.25  43.99
sdh       87.20  134.56  14.96  28.29  810.13  1356.88     50.10      1.90  43.72  20.02  86.60

those 5 are in a raid5, so their io should be relatively even... notice the await, svctm and %util of sdh compared to the other 4. sdh is the one with the exceptionally slow going SMART long self-test. i assume it's still making progress because the effect is measurable in iostat.

-dean

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Feature Request/Suggestion - Drive Linking
On Wed, 30 Aug 2006, Neil Bortnak wrote: Hi Everybody, I had this major recovery last week after a hardware failure monkeyed things up pretty badly. About half way though I had a couple of ideas and I thought I'd suggest/ask them. 1) Drive Linking: So let's say I have a 6 disk RAID5 array and I have reason to believe one of the drives will fail (funny noises, SMART warnings or it's *really* slow compared to the other drives, etc). It would be nice to put in a new drive, link it to the failing disk so that it copies all of the data to the new one and mirrors new writes as they happen. http://arctic.org/~dean/proactive-raid5-disk-replacement.txt works for any raid level actually. 2) This sort of brings up a subject I'm getting increasingly paranoid about. It seems to me that if disk 1 develops a unrecoverable error at block 500 and disk 4 develops one at 55,000 I'm going to get a double disk failure as soon as one of the bad blocks is read (or some other system problem -makes it look like- some random block is unrecoverable). Such an error should not bring the whole thing to a crashing halt. I know I can recover from that sort of error manually, but yuk. Neil made some improvements in this area as of 2.6.15... when md gets a read error it won't knock the entire drive out immediately -- it first attempts to reconstruct the sectors from the other drives and write them back. this covers a lot of the failure cases because the drive will either successfully complete the write in-place, or use its reallocation pool. the kernel logs when it makes such a correction (but the log wasn't very informative until 2.6.18ish i think). if you watch SMART data (either through smartd logging changes for you, or if you diff the output regularly) you can see this activity happen as well. you can also use the check/repair sync_actions to force this to happen when you know a disk has a Current_Pending_Sector (i.e. pending read error). -dean - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is mdadm --create safe for existing arrays ?
On Wed, 16 Aug 2006, Peter Greis wrote: So, how do I change / and /boot to make the super blocks persistent? Is it safe to run mdadm --create /dev/md0 --raid-devices=2 --level=1 /dev/sda1 /dev/sdb1 without losing any data?

boot a rescue disk
shrink the filesystems by a few MB to accommodate the superblock
mdadm --create /dev/md0 --raid-devices=2 --level=1 /dev/sda1 missing
mdadm /dev/md0 -a /dev/sdb1
grow the filesystem

you could probably get away with an --assume-clean and no resync if you know the array is clean... just don't forget to shrink/grow the filesystem.

-dean

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Resize on dirty array?
On Fri, 11 Aug 2006, David Rees wrote: On 8/11/06, dean gaudet [EMAIL PROTECTED] wrote: On Fri, 11 Aug 2006, David Rees wrote: On 8/10/06, dean gaudet [EMAIL PROTECTED] wrote: - set up smartd to run long self tests once a month. (stagger it every few days so that your disks aren't doing self-tests at the same time) I personally prefer to do a long self-test once a week, a month seems like a lot of time for something to go wrong. unfortunately i found some drives (seagate 400 pata) had a rather negative effect on performance while doing self-test. Interesting that you noted negative performance, but I typically schedule the tests for off-hours anyway where performance isn't critical. How much of a performance hit did you notice? i never benchmarked it explicitly. iirc the problem was generally metadata performance... and became less of an issue when i moved the filesystem log off the raid5 onto a raid1. unfortunately there aren't really any off hours for this system. -dean - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Resize on dirty array?
suggestions:

- set up smartd to run long self tests once a month. (stagger it every few days so that your disks aren't doing self-tests at the same time; see the smartd.conf sketch after this message)

- run 2.6.15 or later so md supports repairing read errors from the other drives...

- run 2.6.16 or later so you get the check and repair sync_actions in /sys/block/mdX/md/sync_action (i think 2.6.16.x still has a bug where you have to echo a random word other than repair to sync_action to get a repair to start... wrong sense on a strcmp, fixed in 2.6.17).

- run nightly diffs of smartctl -a output on all your drives so you see when one of them reports problems in the smart self test or otherwise has a Current_Pending_Sectors or Realloc event... then launch a repair sync_action.

- proactively replace your disks every couple years (i prefer to replace busy disks before 3 years).

-dean

On Wed, 9 Aug 2006, James Peverill wrote: In this case the raid WAS the backup... however it seems it turned out to be less reliable than the single disks it was supporting. In the future I think I'll make sure my disks have varying ages so they don't fail all at once. James

RAID is no excuse for backups.

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
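A minimal smartd.conf sketch for the staggered monthly long self-tests suggested above (device names and the chosen days are illustrative):

# long self-test at 3am, one disk every three days,
# so the disks never self-test simultaneously
/dev/sda -a -s L/../01/./03
/dev/sdb -a -s L/../04/./03
/dev/sdc -a -s L/../07/./03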
Re: Still can't get md arrays that were started from an initrd to shutdown
On Mon, 17 Jul 2006, Christian Pernegger wrote: The problem seems to affect only arrays that are started via an initrd, even if they do not have the root filesystem on them. That's all arrays if they're either managed by EVMS or the ramdisk-creator is initramfs-tools. For yaird-generated initrds only the array with root on it is affected. with lvm you have to stop lvm before you can stop the arrays... i wouldn't be surprised if evms has the same issue... of course this *should* happen cleanly on shutdown assuming evms is also being shutdown... but maybe that gives you something to look for. -dean - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: proactive raid5 disk replacement success (using bitmap + raid1)
well that part is optional... i wasn't replacing the disk right away anyhow -- it had just exhibited its first surface error during SMART and i thought i'd try moving the data elsewhere just for the experience of it.

-dean

On Thu, 22 Jun 2006, Ming Zhang wrote: Hi Dean Thanks a lot for sharing this. I don't quite understand these 2 commands. Why would we want to add a pre-failing disk back to md4?

mdadm --zero-superblock /dev/sde1
mdadm /dev/md4 -a /dev/sde1

Ming

On Sun, 2006-04-23 at 18:40 -0700, dean gaudet wrote: i had a disk in a raid5 which i wanted to clone onto the hot spare... without going offline and without long periods without redundancy. a few folks have discussed using bitmaps and temporary (superblockless) raid1 mappings to do this... i'm not sure anyone has tried / reported success though. this is my success report.

setup info:
- kernel version 2.6.16.9 (as packaged by debian)
- mdadm version 2.4.1
- /dev/md4 is the raid5
- /dev/sde1 is the disk in md4 i want to clone from
- /dev/sdh1 is the hot spare from md4, and is the clone target
- /dev/md5 is an unused md device name

here are the exact commands i issued:

mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
mdadm /dev/md4 -r /dev/sdh1
mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
mdadm /dev/md4 --re-add /dev/md5
mdadm /dev/md5 -a /dev/sdh1
... wait a few hours for md5 resync...
mdadm /dev/md4 -f /dev/md5 -r /dev/md5
mdadm --stop /dev/md5
mdadm /dev/md4 --re-add /dev/sdh1
mdadm --zero-superblock /dev/sde1
mdadm /dev/md4 -a /dev/sde1

this sort of thing shouldn't be hard to script :)

the only times i was without full redundancy was briefly between the -r and --re-add commands... and with bitmap support the raid5 resync for each of those --re-adds was essentially zero.

thanks Neil (and others)!

-dean

p.s. it's absolutely necessary to use --build for the temporary raid1 ... if you use --create mdadm will rightfully tell you it's already a raid component and if you --force it then you'll trash the raid5 superblock and it won't fit into the raid5 any more...

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
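Since the message says "this sort of thing shouldn't be hard to script", a sketch of what such a script might look like, with the array, source disk, target disk and scratch raid1 as parameters (untested; it assumes nothing else is resyncing at the same time, and watching /proc/mdstat by hand at each step is still wise):

#!/bin/sh
# usage: clone-component.sh /dev/md4 /dev/sde1 /dev/sdh1 /dev/md5
set -e
md=$1 src=$2 dst=$3 tmp=$4

mdadm -Gb internal --bitmap-chunk=1024 $md    # bitmap makes the --re-adds cheap
mdadm $md -r $dst                             # pull out the hot spare
mdadm $md -f $src -r $src                     # fail/remove the disk being cloned
mdadm --build $tmp -ayes --level=1 --raid-devices=2 $src missing
mdadm $md --re-add $tmp                       # the raid1 stands in for the old disk
mdadm $tmp -a $dst                            # start cloning onto the new disk
while grep -q recovery /proc/mdstat; do sleep 60; done
mdadm $md -f $tmp -r $tmp
mdadm --stop $tmp
mdadm $md --re-add $dst                       # the new disk takes over the slot
mdadm --zero-superblock $src
mdadm $md -a $src                             # old disk becomes the new spare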
Re: raid5 hang on get_active_stripe
On Wed, 31 May 2006, Neil Brown wrote: On Tuesday May 30, [EMAIL PROTECTED] wrote: actually i think the rate is higher... i'm not sure why, but klogd doesn't seem to keep up with it: [EMAIL PROTECTED]:~# grep -c kblockd_schedule_work /var/log/messages 31 [EMAIL PROTECTED]:~# dmesg | grep -c kblockd_schedule_work 8192 # grep 'last message repeated' /var/log/messages ?? um hi, of course :) the paste below is approximately correct. -dean [EMAIL PROTECTED]:~# egrep 'kblockd_schedule_work|last message repeated' /var/log/messages May 30 17:05:09 localhost kernel: kblockd_schedule_work failed May 30 17:05:59 localhost kernel: kblockd_schedule_work failed May 30 17:08:16 localhost kernel: kblockd_schedule_work failed May 30 17:10:51 localhost kernel: kblockd_schedule_work failed May 30 17:11:51 localhost kernel: kblockd_schedule_work failed May 30 17:12:46 localhost kernel: kblockd_schedule_work failed May 30 17:12:56 localhost last message repeated 22 times May 30 17:14:14 localhost kernel: kblockd_schedule_work failed May 30 17:16:57 localhost kernel: kblockd_schedule_work failed May 30 17:17:00 localhost last message repeated 83 times May 30 17:17:02 localhost kernel: kblockd_schedule_work failed May 30 17:17:33 localhost last message repeated 950 times May 30 17:18:34 localhost last message repeated 2218 times May 30 17:19:35 localhost last message repeated 1581 times May 30 17:20:01 localhost last message repeated 579 times May 30 17:20:02 localhost kernel: kblockd_schedule_work failed May 30 17:20:02 localhost kernel: kblockd_schedule_work failed May 30 17:20:02 localhost kernel: kblockd_schedule_work failed May 30 17:20:02 localhost last message repeated 23 times May 30 17:20:03 localhost kernel: kblockd_schedule_work failed May 30 17:20:34 localhost last message repeated 1058 times May 30 17:21:35 localhost last message repeated 2171 times May 30 17:22:36 localhost last message repeated 2305 times May 30 17:23:37 localhost last message repeated 2311 times May 30 17:24:38 localhost last message repeated 1993 times May 30 17:25:01 localhost last message repeated 702 times May 30 17:25:02 localhost kernel: kblockd_schedule_work failed May 30 17:25:02 localhost last message repeated 15 times May 30 17:25:02 localhost kernel: kblockd_schedule_work failed May 30 17:25:02 localhost last message repeated 12 times May 30 17:25:03 localhost kernel: kblockd_schedule_work failed May 30 17:25:34 localhost last message repeated 1061 times May 30 17:26:35 localhost last message repeated 2009 times May 30 17:27:36 localhost last message repeated 1941 times May 30 17:28:37 localhost last message repeated 2345 times May 30 17:29:38 localhost last message repeated 2367 times May 30 17:30:01 localhost last message repeated 870 times May 30 17:30:01 localhost kernel: kblockd_schedule_work failed May 30 17:30:01 localhost last message repeated 45 times May 30 17:30:02 localhost kernel: kblockd_schedule_work failed May 30 17:30:33 localhost last message repeated 1180 times May 30 17:31:34 localhost last message repeated 2062 times May 30 17:32:34 localhost last message repeated 2277 times May 30 17:32:36 localhost kernel: kblockd_schedule_work failed May 30 17:33:07 localhost last message repeated 1114 times May 30 17:34:08 localhost last message repeated 2308 times May 30 17:35:01 localhost last message repeated 1941 times May 30 17:35:01 localhost kernel: kblockd_schedule_work failed May 30 17:35:02 localhost last message repeated 20 times May 30 17:35:02 localhost kernel: kblockd_schedule_work failed May 30 
17:35:33 localhost last message repeated 1051 times May 30 17:36:34 localhost last message repeated 2002 times May 30 17:37:35 localhost last message repeated 1644 times May 30 17:38:36 localhost last message repeated 1731 times May 30 17:39:37 localhost last message repeated 1844 times May 30 17:40:01 localhost last message repeated 817 times May 30 17:40:02 localhost kernel: kblockd_schedule_work failed May 30 17:40:02 localhost last message repeated 39 times May 30 17:40:02 localhost kernel: kblockd_schedule_work failed May 30 17:40:02 localhost last message repeated 12 times May 30 17:40:03 localhost kernel: kblockd_schedule_work failed May 30 17:40:34 localhost last message repeated 1051 times May 30 17:41:35 localhost last message repeated 1576 times May 30 17:42:36 localhost last message repeated 2000 times May 30 17:43:37 localhost last message repeated 2058 times May 30 17:44:15 localhost last message repeated 1337 times May 30 17:44:15 localhost kernel: kblockd_schedule_work failed May 30 17:44:46 localhost last message repeated 1016 times May 30 17:45:01 localhost last message repeated 432 times May 30 17:45:02 localhost kernel: kblockd_schedule_work failed May 30 17:45:02 localhost kernel: kblockd_schedule_work failed May 30 17:45:33 localhost last message repeated 1229 times May 30 17:46:34 localhost last message repeated 2552 times May 30 17:47:36 localhost last message repeated
Re: raid5 hang on get_active_stripe
On Sun, 28 May 2006, Neil Brown wrote: The following patch adds some more tracing to raid5, and might fix a subtle bug in ll_rw_blk, though it is an incredibly long shot that this could be affecting raid5 (if it is, I'll have to assume there is another bug somewhere). It certainly doesn't break ll_rw_blk. Whether it actually fixes something I'm not sure. If you could try with these on top of the previous patches I'd really appreciate it. When you read from /stripe_cache_active, it should trigger a (cryptic) kernel message within the next 15 seconds. If I could get the contents of that file and the kernel messages, that should help.

got the hang again... attached is the dmesg with the cryptic messages. i didn't think to grab the task dump this time though. hope there's a clue in this one :) but send me another patch if you need more data.

-dean

neemlark:/sys/block/md4/md# cat stripe_cache_size
256
neemlark:/sys/block/md4/md# cat stripe_cache_active
251 0 preread plugged bitlist=0 delaylist=251
neemlark:/sys/block/md4/md# cat stripe_cache_active
251 0 preread plugged bitlist=0 delaylist=251
neemlark:/sys/block/md4/md# echo 512 > stripe_cache_size
neemlark:/sys/block/md4/md# cat stripe_cache_active
512 292 preread not plugged bitlist=0 delaylist=32
neemlark:/sys/block/md4/md# cat stripe_cache_active
512 292 preread not plugged bitlist=0 delaylist=32
neemlark:/sys/block/md4/md# cat stripe_cache_active
445 0 preread not plugged bitlist=0 delaylist=73
neemlark:/sys/block/md4/md# cat stripe_cache_active
480 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
413 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
13 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
493 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
487 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
405 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512 1 preread not plugged bitlist=0 delaylist=28
neemlark:/sys/block/md4/md# cat stripe_cache_active
512 84 preread not plugged bitlist=0 delaylist=69
neemlark:/sys/block/md4/md# cat stripe_cache_active
512 69 preread not plugged bitlist=0 delaylist=56
neemlark:/sys/block/md4/md# cat stripe_cache_active
512 41 preread not plugged bitlist=0 delaylist=38
neemlark:/sys/block/md4/md# cat stripe_cache_active
512 10 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
453 3 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
480 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512 14 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
477 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
476 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
486 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
480 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
384 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
387 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
462 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
480 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
448 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
501 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
476 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
416 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
386 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
512 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
434 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
406 0 preread not plugged bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
447 0 preread not plugged bitlist=0
Re: [PATCH] mdadm 2.5 (Was: ANNOUNCE: mdadm 2.5 - A tool for managing Soft RAID under Linux)
On Sun, 28 May 2006, Luca Berra wrote: - mdadm-2.5-rand.patch Posix dictates rand() versus bsd random() function, and dietlibc deprecated random(), so switch to srand()/rand() and make everybody happy. fwiw... lots of rand()s tend to suck... and RAND_MAX may not be large enough for this use. glibc rand() is the same as random(). do you know if dietlibc's rand() is good enough? -dean
Re: [PATCH] mdadm 2.5 (Was: ANNOUNCE: mdadm 2.5 - A tool for managing Soft RAID under Linux)
On Sun, 28 May 2006, Luca Berra wrote: dietlibc rand() and random() are the same function. but random will throw a warning saying it is deprecated. that's terribly obnoxious... it's never going to be deprecated, there are only approximately a bazillion programs using random(). -dean
Re: raid5 hang on get_active_stripe
On Tue, 23 May 2006, Neil Brown wrote: I've spent all morning looking at this and while I cannot see what is happening I did find a couple of small bugs, so that is good... I've attached three patches. The first two fix two small bugs (I think). The last adds some extra information to /sys/block/mdX/md/stripe_cache_active They are against 2.6.16.11. If you could apply them and if the problem recurs, report the content of stripe_cache_active several times before and after changing it, just like you did last time, that might help throw some light on the situation. i applied them against 2.6.16.18 and two days later i got my first hang... below is the stripe_cache foo. thanks -dean

neemlark:~# cd /sys/block/md4/md/
neemlark:/sys/block/md4/md# cat stripe_cache_active
255 0 preread bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active
255 0 preread bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active
255 0 preread bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active
255 0 preread bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active
255 0 preread bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_size
256
neemlark:/sys/block/md4/md# echo 512 > stripe_cache_size
neemlark:/sys/block/md4/md# cat stripe_cache_active
474 187 preread bitlist=0 delaylist=222
neemlark:/sys/block/md4/md# cat stripe_cache_active
438 222 preread bitlist=0 delaylist=72
neemlark:/sys/block/md4/md# cat stripe_cache_active
438 222 preread bitlist=0 delaylist=72
neemlark:/sys/block/md4/md# cat stripe_cache_active
469 222 preread bitlist=0 delaylist=72
neemlark:/sys/block/md4/md# cat stripe_cache_active
512 72 preread bitlist=160 delaylist=103
neemlark:/sys/block/md4/md# cat stripe_cache_active
1 0 preread bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
2 0 preread bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
0 0 preread bitlist=0 delaylist=0
neemlark:/sys/block/md4/md# cat stripe_cache_active
2 0 preread bitlist=0 delaylist=0
neemlark:/sys/block/md4/md#

md4 : active raid5 sdd1[0] sde1[5](S) sdh1[4] sdg1[3] sdf1[2] sdc1[1]
      1562834944 blocks level 5, 128k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 10/187 pages [40KB], 1024KB chunk
Re: raid5 hang on get_active_stripe
On Sat, 27 May 2006, Neil Brown wrote: On Friday May 26, [EMAIL PROTECTED] wrote: On Tue, 23 May 2006, Neil Brown wrote: i applied them against 2.6.16.18 and two days later i got my first hang... below is the stripe_cache foo. thanks -dean

neemlark:~# cd /sys/block/md4/md/
neemlark:/sys/block/md4/md# cat stripe_cache_active
255 0 preread bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active
255 0 preread bitlist=0 delaylist=255
neemlark:/sys/block/md4/md# cat stripe_cache_active
255 0 preread bitlist=0 delaylist=255

Thanks. This narrows it down quite a bit... too much in fact: I can now say for sure that this cannot possibly happen :-) heheh. fwiw the box has traditionally been rock solid... it's ancient though... dual p3 750 w/440bx chipset and pc100 ecc memory... 3ware 7508 w/seagate 400GB disks... i really don't suspect the hardware all that much because the freeze seems to be rather consistent as to time of day (overnight while i've got 3x rdiff-backup, plus bittorrent, plus updatedb going). unfortunately it doesn't happen every time... but every time i've unstuck the box i've noticed those processes going. other tidbits... md4 is an lvm2 PV ... there are two LVs, one with ext3 and one with xfs. Two things that might be helpful: 1/ Do you have any other patches on 2.6.16.18 other than the 3 I sent you? If you do I'd like to see them, just in case. it was just 2.6.16.18 plus the 3 you sent... i attached the .config (it's rather full -- based off the debian kernel .config). maybe there's a compiler bug: gcc version 4.0.4 20060507 (prerelease) (Debian 4.0.3-3) 2/ The message.gz you sent earlier with the echo t > /proc/sysrq-trigger trace in it didn't contain information about md4_raid5 - the controlling thread for that array. It must have missed out due to a buffer overflowing. Next time it happens, could you get this trace again and see if you can find out what md4_raid5 is doing. Maybe do the 'echo t' several times. I think that you need a kernel recompile to make the dmesg buffer larger. ok i'll set CONFIG_LOG_BUF_SHIFT=18 and rebuild ... note that i'm going to include two more patches in this next kernel: http://lkml.org/lkml/2006/5/23/42 http://arctic.org/~dean/patches/linux-2.6.16.5-no-treason.patch the first was the Jens Axboe patch you mentioned here recently (for accounting with i/o barriers)... and the second gets rid of the tcp treason uncloaked messages. Thanks for your patience - this must be very frustrating for you. fortunately i'm the primary user of this box... and the bug doesn't corrupt anything... and i can unstick it easily :) so it's not all that frustrating actually. -dean [attachment: config.gz]
Re: raid5 hang on get_active_stripe
On Thu, 11 May 2006, dean gaudet wrote: On Tue, 14 Mar 2006, Neil Brown wrote: On Monday March 13, [EMAIL PROTECTED] wrote: I just experienced some kind of lockup accessing my 8-drive raid5 (2.6.16-rc4-mm2). The system has been up for 16 days running fine, but now processes that try to read the md device hang. ps tells me they are all sleeping in get_active_stripe. There is nothing in the syslog, and I can read from the individual drives fine with dd. mdadm says the state is active. ... i seem to be running into this as well... it has happened several times in the past three weeks. i attached the kernel log output... it happened again... same system as before... You could try increasing the size of the stripe cache echo 512 > /sys/block/mdX/md/stripe_cache_size (choose an appropriate 'X'). yeah that got things going again -- it took a minute or so maybe, i wasn't paying attention as to how fast things cleared up. i tried 768 this time and it wasn't enough... 1024 did it again... Maybe check the content of /sys/block/mdX/md/stripe_cache_active as well. next time i'll check this before i increase stripe_cache_size... it's 0 now, but the raid5 is working again... here's a sequence of things i did... not sure if it helps:

# cat /sys/block/md4/md/stripe_cache_active
435
# cat /sys/block/md4/md/stripe_cache_size
512
# echo 768 > /sys/block/md4/md/stripe_cache_size
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# cat /sys/block/md4/md/stripe_cache_active
752
# echo 1024 > /sys/block/md4/md/stripe_cache_size
# cat /sys/block/md4/md/stripe_cache_active
927
# cat /sys/block/md4/md/stripe_cache_active
151
# cat /sys/block/md4/md/stripe_cache_active
66
# cat /sys/block/md4/md/stripe_cache_active
2
# cat /sys/block/md4/md/stripe_cache_active
1
# cat /sys/block/md4/md/stripe_cache_active
0
# cat /sys/block/md4/md/stripe_cache_active
3

and it's OK again... except i'm going to lower the stripe_cache_size to 256 again because i'm not sure i want to keep having to double it each freeze :) let me know if you want the task dump output from this one too. -dean
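btw an easy way to sample that counter repeatedly while a hang is in progress (assuming procps' watch is available):

watch -n1 cat /sys/block/md4/md/stripe_cache_active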
proactive raid5 disk replacement success (using bitmap + raid1)
i had a disk in a raid5 which i wanted to clone onto the hot spare... without going offline and without long periods without redundancy. a few folks have discussed using bitmaps and temporary (superblockless) raid1 mappings to do this... i'm not sure anyone has tried / reported success though. this is my success report. setup info:

- kernel version 2.6.16.9 (as packaged by debian)
- mdadm version 2.4.1
- /dev/md4 is the raid5
- /dev/sde1 is the disk in md4 i want to clone from
- /dev/sdh1 is the hot spare from md4, and is the clone target
- /dev/md5 is an unused md device name

here are the exact commands i issued:

mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
mdadm /dev/md4 -r /dev/sdh1
mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
mdadm /dev/md4 --re-add /dev/md5
mdadm /dev/md5 -a /dev/sdh1
... wait a few hours for md5 resync...
mdadm /dev/md4 -f /dev/md5 -r /dev/md5
mdadm --stop /dev/md5
mdadm /dev/md4 --re-add /dev/sdh1
mdadm --zero-superblock /dev/sde1
mdadm /dev/md4 -a /dev/sde1

this sort of thing shouldn't be hard to script :) (see the sketch at the end of this message.) the only times i was without full redundancy was briefly between the -r and --re-add commands... and with bitmap support the raid5 resync for each of those --re-adds was essentially zero. thanks Neil (and others)! -dean p.s. it's absolutely necessary to use --build for the temporary raid1 ... if you use --create mdadm will rightfully tell you it's already a raid component and if you --force it then you'll trash the raid5 superblock and it won't fit into the raid5 any more...
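a rough sketch of what such a script could look like (untested; the variables are placeholders for the devices above, and the wait loop is a crude poll of /proc/mdstat that will also wait out the raid5's own bitmap resyncs):

#!/bin/sh
# clone $FROM onto $SPARE via a temporary superblockless raid1, e.g.
# MD=/dev/md4 FROM=/dev/sde1 SPARE=/dev/sdh1 TMP=/dev/md5
mdadm -Gb internal --bitmap-chunk=1024 $MD
mdadm $MD -r $SPARE                     # pull the hot spare out
mdadm $MD -f $FROM -r $FROM             # fail/remove the source disk
mdadm --build $TMP -ayes --level=1 --raid-devices=2 $FROM missing
mdadm $MD --re-add $TMP                 # raid5 now runs on the raid1
mdadm $TMP -a $SPARE                    # start the clone
while grep -Eq 'resync|recovery' /proc/mdstat; do sleep 60; done
mdadm $MD -f $TMP -r $TMP
mdadm --stop $TMP
mdadm $MD --re-add $SPARE               # clone takes the source's place
mdadm --zero-superblock $FROM
mdadm $MD -a $FROM                      # old disk becomes the new spare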
Re: md/mdadm fails to properly run on 2.6.15 after upgrading from 2.6.11
On Mon, 10 Apr 2006, Marc L. de Bruin wrote: dean gaudet wrote: On Mon, 10 Apr 2006, Marc L. de Bruin wrote: However, all preferred minors are correct, meaning that the output is in sync with what I expected it to be from /etc/mdadm/mdadm.conf. Any other ideas? Just adding /etc/mdadm/mdadm.conf to the initrd does not seem to work, since mdrun seems to ignore it?! it seems to me mdrun /dev is about the worst thing possible to use in an initrd. :-) I guess I'll have to change to yaird asap then. I can't think of any other solid solution... yeah i've been using yaird... it's not perfect -- take a look at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=351183 for a patch i use to improve the ability of a yaird initrd to boot when you've moved devices or a device has failed. -dean
forcing a read on a known bad block
hey Neil... i've been wanting to test out the reconstruct-on-read-error code... and i've had two chances to do so, but haven't been able to force md to read the appropriate block to trigger the code. i had two disks with SMART Current_Pending_Sector > 0 (which indicates a pending read error) and i did SMART long self-tests to find out where the bad block was (it should show the LBA in the SMART error log)... one disk was in a raid1 -- and so it was kind of random which of the two disks would be read from if i tried to seek to that LBA and read... in theory with O_DIRECT i should have been able to randomly get the right disk, but that seems a bit clunky. unfortunately i didn't think of the O_DIRECT trick until after i'd given up and decided to just resync the whole disk proactively. the other disk was in a raid5 ... 5 disk raid5, so 20% chance of the bad block being in parity. i copied the kernel code to be sure, and sure enough the bad block was in parity... just bad luck :) so i can't force a read there any way that i know of... anyhow this made me wonder if there's some other existing trick to force such reads/reconstructions to occur... or perhaps this might be a useful future feature. on the raid5 disk i actually tried reading the LBA directly from the component device and it didn't trigger the read error, so now i'm a bit skeptical of the SMART log and/or my computation of the seek offset in the partition... but the above question is still interesting. -dean
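fwiw the O_DIRECT trick can be done with plain GNU dd (a sketch only -- BAD_LBA and PART_START are placeholders for the LBA from the SMART error log and the partition's starting sector, and it assumes a dd new enough to have iflag=direct):

# bypass the page cache so the read really hits a disk; repeat a few
# times -- raid1 read balancing should eventually pick each mirror
dd if=/dev/md0 iflag=direct bs=512 skip=$((BAD_LBA - PART_START)) count=8 of=/dev/null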
Re: md/mdadm fails to properly run on 2.6.15 after upgrading from 2.6.11
On Mon, 10 Apr 2006, Marc L. de Bruin wrote: dean gaudet wrote: initramfs-tools generates an mdrun /dev which starts all the raids it can find... but does not include the mdadm.conf in the initrd so i'm not sure it will necessarily start them in the right minor devices. try doing an mdadm --examine /dev/xxx on some of your partitions to see if the preferred minor is what you expect it to be... [EMAIL PROTECTED]:~# sudo mdadm --examine /dev/md[01234] try running it on /dev/sda1 or whatever the component devices are for your array... not on the array devices. -dean
Re: md/mdadm fails to properly run on 2.6.15 after upgrading from 2.6.11
On Mon, 10 Apr 2006, Marc L. de Bruin wrote: However, all preferred minors are correct, meaning that the output is in sync with what I expected it to be from /etc/mdadm/mdadm.conf. Any other ideas? Just adding /etc/mdadm/mdadm.conf to the initrd does not seem to work, since mdrun seems to ignore it?! yeah it looks like mdrun /dev just seems to assign things in the order they're discovered without consulting the preferred minor. it seems to me mdrun /dev is about the worst thing possible to use in an initrd. i opened a bug yesterday http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=361674 ... it really seems they should stop using mdrun entirely... when i get a chance i'll try updating the bug (or go ahead and add your own experiences to it). oh hey take a look at this bug for the debian mdadm package http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=354705 ... he intends to deprecate mdrun. -dean
Re: md/mdadm fails to properly run on 2.6.15 after upgrading from 2.6.11
On Sun, 9 Apr 2006, Marc L. de Bruin wrote: ... Okay, just pressing Control-D continues the boot process and AFAIK the root filesystem actually isn't corrupt. Running e2fsck returns no errors and booting 2.6.11 works just fine, but I have no clue why it picked the wrong partitions to build md[01234]. What could have happened here? i didn't know sarge had 2.6.11 or 2.6.15 packages... but i'm going to assume you've installed one of initramfs-tools or yaird in order to use the unstable 2.6.11 or 2.6.15 packages... so my comments might not apply. initramfs-tools generates an mdrun /dev which starts all the raids it can find... but does not include the mdadm.conf in the initrd so i'm not sure it will necessarily start them in the right minor devices. try doing an mdadm --examine /dev/xxx on some of your partitions to see if the preferred minor is what you expect it to be... if the preferred minors are wrong there's some mdadm incantation to update them... see the man page. or switch to yaird (you'll have to install yaird and purge initramfs-tools) and dpkg-reconfigure your kernel packages to cause the initrds to be rebuilt. yaird starts only the raid required for the root filesystem, and specifies the correct minor for it. then later after the initrd /etc/init.d/mdadm-raid will start the rest of your raids using your mdadm.conf. -dean
Re: raid5 high cpu usage during reads - oprofile results
On Sat, 1 Apr 2006, Alex Izvorski wrote: Dean - I think I see what you mean, you're looking at this line in the assembly? 65830 16.8830 : c1f: cmp %rcx,0x28(%rax) yup that's the one... that's probably a fair number of cache (or tlb) misses going on right there. I looked at the hash stuff, I think the problem is not that the hash function is poor, but rather that the number of entries in all buckets gets to be pretty high. yeah... your analysis seems more likely. i suppose increasing the number of buckets is the only option. it looks to me like you'd just need to change NR_HASH and the kzalloc in run() in order to increase the number of buckets. i'm guessing there's a good reason for STRIPE_SIZE being 4KiB -- 'cause otherwise it'd be cool to run with STRIPE_SIZE the same as your raid chunksize... which would decrease the number of entries -- much more desirable than increasing the number of buckets. -dean
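for anyone who wants to poke at this, the spots in question are easy to locate (a sketch against a 2.6.16-era tree; exact line numbers will vary):

grep -n 'NR_HASH' drivers/md/raid5.c      # number of hash buckets
grep -n 'kzalloc' drivers/md/raid5.c      # the bucket allocation in run()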
Re: raid5 that used parity for reads only when degraded
On Thu, 23 Mar 2006, Alex Izvorski wrote: Also the cpu load is measured with Andrew Morton's cyclesoak tool which I believe to be quite accurate. there's something cyclesoak does which i'm not sure i agree with: the cyclesoak process dirties an array of 100 bytes... so what you're really getting is some sort of composite measurement of memory system utilisation and cpu cycle availability. i think that 1MB number was chosen before 1MiB caches were common... and what you get during calibration is an L2 cache-hot loop, but i'm not sure that's an important number. i'd look at what happens if you increase cyclesoak.c busyloop_size to 8MB ... and decrease it to 128. the two extremes are going to weight the cpu load towards measuring available memory system bandwidth and available cpu cycles. also for calibration consider using a larger -p n ... especially if you've got any cpufreq/powernowd setup which is varying your clock rates... you want to be sure that it's calibrated (and measured) at a fixed clock rate. -dean
Re: naming of md devices
On Thu, 23 Mar 2006, Nix wrote: Last I heard the Debian initramfs constructs RAID arrays by explicitly specifying the devices that make them up. This is, um, a bad idea: the first time a disk fails or your kernel renumbers them you're in *trouble*. yaird seems to dtrt ... at least in unstable. if you install yaird instead of initramfs-tools you get stuff like this in the initrd /init:

mknod /dev/md3 b 9 3
mdadm -Ac partitions /dev/md3 --uuid 2b3a5b77:c7b4ab81:a2b8322a:db5c4e88

initramfs-tools also appears to do something which should work... but i haven't tested it... it basically runs mdrun /dev without specifying a minor/uuid for the root, so it'll start all arrays... i'm afraid that might mess up for one of my arrays which is auto=mdp... and mdrun has the annoying property of starting arrays on disks you've moved from other systems. so anyhow i lean towards yaird at the moment... (and i should submit some bug reports i guess). the above is on unstable... i don't use stable (and stable definitely does the wrong thing -- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=338200). -dean
Re: how to clone a disk
On Sat, 11 Mar 2006, Ming Zhang wrote: On Sat, 2006-03-11 at 06:53 -0500, Paul M. wrote: Since it's raid5 you would be fine just pulling the disk out and letting the raid driver rebuild the array. If you have a hot spare yes, rebuilding is the simplest way. but a rebuild will need to read all the other disks and write to the new disk, and when serving some io at the same time the rebuild speed is not much. but if i do a dd clone and plug it back, the total traffic is copying one disk, which can be done very fast as a fully sequential workload. with that bitmap feature, the resync work after plugging back is minor. so the one-disk-failure window is pretty small here, right? you're planning to do this while the array is online? that's not safe... unless it's a read-only array... if you've got a bitmap then one thing you *could* do is stop the array temporarily, and copy the bitmap first, then restart the array... then copy the rest of the disk minus the bitmap. you basically need an atomic copy of the bitmap from before you start the ddrescue... and you need to use that copy of the bitmap when you reassemble the array with the new disk. or you could stop the raid5, and make a raid1 (legacy style, without raid superblock) of the dying disk and the new disk... then reassemble the raid5 using the raid1 for the one component... then restart the raid5. regardless of which method you use you're going to need to take the array offline at least once to reassemble it with the duplicated disk in place of the dying disk... i think i'd be tempted to do the raid1 method ... because that one requires you go offline at most once -- after the raid1 syncs you can fail out the dying drive and leave the raid1 around degraded until some future system maintenance event where you can reassemble without it. (a reboot would automagically make it disappear too -- because it wouldn't have a raid1 superblock anyhow). -dean
Re: how to clone a disk
On Sat, 11 Mar 2006, Ming Zhang wrote: On Sat, 2006-03-11 at 16:15 -0800, dean gaudet wrote: you're planning to do this while the array is online? that's not safe... unless it's a read-only array... what i plan to do is to pull out the disk (which is ok now but going to die), so the raid5 will degrade with 1 failed disk and no spare here, then do ddrescue to a new disk which will have the same uuid and everything, then put it back, and the bitmap will shine here, right? the raid5 is still online while that disk is not part of it, with no disk io on it at all, so i do not think i need an atomic operation here. if you fail the disk from the array, or boot without the failing disk, then the event counter in the other superblocks will be updated... and the removed/failed disk will no longer be considered an up to date component... so after doing the ddrescue you'd need to reassemble the raid5. i'm not sure you can convince md to use the bitmap in this case -- i'm just not familiar enough with it. this raid5 over raid1 way sounds interesting. worth trying. let us know how it goes :) i've considered doing this a few times myself... but i've been too conservative and just taken the system down to single user to do the ddrescue with the raid offline entirely. -dean
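the event-count skew is easy to see with --examine (a sketch; the device names are just examples):

mdadm --examine /dev/sda1 | grep -i events
mdadm --examine /dev/sdc1 | grep -i events

if the counts differ and there's no bitmap, expect a full resync when the disk goes back in.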
Re: how to clone a disk
On Sat, 11 Mar 2006, Ming Zhang wrote: On Sat, 2006-03-11 at 16:31 -0800, dean gaudet wrote: if you fail the disk from the array, or boot without the failing disk, then the event counter in the other superblocks will be updated... and the removed/failed disk will no longer be considered an up to date component... so after doing the ddrescue you'd need to reassemble the raid5. i'm not sure you can convince md to use the bitmap in this case -- i'm just not familiar enough with it. i am a little confused here. then what is the purpose of that bitmap? isn't the bitmap for a component that is temporarily out of place and thus a bit out of sync? hmm... yeah i suppose that is the purpose of the bitmap... i haven't used bitmaps yet though... so i don't know which types of events they protect against. in theory what you want to do sounds like it should work, but i'd experiment somewhere safe first. -dean
Re: Auto-assembling arrays using mdadm
On Thu, 9 Mar 2006, Sean Puttergill wrote: This is the kind of functionality provided by kernel RAID autodetect. You don't have to have any config information provided in advance. The kernel finds and assembles all arrays on disks with RAID autodetect partition type. I want to do the same thing, but with mdadm. you know i've been bitten by that several times... the kernel sees the autodetect partition type of a disk which used to belong in another box (whose old raid superblock i've for various reasons not yet been able to zero)... and brings up a raid minor which conflicts with another array already present in the system... which causes device renaming to occur and can mess up the boot. (although if i'm using UUIDs or labels for mounting the filesystems it can almost work -- there are still a few cases where those don't help... such as an xfs external log partition.) i suppose what i suggested doesn't do what you want... but i prefer not using kernel autoassembly these days because of the above problem. the 1.12 man page has more examples which can help you...

echo 'DEVICE partitions' > tmp.mdadm.conf
mdadm --examine --scan --config=tmp.mdadm.conf >> tmp.mdadm.conf
mdadm --assemble --scan --config=tmp.mdadm.conf

i think that'll do what you want... -dean
Re: NVRAM support
On Fri, 10 Feb 2006, Bill Davidsen wrote: Erik Mouw wrote: You could use it for an external journal, or you could use it as a swap device. Let me concur, I used external journal on SSD a decade ago with jfs (AIX). If you do a lot of operations which generate journal entries, file create, delete, etc, then it will double your performance in some cases. Otherwise it really doesn't help much, use as a swap device might be more helpful depending on your config. it doesn't seem to make any sense at all to use a non-volatile external memory for swap... swap has no purpose past a power outage. -dean
Re: Raid5 Debian Yaird Woes
On Sun, 5 Feb 2006, Lewis Shobbrook wrote: On Saturday 04 February 2006 11:22 am, you wrote: On Sat, 4 Feb 2006, Lewis Shobbrook wrote: Is there any way to avoid this requirement for input, so that the system skips the missing drive as the raid/initrd system did previously? what boot errors are you getting before it drops you to the root password prompt? Basically it just states waiting X seconds for /dev/sdx3 (corresponding to the missing raid5 member). Where X cycles from 2,4,8,16 and then drops you into a recovery console, no root pwd prompt. It will only occur if the partition is completely missing, such as a replacement disk with a blank partition table, or a completely missing/failed drive. is it trying to fsck some filesystem it doesn't have access to? No fsck seen for bad extX partitions etc. try something like this...

cd /tmp
mkdir t
cd t
zcat /boot/initrd.img-`uname -r` | cpio -i
grep -r sd.3 .

that should show us what script is directly accessing /dev/sdx3 ... maybe there's something more we can do about it. i did find a possible deficiency with the patch i posted... looking more closely at my yaird /init i see this:

mkbdev '/dev/sdb' 'sdb'
mkbdev '/dev/sdb4' 'sdb/sdb4'
mkbdev '/dev/sda' 'sda'
mkbdev '/dev/sda4' 'sda/sda4'

and i think that means that mdadm -Ac partitions will fail if one of my root disks ends up somewhere other than sda or sdb... because the device nodes won't exist. i suspect i should update the patch to use mdrun instead of mdadm -Ac partitions... because mdrun will create temporary device nodes for everything in /proc/partitions in order to find all the possible raid pieces. -dean
Re: Raid5 Debian Yaird Woes
On Sat, 4 Feb 2006, Lewis Shobbrook wrote: Is there any way to avoid this requirement for input, so that the system skips the missing drive as the raid/initrd system did previously? what boot errors are you getting before it drops you to the root password prompt? is it trying to fsck some filesystem it doesn't have access to? -dean
Re: Raid5 Debian Yaird Woes
i've never looked at yaird in detail -- but you can probably use initramfs-tools instead of yaird... the deb 2.6.14 and later kernels will use whichever one of those is installed. i know that initramfs-tools uses mdrun to start the root partition based on its UUID -- and so it should work fine (to get root mounted) even without dorking around with mdadm.conf. but if you want to stick with yaird: On Fri, 3 Feb 2006, Lewis Shobbrook wrote: My mdadm.conf (I never needed to use at all previous to the yaird system) is as follows...

ARRAY /dev/md0 level=raid1 num-devices=3 devices=/dev/sda2,/dev/sdb2,/dev/sdc2 auto=yes
ARRAY /dev/md1 level=raid5 num-devices=3 auto=yes UUID=a3452240:a1578a31:737679af:58f53690
DEVICE partitions

some wrapping occurred there i'm guessing... you might be a lot happier if your /dev/md0 also specified the UUID rather than the individual devices. this is probably the source of your troubles. you can get the UUID by doing mdadm --examine /dev/sda2. or you can try: mdadm --examine --scan --brief ... just prepend DEVICE partitions in front of that and you should be happy. -dean
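i.e. something like this (a sketch -- writing to a scratch file first so you can eyeball it before replacing your real config):

echo 'DEVICE partitions' > /tmp/mdadm.conf.new
mdadm --examine --scan --brief >> /tmp/mdadm.conf.new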
Re: Raid5 Debian Yaird Woes
On Thu, 2 Feb 2006, dean gaudet wrote: i've never looked at yaird in detail -- but you can probably use initramfs-tools instead of yaird... i take it all back... i just tried initramfs-tools and it failed to boot my system properly... whereas yaird almost got everything right. the main thing i'd say yaird is doing wrong is that it is specifying the root raid devices explicitly rather than allowing mdadm to scan the partitions list and assemble by UUID... maybe try the patch below on your yaird configuration and then run: dpkg-reconfigure linux-image-`uname -r` which will rebuild your initrd with this change... then see if it survives your boot testing. -dean p.s. this patch has been submitted to debian bugdb...

--- /etc/yaird/Templates.cfg	2006/02/03 02:44:49	1.1
+++ /etc/yaird/Templates.cfg	2006/02/03 02:46:15
@@ -299,8 +299,7 @@
 	SCRIPT /init
 	BEGIN
 		!mknod <TMPL_VAR NAME=target> b <TMPL_VAR NAME=major> <TMPL_VAR NAME=minor>
-		!mdadm --assemble <TMPL_VAR NAME=target> --uuid <TMPL_VAR NAME=uuid> \
-		!	<TMPL_LOOP NAME=components> <TMPL_VAR NAME=dev></TMPL_LOOP>
+		!mdadm -Ac partitions <TMPL_VAR NAME=target> --uuid <TMPL_VAR NAME=uuid>
 	END SCRIPT
 END TEMPLATE
Re: Configuring combination of RAID-1 RAID-5
On Tue, 31 Jan 2006, Enrique Garcia Briones wrote: I have read the setting-up for the raid-5 and 1, but I would like to ask you if I can set-up a combined RAID configuration as mentioned above, since all the examples I found up to now just talk of one RAID configuration you can have more than one /dev/mdN device no problem... if you've got a raid1 root disk setup on a box with raid5/6, and you want to use a journalling filesystem on the raid5/6 you should seriously consider saving space on the root disk for an external journal for the raid5/6 filesystem. it really helps metadata-heavy operations to offload the journal writes to the raid1 spindles. -dean
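for example (a sketch only -- the device names here are made up; ext3 wants the journal device and the filesystem to share a block size, and xfs wants the logdev option at both mkfs and mount time):

mke2fs -b 4096 -O journal_dev /dev/md1      # journal device on the raid1
mkfs.ext3 -b 4096 -J device=/dev/md1 /dev/md4
# or, for xfs:
mkfs.xfs -l logdev=/dev/md2 /dev/md4
mount -o logdev=/dev/md2 /dev/md4 /data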
Re: Updating superblock to reflect new disc locations
On Thu, 12 Jan 2006, Neil Brown wrote: On Wednesday January 11, [EMAIL PROTECTED] wrote: Any suggestions would be greatly appreciated. The system's new and not yet in production, so I can reinstall it if I have to, but I'd prefer to be able to fix something as simple as this. Debian's installer - the mkinitrd part in particular - is broken. If you look in the initrd (it is just a compressed CPIO file) you will find the mdadm command used to start the root array explicitly lists the devices to use. As soon as you change the devices, it stops working :-( Someone should tell them about uuids. actually i did tell them http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=338200 ... however (a) i sent the bug report well after the freeze point for the last stable release and (b) i think initrd-tools is on its way out... folks in the unstable branch are forced to use yaird or initramfs-tools for kernels 2.6.14 and beyond. so i'm guessing the bug report will go nowhere -- but i haven't looked at either of those tools yet to see if they handle md root well. I think you can probably fix your situation by: booting up with the degraded array, hot-adding the missing device and waiting for it to rebuild, then rerunning mkinitrd. The last bit might require some learning on your part. I don't know if Debian's mkinitrd requires any command line args, or where it puts the result, or whether you have to tell lilo/grub about the new files. basically run something like this: dpkg-reconfigure linux-image-`uname -r` (or possibly kernel-image-`uname -r` if you're on the unstable branch prior to the renaming of the kernel packages.) you really want to do that for all of your linux-image-* or kernel-image-* packages so they all get the new root locations. another option without the rebuild is to boot a live-cd like knoppix and then hand-edit and repack the cramfs image from /boot... there's a script in the root or otherwise easy to find which has the mdadm command for assembling your root partition. i'll skip the details though... i'd rather not try to get them all correct in a quick email. -dean
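for the record, the rough shape of that by-hand repack (an untested sketch; the kernel version and paths are examples only):

mkdir /tmp/ird /mnt/cramfs
mount -o ro,loop /boot/initrd.img-2.6.15 /mnt/cramfs
cp -a /mnt/cramfs/. /tmp/ird
umount /mnt/cramfs
vi /tmp/ird/script                  # fix the mdadm assemble line here
mkcramfs /tmp/ird /boot/initrd.img-2.6.15.new

and keep the old initrd plus a bootloader entry for it, in case the edit goes wrong.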
Re: Journal-guided Resynchronization for Software RAID
On Mon, 5 Dec 2005, Neil Brown wrote: One of these with built in xor and raid6 would be nice, but I'm not sure I could guarantee a big enough market for them to try convincing them to make one... i wonder if the areca cards http://www.areca.com.tw/ are re-programmable... they seem to have all the hardware you're looking for. -dean
Re: Journal-guided Resynchronization for Software RAID
On Thu, 1 Dec 2005, Neil Brown wrote: What I would really like is a cheap (Well, not too expensive) board that had at least 100Meg of NVRAM which was addressable on the PCI buss, and an XOR and RAID-6 engine connected to the DMA engine. there's the mythical giga-byte i-ram ... i say mythical because i've seen lots of reviews but haven't been able to find it for sale: http://www.giga-byte.com/Peripherals/Products/Products_GC-RAMDISK%20(Rev%201.1).htm the only problem with the i-ram is the lack of ecc (it could be implemented in a software layer though). umem.com have what look like excellent boards but they seem unwilling to sell in small quantities... http://umem.com/Umem_NVRAM_Cards.html -dean
Re: Still Need Help on mdadm and udev
On Thu, 10 Nov 2005, Bill Davidsen wrote: I haven't had a good use for a partitionable device i've used it to have root, swap, and some external xfs/ext3 logs on a single raid1... (the xfs/ext3 logs are for filesystems on another raid5) rather than managing 4 or 5 separate raid1s on the same 2 disks. -dean
Re: s/w raid and bios renumbering HDs
On Mon, 31 Oct 2005, Hari Bhaskaran wrote: Hi, I am trying to set up a RAID-1 setup for the boot/root partition. I got the setup working, except what I see with some of my tests leaves me less convinced that it is actually working. My system is debian 3.1 and I am not using the raid-setup options in the debian-installer, I am trying to add raid-1 to an existing system (followed http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html -- 7.4 method 2) fyi there's a debian-specific doc at /usr/share/doc/mdadm/rootraiddoc.97.html which i've always found useful. I have /dev/hda (master on primary) and /dev/hdc (master on secondary) setup as mirrors. I also have a cdrom on /dev/hdd. Now if I disconnect hda and reboot, everything seems to work - except what used to be /dev/hdc comes up as /dev/hda. I know this since the bios does complain that primary disk 0 is missing and I would have expected a missing hda, not a missing hdc. huh i wonder if the bios has tweaked the ide controller to swap the primary/secondary somehow -- probably cuts down on support calls for people who plug things in wrong. there could be a bios option to stop this swapping. Anyways, the software seems to recognize the failed-disk fine if I connect the real hda back. Is this the way it is supposed to work? Can I rely on this? Also what happens when I move on to fancier setups like raid5? the md superblock (at the end of the partition) contains reconstruction information and UUIDs... the device names they end up on are mostly irrelevant if you've got things configured properly. i've moved disks between /dev/hd* and /dev/sd* going from pata controllers to 3ware controllers with no problem. for raids other than the root raid you pretty much want to edit /etc/mdadm/mdadm.conf and make sure it has DEVICE partitions and has ARRAY entries for each of your arrays listing the UUID. you can generate these entries with mdadm --detail --scan (see examples on the man page). you can plug the non-root disks in any way you want and things will still work if you've configured this. the root is the only one which you need to be careful with -- when debian installs your kernel it constructs an initrd which lists the minimum places it will search for the root raid components... for example on one of my boxes:

# mkdir /mnt/cramfs
# mount -o ro,loop /boot/initrd.img-2.6.13-1-686-smp /mnt/cramfs
# cat /mnt/cramfs/script
ROOT=/dev/md3
mdadm -A /dev/md3 -R -u 2b3a5b77:c7b4ab81:a2b8322a:db5c4e88 /dev/sdb4 /dev/sda4
# umount /mnt/cramfs

it's only expecting to look for the root raid components in those two partitions... seems kind of unfortunate really 'cause the script could be configured to look in any partition. in theory you can hand-edit the initrd if you plan to move root disks to another position... you can't mount a cramfs rw, so you need to mount, copy, edit, and run mkcramfs ... and i suggest not deleting your original initrd, and i suggest copy/pasting the /boot/grub/menu.lst entries to give you the option of booting the old initrd or your new made-by-hand one.

title Debian GNU/Linux, kernel 2.6.13.3-vs2.1.0-rc4-RAID-hda
root (hd0,0)
kernel /boot/vmlinuz-2.6.13.3-vs2.1.0-rc4 root=/dev/md0 ro
initrd /boot/initrd.img-2.6.13.3-vs2.1.0-rc4.md0
savedefault
boot

title Debian GNU/Linux, kernel 2.6.13.3-vs2.1.0-rc4-RAID-hdc
root (hd1,0)
kernel /boot/vmlinuz-2.6.13.3-vs2.1.0-rc4 root=/dev/md0 ro
initrd /boot/initrd.img-2.6.13.3-vs2.1.0-rc4.md0
savedefault
boot

i don't think you need both. when your first disk is dead the bios shifts the second disk forward... and hd0 / hd1 refer to bios ordering. i don't have both in my configs, but then i haven't bothered testing booting off the second disk in a long time. (i always have live-cds such as knoppix handy for fixing boot problems.) -dean
Re: s/w raid and bios renumbering HDs
On Mon, 31 Oct 2005, Hari Bhaskaran wrote: So that DEVICE partitions line was really supposed to be there? Hehe... I thought it was just a help message and replaced it with DEVICE /dev/hda1 /dev/hdc1 :) you can use DEVICE /dev/hda1 /dev/hdc1 ... but then mdadm scans will only consider those two partitions... if you use DEVICE partitions it'll look at all detected partitions for the components. it makes it easy when you move disks around to new controllers and their location changes, things will continue to jfw. If I ever end up in a situation with a non-root raid down (say I did --stop), how do I start it back up? (--run seems to give me some errors). Anyways, more rtfm to do. you want --assemble ... the root is the only one which you need to be careful with -- when debian installs your kernel it constructs an initrd which lists the minimum places it will search for the root raid components... for example on one of my boxes:

# mkdir /mnt/cramfs
# mount -o ro,loop /boot/initrd.img-2.6.13-1-686-smp /mnt/cramfs
# cat /mnt/cramfs/script
ROOT=/dev/md3
mdadm -A /dev/md3 -R -u 2b3a5b77:c7b4ab81:a2b8322a:db5c4e88 /dev/sdb4 /dev/sda4
# umount /mnt/cramfs

Did you install yours with raid options in the debian installer? I don't think my initrd image would have all these (I don't have access to the machine now to check) - but I wouldn't think the mkinitrd that I used to create the initrd image would know whether I am using raid or not (I am talking about the mdadm references in your script). Or are you saying you added these yourself? there's really no reason to avoid the debian installer's raid support if you know you want raid, but i haven't used it much. you only need to do initrd edits by hand once if you're converting root to a raid. there are a few steps in the debian doc about this (see Part II. RAID using initrd and grub in /usr/share/doc/mdadm/rootraiddoc.97.html). after that initial change, once you've managed to boot with root on raid, subsequent mkinitrd *should* fill in the details automatically... i.e. every time you upgrade your kernel you get a new initrd, and it should automatically include the root raid setup. -dean
Re: split RAID1 during backups?
On Mon, 24 Oct 2005, Jeff Breidenbach wrote: First of all, if the data is mostly static, rsync might work faster. Any operation that stats the individual files - even to just look at timestamps - takes about two weeks. Therefore it is hard for me to see rsync as a viable solution, even though the data is mostly static. About 400,000 files change between weekly backups. taking a long time to stat individual files makes me wonder if you're suffering from atime updates and O(n) directory lookups... have you tried this (concrete commands are sketched at the end of this message):

- mount -o noatime,nodiratime
- tune2fs -O dir_index (and e2fsck -D) (you need recentish e2fsprogs for this, and i'm pretty sure you want a 2.6.x kernel)

a big hint you're suffering from atime updates is write traffic when your fs is mounted rw, and your static webserver is the only thing running (and your logs go elsewhere)... atime updates are probably the only writes then. try iostat -x 5. a big hint you're suffering from O(n) directory lookups is heaps of system time... (vmstat or top). On Mon, 24 Oct 2005, Brad Campbell wrote:

mount -o remount,ro /dev/md0 /web
mdadm --fail /dev/md0 /dev/sdd1
mdadm --remove /dev/md0 /dev/sdd1
mount -o ro /dev/sdd1 /target
(do backup here)
umount /target
mdadm --add /dev/md0 /dev/sdd1
mount -o remount,rw /dev/md0 /web

the md event counts would be out of sync and unless you're using bitmapped intent logging this would cause a full resync. if the raid wasn't online you could probably use one of the mdadm options to force the two devices to be a sync'd raid1 ... but i'm guessing you wouldn't be able to do it online. other 2.6.x bleeding edge options are to mark one drive as write-mostly so that you have no read traffic competition while doing a backup... or just use the bitmap intent logging and an nbd to add a third, networked, copy of the drive on another machine. -dean
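concretely, the noatime/dir_index suggestion above amounts to something like this (a sketch; it assumes the fs is /dev/md0 mounted on /web, and that you can take it offline briefly for the e2fsck pass):

mount -o remount,noatime,nodiratime /web
tune2fs -O dir_index /dev/md0       # new dirs get htrees from now on
umount /web
e2fsck -fD /dev/md0                 # -D rebuilds existing directories
mount /web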
Re: split RAID1 during backups?
On Mon, 24 Oct 2005, Jeff Breidenbach wrote: Dean, the comment about write-mostly is confusing to me. Let's say I somehow marked one of the component drives write-mostly to quiet it down. How do I get at it? Linux will not let me mount the component partition if md0 is also mounted. Do you think write-mostly or write-behind are likely enough to be magic bullets that I should learn all about them? if one drive is write-mostly, and you remount the filesystem read-only... then no writes should be occurring... and you could dd from the component drive directly and get a consistent fs image. (i'm assuming you can remount the filesystem read-only for the duration of the backup because it sounds like that's how you do it now; and i'm assuming you're happy enough with your dd_rescue image...) myself i've been considering a related problem... i don't trust LVM/DM snapshots in 2.6.x yet, and i've been holding back a 2.4.x box waiting for them to stabilise... but that seems to be taking a long time. the box happens to have a 3-way raid1 anyhow, and 2.6.x bitmapped intent logging would give me a great snapshot backup option: just break off one disk during the backup and put it back in the mirror when done. there's probably one problem with this 3-way approach... i'll need some way to get the fs (ext3) to reach a safe point where no log recovery would be required on the disk i break out of the mirror... because under no circumstances do you want to write on the disk while it's outside the mirror. (LVM snapshotting in 2.4.x requires a VFS lock patch which does exactly this when you create a snapshot.) John, I'm using 4KB blocks in reiserfs with tail packing. i didn't realise you were using reiserfs... i'd suggest disabling tail packing... but then i've never used reiser, and i've only ever seen reports of tail packing having serious performance impact. you're really only saving yourself an average of half a block per inode... maybe try a smaller block size if the disk space is an issue due to lots of inodes. -dean
Re: RAID6 Query
On Tue, 16 Aug 2005, Colonel Hell wrote: I just went thru a couple of papers describing RAID6. I dunno how relevant this discussion grp is for the qry ...but here I go :) ... I couldn't figure out why the P+Q configuration is better than P+Q' where Q' == P. What I mean is instead of calculating a new checksum (thru a lot of GF theory etc) just store the parity block (P) again. In this case as well don't we have the same amount of fault tolerance? :-s ... this is no better than raid5 at surviving a two disk failure. i.e. consider the case of two data blocks missing -- you can't reconstruct if all you have is parity. -dean
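to make that concrete with a two-data-disk toy example (notation mine, not from the original post): with data blocks D1, D2 and P = D1 xor D2, storing Q' = P just records the equation D1 xor D2 = P twice. lose D1 and D2 and you're left with one equation in two unknowns -- unsolvable. raid6's Q instead uses distinct per-disk coefficients over GF(2^8), roughly Q = g^0*D1 xor g^1*D2, so P and Q are linearly independent and any two missing blocks can be recovered.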
Re: SOLVED: forcing boot ordering of multilevel RAID arrays
On Sun, 7 Aug 2005, Trevor Cordes wrote: Any array that is a superset of other arrays (a multilevel array) must be set to non-autodetect. Use fdisk to change the partition type to 83 (standard linux), NOT fd (linux raid autodetect). you know i'd be worried that setting it to 0x83 will cause trouble now and then with tools assuming it's really a filesystem... personally i've used 0xDA (Non-FS data) for such things in the past. i didn't really see any type more appropriate... i'm hoping no tool assumes anything about 0xDA. -dean
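e.g. with util-linux sfdisk (a sketch; the disk and partition number are examples):

sfdisk --id /dev/sda 2 da    # set /dev/sda2 to type 0xDA (Non-FS data)
sfdisk --id /dev/sda 2       # with no Id argument it just prints the type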
2.4.30: bug in file md.c, line 2473
i got the following bug from 2.4.30 while trying to hot add a device tonight... i was trying to replace a disk in a 3-way raid1 -- the existing disks are sda, sdb, and i was replacing sdc. each of these disks has 3 partitions, each with a raid1. due to an improper shutdown the raids were being sync'd... specifically:

Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 sdb1[2] sda1[0]
      7823552 blocks [3/2] [U_U]
      resync=DELAYED
md1 : active raid1 sdb2[2] sda2[0]
      3911744 blocks [3/2] [U_U]
      resync=DELAYED
md2 : active raid1 sdb3[1] sda3[0]
      108318144 blocks [3/2] [UU_]
      [==>..................]  resync = 13.3% (14415248/108318144) finish=59.3min speed=26366K/sec
md3 : active raid1 sde1[1] sdd1[0]
      199141632 blocks [3/2] [UU_]
      [=>...................]  resync =  6.0% (12045888/199141632) finish=145.1min speed=21480K/sec

and then i tried:

# mdadm /dev/md1 -a /dev/sdc2
mdadm: hot add failed for /dev/sdc2: Invalid argument

and the dmesg bug output is pasted below. let me know if there's more info you'd like ... or if i did something dumb :) thanks -dean

md: trying to hot-add sdc2 to md1 ...
md: bind<sdc2,3>
md: bug in file md.c, line 2473
md: **********************************
md: * <COMPLETE RAID STATE PRINTOUT> *
md: **********************************
md0: <sdb1><sda1> array superblock:
md: SB: (V:0.90.0) ID:<54a41317.8d4ba606.7dd1ac2d.43c65883> CT:42621381
md: L1 S07823552 ND:4 RD:3 md0 LO:0 CS:0
md: UT:426601d8 ST:0 AD:2 WD:2 FD:2 SD:0 CSUM:d1f282a0 E:0024
     D  0:  DISK<N:0,sda1(8,1),R:0,S:6>
     D  1:  DISK<N:1,[dev 00:00](0,0),R:1,S:9>
     D  2:  DISK<N:2,sdb1(8,17),R:2,S:6>
     D  3:  DISK<N:3,[dev 00:00](0,0),R:3,S:1>
     D  4:  DISK<N:4,[dev 00:00](0,0),R:4,S:1>
md: THIS: DISK<N:2,sdb1(8,17),R:2,S:6>
md: rdev sdb1: O:sdb1, SZ:07823552 F:0 DN:2
<6>md: rdev superblock:
md: SB: (V:0.90.0) ID:<54a41317.8d4ba606.7dd1ac2d.43c65883> CT:42621381
md: L1 S07823552 ND:4 RD:3 md0 LO:0 CS:0
md: UT:426601d8 ST:0 AD:2 WD:2 FD:2 SD:0 CSUM:d1f28438 E:0024
     D  0:  DISK<N:0,sda1(8,1),R:0,S:6>
     D  1:  DISK<N:1,[dev 00:00](0,0),R:1,S:9>
     D  2:  DISK<N:2,sdb1(8,17),R:2,S:6>
     D  3:  DISK<N:3,[dev 00:00](0,0),R:3,S:1>
     D  4:  DISK<N:4,[dev 00:00](0,0),R:4,S:1>
md: THIS: DISK<N:2,sdb1(8,17),R:2,S:6>
md: rdev sda1: O:sda1, SZ:07823552 F:0 DN:0
<6>md: rdev superblock:
md: SB: (V:0.90.0) ID:<54a41317.8d4ba606.7dd1ac2d.43c65883> CT:42621381
md: L1 S07823552 ND:4 RD:3 md0 LO:0 CS:0
md: UT:426601d8 ST:0 AD:2 WD:2 FD:2 SD:0 CSUM:d1f28424 E:0024
     D  0:  DISK<N:0,sda1(8,1),R:0,S:6>
     D  1:  DISK<N:1,[dev 00:00](0,0),R:1,S:9>
     D  2:  DISK<N:2,sdb1(8,17),R:2,S:6>
     D  3:  DISK<N:3,[dev 00:00](0,0),R:3,S:1>
     D  4:  DISK<N:4,[dev 00:00](0,0),R:4,S:1>
md: THIS: DISK<N:0,sda1(8,1),R:0,S:6>
md1: <sdc2><sdb2><sda2> array superblock:
md: SB: (V:0.90.0) ID:<94d6f5e3.ce487a3a.fb358d6f.375e923c> CT:4262138d
md: L1 S03911744 ND:4 RD:3 md1 LO:0 CS:0
md: UT:426601d8 ST:0 AD:2 WD:2 FD:2 SD:0 CSUM:c3e2a424 E:001c
     D  0:  DISK<N:0,sda2(8,2),R:0,S:6>
     D  1:  DISK<N:1,[dev 00:00](0,0),R:1,S:9>
     D  2:  DISK<N:2,sdb2(8,18),R:2,S:6>
     D  3:  DISK<N:3,sdc2(8,34),R:3,S:1>
     D  4:  DISK<N:4,[dev 00:00](0,0),R:4,S:1>
md: THIS: DISK<N:2,sdb2(8,18),R:2,S:6>
md: rdev sdc2: O:sdc2, SZ:03911744 F:0 DN:-1
<6>md: rdev superblock:
md: SB: (V:1.-150761216.0) ID:.f6b49250.f6ab2d80. CT:
md: L-156056320 S-156594432 ND:0 RD:-156573696 md0 LO:0 CS:0
md: UT:f6ab2a80 ST:0 AD:-156573696 WD:0 FD:0 SD:-155937712 CSUM: E:
     D 20:  DISK<N:483,[dev e3:e3](4194787,8389091),R:12583395,S:16777699>
     D 21:  DISK<N:134218211,[dev e3:e3](138412515,142606819),R:146801123,S:150995427>
     D 22:  DISK<N:268435939,[dev e3:e3](272630243,276824547),R:281018851,S:285213155>
     D 23:  DISK<N:402653667,[dev e3:e3](406847971,411042275),R:415236579,S:419430883>
     D 24:  DISK<N:536871395,[dev e3:e3](541065699,545260003),R:549454307,S:553648611>
     D 25:  DISK<N:671089123,[dev e3:e3](675283427,679477731),R:683672035,S:687866339>
     D 26:  DISK<N:805306851,[dev e3:e3](809501155,813695459),R:817889763,S:822084067>
md: THIS: DISK<N:0,[dev 20:67](0,935010407),R:921989223,S:0>
md: rdev sdb2: O:sdb2, SZ:03911744 F:0 DN:2
<6>md: rdev superblock:
md: SB: (V:0.90.0) ID:<94d6f5e3.ce487a3a.fb358d6f.375e923c> CT:4262138d
md: L1 S03911744 ND:4 RD:3 md1 LO:0 CS:0
md: UT:426601d8 ST:0 AD:2 WD:2 FD:2 SD:0 CSUM:c3e2a5bc E:001c
     D  0:  DISK<N:0,sda2(8,2),R:0,S:6>
     D  1:  DISK<N:1,[dev 00:00](0,0),R:1,S:9>
     D  2:  DISK<N:2,sdb2(8,18),R:2,S:6>
     D  3:  DISK<N:3,[dev 00:00](0,0),R:3,S:1>
     D  4:  DISK<N:4,[dev 00:00](0,0),R:4,S:1>
md: THIS: DISK<N:2,sdb2(8,18),R:2,S:6>
md: rdev sda2: O:sda2, SZ:03911744 F:0 DN:0
<6>md: rdev superblock:
md: SB: (V:0.90.0) ID:<94d6f5e3.ce487a3a.fb358d6f.375e923c> CT:4262138d
md: L1 S03911744 ND:4 RD:3 md1 LO:0 CS:0
md: UT:426601d8 ST:0 AD:2 WD:2 FD:2 SD:0