[PATCH 000 of 2] md: Two more bugfixes.
Following are two bugfixes for md in current kernels. The first is suitable for -stable as it can allow drive errors through to the filesystem wrongly. Both are suitable for 2.6.22.

Thanks, NeilBrown

[PATCH 001 of 2] md: Avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem.
[PATCH 002 of 2] md: Improve the is_mddev_idle test

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 001 of 2] md: Avoid a possibility that a read error can wrongly propagate through md/raid1 to a filesystem.
When a raid1 has only one working drive, we want read errors to propagate up to the filesystem as there is no point failing the last drive in an array. Currently the code performing this check is racy. If a write and a read are both submitted to a device on a 2-drive raid1, and the write fails followed by the read failing, the read will see that there is only one working drive and will pass the failure up, even though the one working drive is actually the *other* one. So, tighten up the locking.

Cc: [EMAIL PROTECTED]
Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid1.c |   33 +++--
 1 file changed, 19 insertions(+), 14 deletions(-)

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c	2007-05-10 15:51:54.0 +1000
+++ ./drivers/md/raid1.c	2007-05-10 15:51:58.0 +1000
@@ -271,21 +271,25 @@ static int raid1_end_read_request(struct
 	 */
 	update_head_pos(mirror, r1_bio);
 
-	if (uptodate || (conf->raid_disks - conf->mddev->degraded) <= 1) {
-		/*
-		 * Set R1BIO_Uptodate in our master bio, so that
-		 * we will return a good error code for to the higher
-		 * levels even if IO on some other mirrored buffer fails.
-		 *
-		 * The 'master' represents the composite IO operation to
-		 * user-side. So if something waits for IO, then it will
-		 * wait for the 'master' bio.
+	if (uptodate)
+		set_bit(R1BIO_Uptodate, &r1_bio->state);
+	else {
+		/* If all other devices have failed, we want to return
+		 * the error upwards rather than fail the last device.
+		 * Here we redefine "uptodate" to mean "Don't want to retry"
 		 */
-		if (uptodate)
-			set_bit(R1BIO_Uptodate, &r1_bio->state);
+		unsigned long flags;
+		spin_lock_irqsave(&conf->device_lock, flags);
+		if (r1_bio->mddev->degraded == conf->raid_disks ||
+		    (r1_bio->mddev->degraded == conf->raid_disks-1 &&
+		     !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
+			uptodate = 1;
+		spin_unlock_irqrestore(&conf->device_lock, flags);
+	}
+	if (uptodate)
 		raid_end_bio_io(r1_bio);
-	} else {
+	else {
 		/*
 		 * oops, read error:
 		 */
@@ -992,13 +996,14 @@ static void error(mddev_t *mddev, mdk_rd
 		unsigned long flags;
 		spin_lock_irqsave(&conf->device_lock, flags);
 		mddev->degraded++;
+		set_bit(Faulty, &rdev->flags);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 		/*
 		 * if recovery is running, make sure it aborts.
 		 */
 		set_bit(MD_RECOVERY_ERR, &mddev->recovery);
-	}
-	set_bit(Faulty, &rdev->flags);
+	} else
+		set_bit(Faulty, &rdev->flags);
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 	printk(KERN_ALERT "raid1: Disk failure on %s, disabling device. \n"
 		"	Operation continuing on %d devices\n",
[PATCH 002 of 2] md: Improve the is_mddev_idle test
During a 'resync' or similar activity, md checks if the devices in the array are otherwise active and winds back resync activity when they are. This test is done in is_mddev_idle, and it is somewhat fragile - it sometimes thinks there is non-sync io when there isn't.

The test compares the total sectors of io (disk_stat_read) with the sectors of resync io (disk->sync_io). This has problems because total sectors gets updated when a request completes, while resync io gets updated when the request is submitted. The time difference can cause large differences between the two which do not actually imply non-resync activity. The test currently allows for some fuzz (+/- 4096) but there are some cases when it is not enough.

The test currently looks for any (non-fuzz) difference, either positive or negative. This clearly is not needed. Any non-sync activity will cause the total sectors to grow faster than the sync_io count (never slower) so we only need to look for a positive difference. If we do this then the amount of in-flight sync io will never cause the appearance of non-sync IO. Once enough non-sync IO to worry about starts happening, resync will be slowed down and the measurements will thus be more precise (as there is less in-flight) and control of resync will still be suitably responsive.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2007-05-10 15:51:54.0 +1000
+++ ./drivers/md/md.c	2007-05-10 16:05:10.0 +1000
@@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
 		 *
 		 * Note: the following is an unsigned comparison.
 		 */
-		if ((curr_events - rdev->last_events + 4096) > 8192) {
+		if ((long)curr_events - (long)rdev->last_events > 4096) {
 			rdev->last_events = curr_events;
 			idle = 0;
 		}
Re: removed disk md-device
On Wednesday May 9, [EMAIL PROTECTED] wrote:

Neil Brown <[EMAIL PROTECTED]> [2007.04.02.0953 +0200]:

Hmmm... this is somewhat awkward. You could argue that udev should be taught to remove the device from the array before removing the device from /dev. But I'm not convinced that you always want to 'fail' the device. It is possible in this case that the array is quiescent and you might like to shut it down without registering a device failure...

Hmm, the kernel advised hotplug to remove the device from /dev, but you don't want to remove it from md? Do you have an example for that case?

Until there is known to be an inconsistency among the devices in an array, you don't want to record that there is. Suppose I have two USB drives with a mounted but quiescent filesystem on a raid1 across them. I pull them both out, one after the other, to take them to my friend's place. I plug them both in and find that the array is degraded, because as soon as I unplugged one, the other was told that it was now the only one. Not good. Best to wait for an IO request that actually returns an error.

Maybe an mdadm command that will do that for a given device, or for all components of a given array if the 'dev' link is 'broken', or even for all devices for all arrays:

  mdadm --fail-unplugged --scan
or
  mdadm --fail-unplugged /dev/md3

Ok, so one could run this as a cron script. Neil, may I ask if you already started to work on this? Since we have the problem on a customer system, we should fix it ASAP, but at least within the next 2 or 3 weeks. If you didn't start work on it yet, I will do...

No, I haven't, but it is getting near the top of my list. If you want a script that does this automatically for every array, something like:

  for a in /sys/block/md*/md/dev-*
  do
    if [ -f $a/block/dev ]
    then : still there
    else
      echo faulty > $a/state
      echo remove > $a/state
    fi
  done

should do what you want. (I haven't tested it though.)

NeilBrown
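[Editorial note: Neil's untested loop can be wrapped into a function that takes a base directory, so it can be dry-run against a mock sysfs tree before touching the real one. The function name and the parameter are additions for illustration; the state-file writes follow his sketch of the md sysfs interface:]

```shell
# fail_unplugged: mark md components whose underlying block device has
# vanished as faulty, then remove them from the array.
# The base directory defaults to the real sysfs path but can be
# overridden for testing against a scratch tree.
fail_unplugged() {
	base=${1:-/sys/block}
	for a in "$base"/md*/md/dev-*; do
		[ -d "$a" ] || continue
		if [ -f "$a/block/dev" ]; then
			: # device node still present: leave it alone
		else
			echo faulty > "$a/state"
			echo remove > "$a/state"
		fi
	done
}
```

A cautious first run would point it at a copy, e.g. `fail_unplugged /tmp/mocksys`, and inspect the state files before using the default /sys/block.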
Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test
On Thu, 10 May 2007 16:22:31 +1000 NeilBrown <[EMAIL PROTECTED]> wrote:

The test currently looks for any (non-fuzz) difference, either positive or negative. This clearly is not needed. Any non-sync activity will cause the total sectors to grow faster than the sync_io count (never slower) so we only need to look for a positive difference.
...
--- .prev/drivers/md/md.c	2007-05-10 15:51:54.0 +1000
+++ ./drivers/md/md.c	2007-05-10 16:05:10.0 +1000
@@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
 		 *
 		 * Note: the following is an unsigned comparison.
 		 */
-		if ((curr_events - rdev->last_events + 4096) > 8192) {
+		if ((long)curr_events - (long)rdev->last_events > 4096) {
 			rdev->last_events = curr_events;
 			idle = 0;

In which case would unsigned counters be more appropriate?
Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test
On Thursday May 10, [EMAIL PROTECTED] wrote:

On Thu, 10 May 2007 16:22:31 +1000 NeilBrown <[EMAIL PROTECTED]> wrote:

The test currently looks for any (non-fuzz) difference, either positive or negative. This clearly is not needed. Any non-sync activity will cause the total sectors to grow faster than the sync_io count (never slower) so we only need to look for a positive difference.
...
--- .prev/drivers/md/md.c	2007-05-10 15:51:54.0 +1000
+++ ./drivers/md/md.c	2007-05-10 16:05:10.0 +1000
@@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
 		 *
 		 * Note: the following is an unsigned comparison.
 		 */
-		if ((curr_events - rdev->last_events + 4096) > 8192) {
+		if ((long)curr_events - (long)rdev->last_events > 4096) {
 			rdev->last_events = curr_events;
 			idle = 0;

In which case would unsigned counters be more appropriate?

I guess. It is really the comparison that I want to be signed; I don't much care about the counters - they are expected to wrap (though they might not). So maybe I really want

  if ((signed long)(curr_events - rdev->last_events) > 4096) {

to make it clear... But people expect numbers to be signed by default, so that probably isn't necessary. Yeah, I'll make them signed one day.

Thanks, NeilBrown
Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test
On May 10 2007 16:22, NeilBrown wrote:

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2007-05-10 15:51:54.0 +1000
+++ ./drivers/md/md.c	2007-05-10 16:05:10.0 +1000
@@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
 		 *
 		 * Note: the following is an unsigned comparison.
 		 */
-		if ((curr_events - rdev->last_events + 4096) > 8192) {
+		if ((long)curr_events - (long)rdev->last_events > 4096) {
 			rdev->last_events = curr_events;
 			idle = 0;
 		}

What did really change? Unless I am seriously mistaken,

  curr_events - last_events + 4096 > 8192

is mathematically equivalent to

  curr_events - last_events > 4096

The casting to (long) may however force a signed comparison, which turns things quite upside down, and the comment does not apply anymore.

Jan
--
Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test
On Thursday May 10, [EMAIL PROTECTED] wrote:

On May 10 2007 16:22, NeilBrown wrote:

-		if ((curr_events - rdev->last_events + 4096) > 8192) {
+		if ((long)curr_events - (long)rdev->last_events > 4096) {
 			rdev->last_events = curr_events;
 			idle = 0;
 		}

What did really change? Unless I am seriously mistaken, curr_events - last_events + 4096 > 8192 is mathematically equivalent to curr_events - last_events > 4096. The casting to (long) may however force a signed comparison, which turns things quite upside down, and the comment does not apply anymore.

Yes, the use of a signed comparison is the significant difference. And yes, the comment becomes wrong. I'm in the process of redrafting that. It currently stands at:

	/* sync IO will cause sync_io to increase before the disk_stats
	 * as sync_io is counted when a request starts, and
	 * disk_stats is counted when it completes.
	 * So resync activity will cause curr_events to be smaller than
	 * when there was no such activity.
	 * non-sync IO will cause disk_stat to increase without
	 * increasing sync_io so curr_events will (eventually)
	 * be larger than it was before.  Once it becomes
	 * substantially larger, the test below will cause
	 * the array to appear non-idle, and resync will slow
	 * down.
	 * If there is a lot of outstanding resync activity when
	 * we set last_event to curr_events, then all that activity
	 * completing might cause the array to appear non-idle
	 * and resync will be slowed down even though there might
	 * not have been non-resync activity.  This will only
	 * happen once though.  'last_events' will soon reflect
	 * the state where there is little or no outstanding
	 * resync requests, and further resync activity will
	 * always make curr_events less than last_events.
	 */

Does that read at all well?

NeilBrown
Re: [PATCH 002 of 2] md: Improve the is_mddev_idle test
On May 10 2007 20:04, Neil Brown wrote:

-		if ((curr_events - rdev->last_events + 4096) > 8192) {
+		if ((long)curr_events - (long)rdev->last_events > 4096) {
 			rdev->last_events = curr_events;
 			idle = 0;
 		}

	/* sync IO will cause sync_io to increase before the disk_stats
	 * as sync_io is counted when a request starts, and
	 * disk_stats is counted when it completes.
	 * So resync activity will cause curr_events to be smaller than
	 * when there was no such activity.
	 * non-sync IO will cause disk_stat to increase without
	 * increasing sync_io so curr_events will (eventually)
	 * be larger than it was before.  Once it becomes
	 * substantially larger, the test below will cause
	 * the array to appear non-idle, and resync will slow
	 * down.
	 * If there is a lot of outstanding resync activity when
	 * we set last_event to curr_events, then all that activity
	 * completing might cause the array to appear non-idle
	 * and resync will be slowed down even though there might
	 * not have been non-resync activity.  This will only
	 * happen once though.  'last_events' will soon reflect
	 * the state where there is little or no outstanding
	 * resync requests, and further resync activity will
	 * always make curr_events less than last_events.
	 */

Does that read at all well?

It is a more verbose explanation of your patch description, yes.

Jan
--
Re: removed disk md-device
Neil Brown wrote:

On Wednesday May 9, [EMAIL PROTECTED] wrote:

Neil Brown <[EMAIL PROTECTED]> [2007.04.02.0953 +0200]:

Hmmm... this is somewhat awkward. You could argue that udev should be taught to remove the device from the array before removing the device from /dev. But I'm not convinced that you always want to 'fail' the device. It is possible in this case that the array is quiescent and you might like to shut it down without registering a device failure...

Hmm, the kernel advised hotplug to remove the device from /dev, but you don't want to remove it from md? Do you have an example for that case?

Until there is known to be an inconsistency among the devices in an array, you don't want to record that there is. Suppose I have two USB drives with a mounted but quiescent filesystem on a raid1 across them. I pull them both out, one after the other, to take them to my friend's place. I plug them both in and find that the array is degraded, because as soon as I unplugged one, the other was told that it was now the only one.

And, in truth, so it was. Who updated the event count though?

Not good. Best to wait for an IO request that actually returns an error.

Ah, now would that be a good time to update the event count? Maybe you should allow drives to be removed even if they aren't faulty or spare? A write to a removed device would mark it faulty in the other devices without waiting for a timeout. But joggling a usb stick (similar to your use case) would probably be OK since it would be hot-removed and then hot-added.

David
RE: [PATCH 00/16] raid acceleration and asynchronous offload api for 2.6.22
Ronen Shitrit wrote:

The resync numbers you sent look very promising :) Do you have any performance numbers that you can share for this set of patches, which show the Rd/Wr IO bandwidth?

I have some simple tests made with hdparm, with results I don't understand. We see hdparm results are fine if we access the whole device:

thecus:~# hdparm -Tt /dev/sdd

/dev/sdd:
 Timing cached reads:   392 MB in 2.00 seconds = 195.71 MB/sec
 Timing buffered disk reads:  146 MB in 3.01 seconds = 48.47 MB/sec

But are 10 times worse (Timing buffered disk reads) when we access partitions:

thecus:/# hdparm -Tt /dev/sdc1 /dev/sdd1

/dev/sdc1:
 Timing cached reads:   396 MB in 2.01 seconds = 197.18 MB/sec
 Timing buffered disk reads:   16 MB in 3.32 seconds = 4.83 MB/sec

/dev/sdd1:
 Timing cached reads:   394 MB in 2.00 seconds = 196.89 MB/sec
 Timing buffered disk reads:   16 MB in 3.13 seconds = 5.11 MB/sec

Why is it so much worse?

I used 2.6.21-iop1 patches from http://sf.net/projects/xscaleiop; right now I use 2.6.17-iop1, for which the results are ~35 MB/s when accessing a device (/dev/sdd) or a partition (/dev/sdd1). In kernel config, I enabled Intel DMA engines. The device I use is a Thecus n4100; it is Platform: IQ31244 (XScale), and has a 600 MHz CPU.

--
Tomasz Chmielewski
http://wpkg.org
Re: Please revert 5b479c91da90eef605f851508744bfe8269591a0 (md partition rescan)
On May 9 2007 18:51, Linus Torvalds wrote:

(But Andrew never saw your email, I suspect: [EMAIL PROTECTED] is probably some strange mixup of Andrew Morton and Andi Kleen in your mind ;)

What do the letters kp stand for?

Jan
--
Re: Please revert 5b479c91da90eef605f851508744bfe8269591a0 (md partition rescan)
On Thu, 2007-05-10 at 16:51 +0200, Jan Engelhardt wrote:

(But Andrew never saw your email, I suspect: [EMAIL PROTECTED] is probably some strange mixup of Andrew Morton and Andi Kleen in your mind ;)

What do the letters kp stand for?

Keep Patching ?
Re: [PATCH 00/16] raid acceleration and asynchronous offload api for 2.6.22
Tomasz Chmielewski schrieb:

Ronen Shitrit wrote:

The resync numbers you sent look very promising :) Do you have any performance numbers that you can share for this set of patches, which show the Rd/Wr IO bandwidth?

I have some simple tests made with hdparm, with results I don't understand. We see hdparm results are fine if we access the whole device ... But are 10 times worse (Timing buffered disk reads) when we access partitions.

There seems to be another side effect when comparing the DMA engine in 2.6.17-iop1 to 2.6.21-iop1: network performance. For simple network tests, I use the netperf tool to measure network performance.

With 2.6.17-iop1 and all DMA offloading options enabled (selectable in System type --> IOP3xx Implementation Options), I get nearly 25 MB/s throughput.

With 2.6.21-iop1 and all DMA offloading options enabled (moved to Device Drivers --> DMA Engine support), I get only about 10 MB/s throughput.

Additionally, on 2.6.21-iop1, I get lots of "dma_cookie < 0" printed by the kernel.

--
Tomasz Chmielewski
http://wpkg.org
Re: Please revert 5b479c91da90eef605f851508744bfe8269591a0 (md partition rescan)
On 5/10/07, Xavier Bestel [EMAIL PROTECTED] wrote:

On Thu, 2007-05-10 at 16:51 +0200, Jan Engelhardt wrote:

(But Andrew never saw your email, I suspect: [EMAIL PROTECTED] is probably some strange mixup of Andrew Morton and Andi Kleen in your mind ;)

What do the letters kp stand for?

Heh ... I've always wanted to know that myself. It's funny, no one seems to have asked that on lkml during all these years (at least none that a Google search would throw up).

Keep Patching ?

Unlikely. akpm seems to be a pre-Linux-kernel nick.
Re: removed disk md-device
On Thursday 10 May 2007 09:12:54 Neil Brown wrote:

On Wednesday May 9, [EMAIL PROTECTED] wrote:

Neil Brown <[EMAIL PROTECTED]> [2007.04.02.0953 +0200]:

Hmmm... this is somewhat awkward. You could argue that udev should be taught to remove the device from the array before removing the device from /dev. But I'm not convinced that you always want to 'fail' the device. It is possible in this case that the array is quiescent and you might like to shut it down without registering a device failure...

Hmm, the kernel advised hotplug to remove the device from /dev, but you don't want to remove it from md? Do you have an example for that case?

Until there is known to be an inconsistency among the devices in an array, you don't want to record that there is. Suppose I have two USB drives with a mounted but quiescent filesystem on a raid1 across them. I pull them both out, one after the other, to take them to my friend's place. I plug them both in and find that the array is degraded, because as soon as I unplugged one, the other was told that it was now the only one. Not good. Best to wait for an IO request that actually returns an error.

Ok, keeping the raid working in this case would be a good idea, so we would need it to be configurable whether it should degrade or not. However, have you tested if pulling and hotplugging the drive works? Actually that's what our customer did. As long as md keeps the old device information, the re-plugged-in device will get another device name (and of course also another major number), so the md-device will still keep the old device information and it will never automagically add the new device. Probably that's even a good idea: how should the md-layer know if it is really the very same device, and even if it would know that, how should it know that no data have been modified on it while it was plugged out?
Maybe an mdadm command that will do that for a given device, or for all components of a given array if the 'dev' link is 'broken', or even for all devices for all arrays:

  mdadm --fail-unplugged --scan
or
  mdadm --fail-unplugged /dev/md3

Ok, so one could run this as a cron script. Neil, may I ask if you already started to work on this? Since we have the problem on a customer system, we should fix it ASAP, but at least within the next 2 or 3 weeks. If you didn't start work on it yet, I will do...

No, I haven't, but it is getting near the top of my list. If you want a script that does this automatically for every array, something like:

I have never looked into the mdadm sources before, but I will try during the weekend (without any promises).

  for a in /sys/block/md*/md/dev-*
  do
    if [ -f $a/block/dev ]
    then : still there
    else
      echo faulty > $a/state
      echo remove > $a/state
    fi
  done

should do what you want. (I haven't tested it though.)

Thanks a lot, we will test that here. Do you propose the same logic for mdadm?

Thanks,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
Re: Chaining sg lists for big I/O commands: Question
On May 9 2007 15:38, Jens Axboe wrote:

I am a mdadm/disk/hard drive fanatic, I was curious:

On i386, we can at most fit 256 scatterlist elements into a page, and on x86-64 we are stuck with 128. So that puts us somewhere between 512kb and 1024kb for a single IO.

How come 32bit is 256 and 64 is only 128? I am sure it is something very fundamental/simple but I was curious, I would think x86_64 would fit/support more scatterlists in a page.

Because of the size of the scatterlist structure. As pointers are bigger on 64-bit archs, the scatterlist structure ends up being bigger. The page size on x86-64 is 4kb, hence the number of structures you can fit in a page is smaller.

I take it this problem goes away on arches with 8KB page_size?

Jan
--
Re: Please revert 5b479c91da90eef605f851508744bfe8269591a0 (md partition rescan)
On Thu, 10 May 2007 16:51:31 +0200 (MEST) Jan Engelhardt [EMAIL PROTECTED] wrote:

On May 9 2007 18:51, Linus Torvalds wrote:

(But Andrew never saw your email, I suspect: [EMAIL PROTECTED] is probably some strange mixup of Andrew Morton and Andi Kleen in your mind ;)

What do the letters kp stand for?

Some say Kernel Programmer. My parents said Keith Paul.
Questions about the speed when MD-RAID array is being initialized.
Hi,

I created a MD-RAID5 array using 8 Maxtor SAS disk drives (chunk size is 256k). I have measured the data transfer speed for a single SAS disk drive (physical drive, not a filesystem on it); it is roughly about 80~90MB/s. However, I notice MD also reports the speed for the RAID5 array when it is being initialized (cat /proc/mdstat). The speed reported by MD is not constant, roughly from 70MB/s to 90MB/s (average is 85MB/s, which is very close to the single disk data transfer speed).

I just have three questions:

1. What is the exact meaning of the array speed reported by MD? Is that measured for the whole array (I used 8 disks) or for just a single underlying disk? If it is for the whole array, then 70~90MB/s seems too low considering 8 disks are used for this array.

2. How is this speed measured and what is the I/O packet size being used when the speed is measured?

3. From the beginning when the MD-RAID5 array is initialized to the end when the initialization is done, the speed reported by MD gradually decreases from 90MB/s down to 70MB/s. Why does the speed change? Why does the speed gradually decrease?

Could anyone give me some explanation? I'm using RHEL 4U4 with a 2.6.18 kernel. MDADM version is 1.6.

Thanks a lot,

Liang
Re: Questions about the speed when MD-RAID array is being initialized.
On Thu, 10 May 2007, Liang Yang wrote:

Hi, I created a MD-RAID5 array using 8 Maxtor SAS disk drives (chunk size is 256k). I have measured the data transfer speed for a single SAS disk drive (physical drive, not a filesystem on it); it is roughly about 80~90MB/s. However, I notice MD also reports the speed for the RAID5 array when it is being initialized (cat /proc/mdstat). The speed reported by MD is not constant, roughly from 70MB/s to 90MB/s (average is 85MB/s, which is very close to the single disk data transfer speed). I just have three questions:

1. What is the exact meaning of the array speed reported by MD? Is that measured for the whole array (I used 8 disks) or for just a single underlying disk? If it is for the whole array, then 70~90MB/s seems too low considering 8 disks are used for this array.

2. How is this speed measured and what is the I/O packet size being used when the speed is measured?

3. From the beginning when the MD-RAID5 array is initialized to the end when the initialization is done, the speed reported by MD gradually decreases from 90MB/s down to 70MB/s. Why does the speed change? Why does the speed gradually decrease?

Could anyone give me some explanation? I'm using RHEL 4U4 with a 2.6.18 kernel. MDADM version is 1.6. Thanks a lot, Liang

For no. 3: because it starts from the fast end of the disk and works its way to the slower part (slower speeds).
Re: Questions about the speed when MD-RAID array is being initialized.
Could you please give me more details about this? What do you mean by the fast end and slow end of the disk? Do you mean the location on each disk platter?

Thanks,

Liang

- Original Message - From: Justin Piszcz [EMAIL PROTECTED] To: Liang Yang [EMAIL PROTECTED] Cc: linux-raid@vger.kernel.org Sent: Thursday, May 10, 2007 2:33 PM Subject: Re: Questions about the speed when MD-RAID array is being initialized.

On Thu, 10 May 2007, Liang Yang wrote:

Hi, I created a MD-RAID5 array using 8 Maxtor SAS disk drives (chunk size is 256k). I have measured the data transfer speed for a single SAS disk drive (physical drive, not a filesystem on it); it is roughly about 80~90MB/s. However, I notice MD also reports the speed for the RAID5 array when it is being initialized (cat /proc/mdstat). The speed reported by MD is not constant, roughly from 70MB/s to 90MB/s (average is 85MB/s, which is very close to the single disk data transfer speed). I just have three questions:

1. What is the exact meaning of the array speed reported by MD? Is that measured for the whole array (I used 8 disks) or for just a single underlying disk? If it is for the whole array, then 70~90MB/s seems too low considering 8 disks are used for this array.

2. How is this speed measured and what is the I/O packet size being used when the speed is measured?

3. From the beginning when the MD-RAID5 array is initialized to the end when the initialization is done, the speed reported by MD gradually decreases from 90MB/s down to 70MB/s. Why does the speed change? Why does the speed gradually decrease?

Could anyone give me some explanation? I'm using RHEL 4U4 with a 2.6.18 kernel. MDADM version is 1.6. Thanks a lot, Liang

For no. 3: because it starts from the fast end of the disk and works its way to the slower part (slower speeds).
Re: Questions about the speed when MD-RAID array is being initialized.
http://partition.radified.com/partitioning_2.htm System and program files that wind up at the far end of the drive take longer to access, and are transferred at a slower rate, which translates into a less-responsive system. If you look at the graph of sustained transfer rates (STRs) from the HD Tach benchmark posted here, you'll see clearly that the outermost sectors of the drive transfer data the fastest. On Thu, 10 May 2007, Liang Yang wrote: Could you please give me more details about this? What do you mean the fast end and slow end part of disk? Do you mean the location in each disk platter? Thanks, Liang - Original Message - From: Justin Piszcz [EMAIL PROTECTED] To: Liang Yang [EMAIL PROTECTED] Cc: linux-raid@vger.kernel.org Sent: Thursday, May 10, 2007 2:33 PM Subject: Re: Questions about the speed when MD-RAID array is being initialized. On Thu, 10 May 2007, Liang Yang wrote: Hi, I created a MD-RAID5 array using 8 Maxtor SAS Disk Drives (chunk size is 256k). I have measured the data transfer speed for single SAS disk drive (physical drive, not filesystem on it), it is roughly about 80~90MB/s. However, I notice MD also reports the speed for the RAID5 array when it is being initialized (cat /proc/mdstat). The speed reported by MD is not constant which is roughly from 70MB/s to 90MB/s (average is 85MB/s which is very close to the single disk data transfer speed). I just have three questions: 1. What is the exact meaning of the array speed reported by MD? Is that mesured for the whole array (I used 8 disks) or for just single underlying disk? If it is for the whole array, then 70~90B/s seems too low considering 8 disks are used for this array. 2. How is this speed measured and what is the I/O packet size being used when the speed is measured? 3. From the beginning when MD-RAID 5 array is initialized to the end when the intialization is done, the speed reports by MD gradually decrease from 90MB/s down to 70MB/s. Why does the speed change? 
>> Why does the speed gradually decrease? Could anyone give me some
>> explanation? I'm using RHEL 4U4 with a 2.6.18 kernel; the mdadm version
>> is 1.6.
>>
>> Thanks a lot,
>>
>> Liang
>
> For no. 3: because it starts at the fast end of the disk and works its
> way to the slower part (slower speeds).
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
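To see the outer-vs-inner-track effect described above, you can time sequential reads at the two ends of a disk yourself. Below is a minimal sketch in Python (a stand-in for the usual `dd ... skip=N` test). The `/dev/sdb` path in the usage comment is a placeholder, and on a real disk the page cache and competing IO will skew the numbers, so treat them as rough:

```python
import os
import time

def read_speed(path, offset, size=64 * 1024 * 1024, block=1024 * 1024):
    """Sequential read throughput (MB/s) at a given byte offset."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        t0 = time.perf_counter()
        remaining = size
        while remaining > 0:
            data = os.read(fd, min(block, remaining))
            if not data:        # hit end of device/file
                break
            remaining -= len(data)
        elapsed = time.perf_counter() - t0
        return (size - remaining) / elapsed / 1e6
    finally:
        os.close(fd)

# Usage on a real, idle disk (run as root; /dev/sdb is a placeholder).
# The page cache will inflate repeat runs -- drop caches first, or use a
# read size larger than RAM:
#
#   fd = os.open("/dev/sdb", os.O_RDONLY)
#   disk_size = os.lseek(fd, 0, os.SEEK_END)
#   os.close(fd)
#   print("outer:", read_speed("/dev/sdb", 0))
#   print("inner:", read_speed("/dev/sdb", disk_size - 64 * 1024 * 1024))
```

On a rotating disk the inner figure will typically be the lower one, which is the same ~90MB/s to ~70MB/s drop the resync shows as it works inward. (On an SSD or a plain file the two numbers should be roughly equal.)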
Re: Chaining sg lists for big I/O commands: Question
On Thu, May 10 2007, Jan Engelhardt wrote:

> On May 9 2007 15:38, Jens Axboe wrote:
>
>>> I am a mdadm/disk/hard drive fanatic, I was curious:
>>>
>>>> On i386, we can at most fit 256 scatterlist elements into a page,
>>>> and on x86-64 we are stuck with 128. So that puts us somewhere
>>>> between 512kb and 1024kb for a single IO.
>>>
>>> How come 32bit is 256 and 64 is only 128? I am sure it is something
>>> very fundamental/simple, but I was curious; I would think x86_64
>>> would fit/support more scatterlists in a page.
>>
>> Because of the size of the scatterlist structure. As pointers are
>> bigger on 64-bit archs, the scatterlist structure ends up being
>> bigger. The page size on x86-64 is still 4kb, hence the number of
>> structures you can fit in a page is smaller.
>
> I take it this problem goes away on arches with 8KB page_size?

Not really, the 8kb page size just doubles the sg size. On a 64-bit
arch, that would still only get you 1mb IO size.

-- 
Jens Axboe
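Jens' numbers can be sanity-checked with a little arithmetic. The sketch below assumes sizeof(struct scatterlist) is 16 bytes on i386 and 32 bytes on x86-64 - values chosen to reproduce the 256/128 entries quoted above, not taken from any particular kernel tree:

```python
# How many scatterlist entries fit in one page, and the resulting
# maximum size of a single I/O, for the two arches in the thread.
PAGE_SIZE = 4096  # bytes, on both i386 and x86-64

# Assumed sizeof(struct scatterlist): an entry holds a page pointer,
# offset, length and a DMA address, so it grows with pointer width.
SG_ENTRY_SIZE = {"i386": 16, "x86-64": 32}

for arch, entry_size in SG_ENTRY_SIZE.items():
    entries_per_page = PAGE_SIZE // entry_size
    # Each scatterlist entry maps at most one page of data, so the
    # largest single I/O is entries * PAGE_SIZE.
    max_io_kb = entries_per_page * PAGE_SIZE // 1024
    print(f"{arch}: {entries_per_page} entries/page -> {max_io_kb}kb max I/O")
```

This gives 256 entries and 1024kb on i386 versus 128 entries and 512kb on x86-64, matching the range Jens quotes; doubling the page size to 8kb doubles both the entries per page and the data each entry can map, but the struct growth on 64-bit still halves the entry count.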
Re: Questions about the speed when MD-RAID array is being initialized.
On Thu May 10, 2007 at 05:33:17PM -0400, Justin Piszcz wrote:

> On Thu, 10 May 2007, Liang Yang wrote:
>
>> Hi,
>>
>> I created an MD-RAID5 array using 8 Maxtor SAS disk drives (chunk size
>> is 256k). I have measured the data transfer speed for a single SAS
>> disk drive (the physical drive, not a filesystem on it); it is roughly
>> 80~90MB/s.
>>
>> However, I notice MD also reports a speed for the RAID5 array while it
>> is being initialized (cat /proc/mdstat). The speed reported by MD is
>> not constant: it ranges roughly from 70MB/s to 90MB/s (the average is
>> 85MB/s, which is very close to the single-disk transfer speed).
>>
>> I just have three questions:
>>
>> 1. What is the exact meaning of the array speed reported by MD? Is it
>> measured for the whole array (I used 8 disks) or for just a single
>> underlying disk? If it is for the whole array, then 70~90MB/s seems
>> too low considering 8 disks are used for this array.
>>
>> 2. How is this speed measured, and what I/O size is used when it is
>> measured?
>>
>> 3. From the beginning of the MD-RAID5 array initialization to the end,
>> the speed reported by MD gradually decreases from 90MB/s down to
>> 70MB/s. Why does the speed change? Why does the speed gradually
>> decrease? Could anyone give me some explanation? I'm using RHEL 4U4
>> with a 2.6.18 kernel; the mdadm version is 1.6.
>>
>> Thanks a lot,
>>
>> Liang
>
> For no. 3: because it starts at the fast end of the disk and works its
> way to the slower part (slower speeds).

And I'd assume for no. 1 it's because it's only writing to a single disk
at this point, so it will obviously be limited to the transfer rate of a
single disk. RAID5 arrays are created as a degraded array and then the
final disk is recovered - this is done so that the array is ready for use
very quickly. So what you're seeing in /proc/mdstat is the speed of
calculating and writing the data for the final drive (and is, unless
computationally limited, going to be the write speed of a single drive).
HTH,
    Robin

-- 
   ( ' }     | Robin Hill        [EMAIL PROTECTED] |
   / / )     | Little Jim says                     |
  // !!      | "He fallen in de water !!"          |
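Robin's point - that the /proc/mdstat figure during initial RAID5 sync tracks a single drive, not the whole array - can be put into numbers. A quick back-of-the-envelope sketch, where the 85MB/s figure is Liang's measurement and the 300GB per-disk capacity is an assumption for illustration:

```python
n_disks = 8                   # from the thread
single_drive_mb_s = 85        # Liang's measured single-disk rate
capacity_gb = 300             # assumed per-disk capacity (not in the thread)

# What one might naively expect for an 8-disk array...
naive = n_disks * single_drive_mb_s
print(f"naive expectation: {naive} MB/s")

# ...versus what recovery can actually do: md reads n-1 drives, computes
# parity and writes the one recovering drive, so the /proc/mdstat figure
# is bounded by a single drive's write speed.
print(f"expected resync rate: ~{single_drive_mb_s} MB/s")

# Time for the initial sync: one full drive written at that rate.
hours = capacity_gb * 1024 / single_drive_mb_s / 3600
print(f"estimated initial sync time: ~{hours:.1f} hours")
```

The reported 70~90MB/s is therefore exactly in line with a single drive's rate, not a sign that seven disks' worth of bandwidth has gone missing.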
Re: Please revert 5b479c91da90eef605f851508744bfe8269591a0 (md partition rescan)
Satyam Sharma wrote:
> On 5/10/07, Xavier Bestel [EMAIL PROTECTED] wrote:
>> On Thu, 2007-05-10 at 16:51 +0200, Jan Engelhardt wrote:
>>> (But Andrew never saw your email, I suspect: [EMAIL PROTECTED] is
>>> probably some strange mixup of Andrew Morton and Andi Kleen in your
>>> mind ;)
>>
>> What do the letters "kp" stand for?
>
> Heh ... I've always wanted to know that myself. It's funny, no one
> seems to have asked that on lkml during all these years (at least none
> that a Google search would throw up). Keep Patching?

Unlikely. akpm seems to be a pre-Linux-kernel nick.

http://en.wikipedia.org/wiki/Andrew_Morton_%28computer_programmer%29

	-hpa
Re: removed disk md-device
On Thursday May 10, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
>> On Wednesday May 9, [EMAIL PROTECTED] wrote:
>>> Neil Brown [EMAIL PROTECTED] [2007.04.02.0953 +0200]:
>>>> Hmmm... this is somewhat awkward. You could argue that udev should
>>>> be taught to remove the device from the array before removing the
>>>> device from /dev. But I'm not convinced that you always want to
>>>> 'fail' the device. It is possible in this case that the array is
>>>> quiescent and you might like to shut it down without registering a
>>>> device failure...
>>>
>>> Hmm, the kernel advised hotplug to remove the device from /dev, but
>>> you don't want to remove it from md? Do you have an example for that
>>> case?
>>
>> Until there is known to be an inconsistency among the devices in an
>> array, you don't want to record that there is.
>>
>> Suppose I have two USB drives with a mounted but quiescent filesystem
>> on a raid1 across them. I pull them both out, one after the other, to
>> take them to my friend's place. I plug them both in and find that the
>> array is degraded, because as soon as I unplugged one, the other was
>> told that it was now the only one.
>
> And, in truth, so it was.

So what was? It is true that now one drive is the only one plugged in,
but is that relevant? Is it true that the one drive is the only drive in
the array?? That depends on what you mean by "the array". If I am moving
the array to another computer, then the one drive still plugged into the
first computer is not the only drive in the array from my perspective.

If there is a write request, and it can only be written to one drive
(because the other is unplugged), then it becomes appropriate to tell the
still-present drive that it is the only drive in the array.

> Who updated the event count though?

Sorry, not enough words. I don't know what you are asking.

>> Not good. Best to wait for an IO request that actually returns an
>> error.
>
> Ah, now would that be a good time to update the event count?

Yes. Of course. It is an event (IO failed). That makes it a good time to
update the event count.
> am I missing something here? Maybe you should allow drives to be
> removed even if they aren't faulty or spare? A write to a removed
> device would mark it faulty in the other devices without waiting for a
> timeout.

Maybe, but I'm not sure what the real gain would be.

> But joggling a usb stick (similar to your use case) would probably be
> OK since it would be hot-removed and then hot-added. This still needs
> user-space interaction.

If the USB layer detects a removal and a re-insert, sdb may well come
back as something different (sdp?) - though I'm not completely familiar
with how USB storage works. In any case, it should really be a
user-space decision what happens then. A hot re-add may well be
appropriate, but I wouldn't want the kernel to make that decision.

NeilBrown
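The distinction Neil draws - an unplug by itself records nothing, while a failed write is the event that should bump the event count and degrade the array - can be sketched as a toy model. This is purely illustrative Python, not md's real superblock code; the class, device names, and structure are all made up:

```python
# Toy model of the semantics discussed above: a clean hot-removal leaves
# the per-device event counts in agreement (so the array can later be
# reassembled undegraded), while a write that cannot reach a missing
# device is the point where the absentee is marked faulty and the
# surviving superblocks' event counts are bumped.

class Raid1Toy:
    def __init__(self):
        self.events = {"sda": 0, "sdb": 0}   # per-device superblock event count
        self.present = {"sda": True, "sdb": True}
        self.faulty = set()

    def unplug(self, dev):
        # Hot-removal alone records nothing: there is no inconsistency yet.
        self.present[dev] = False

    def write(self):
        writable = [d for d in self.present
                    if self.present[d] and d not in self.faulty]
        missing = [d for d in self.present if not self.present[d]]
        if missing and writable:
            # The write only reaches some devices: now there really is an
            # inconsistency, so mark the absentees faulty and bump the
            # event count on the devices that took the write.
            for d in missing:
                self.faulty.add(d)
            for d in writable:
                self.events[d] += 1

    def reassembles_cleanly(self):
        return len(set(self.events.values())) == 1 and not self.faulty

quiet = Raid1Toy()
quiet.unplug("sda")
quiet.unplug("sdb")                   # both pulled, nothing written
print(quiet.reassembles_cleanly())    # True: event counts still agree

busy = Raid1Toy()
busy.unplug("sdb")
busy.write()                          # IO while sdb missing: sdb goes faulty
print(busy.reassembles_cleanly())     # False: array is now degraded
```

This mirrors the USB-drives example: pulling both halves of a quiescent raid1 leaves nothing to record, and only an actual failed IO justifies updating the event count.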