3ware and dmraid - duplicate serial numbers (!)
Hi everyone,

One of my boxes crashed (with a hardware error, I think - CPU and motherboard replacements are on their way). I booted it up on a rescue disk (Fedora 8) to let the software raid sync up.

While it was running I noticed that one of the disks was listed as dm-5 and ... uh-oh ... there was a disk missing. It turned out that the multipath code had, for some reason, set up two of the disks as /dev/mapper/mpath0, and md was now syncing to this device.

Much later I figured out that dmraid -b reported two of the disks as being the same:

/dev/sda: 976541696 total, W553841781E0A2001842
/dev/sdb: 976773168 total, V600VXZG
/dev/sdc: 586114704 total, U1757241
/dev/sdd: 976773168 total, U1907712
/dev/sde: 976773168 total, U2133609
/dev/sdf: 976773168 total, D2994402
/dev/sdg: 625140335 total, U2130349
/dev/sdh: 976773168 total, U1541228
/dev/sdi: 976771055 total, W5267124
/dev/sdj: 976773168 total, U1409513
/dev/sdk: 976773168 total, U1409513

Any idea how this could happen? All 11 disks are on a 3ware 9650 controller (the first one is a single 3ware device, the rest are JBOD). I tried rebooting and booting from a Fedora 7 DVD, with the same result.

[ All this of course seems to have messed up my raid10 badly -- more on that tomorrow. ]

 - ask

--
http://develooper.com/ - http://askask.com/
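For anyone who hits the same thing: one way to keep the rescue environment's multipath from claiming the 3ware disks while sorting this out is to blacklist them and flush the bogus map. This is a rough sketch only, assuming Fedora's device-mapper-multipath tools; the device pattern just mirrors the sda..sdk names above:

# /etc/multipath.conf -- illustrative; adjust the pattern to the devices
# multipath should leave alone (here the whole sda..sdk set).
blacklist {
        devnode "^sd[a-k]"
}

# Flush the stale mpath0 map and keep multipathd out of the way while
# the md arrays are being repaired.
multipath -F
service multipathd stop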
Re: 3ware and erroneous multipathing - duplicate serial numbers (was: 3ware and dmraid)
On Jan 18, 2008, at 3:17 AM, Ask Bjørn Hansen wrote:

[ Uh, I just realized that I forgot to update the subject line as I figured out what was going on; it's obviously not a software raid problem but a multipath problem ]

> One of my boxes crashed (with a hardware error, I think - CPU and motherboard replacements are on their way). I booted it up on a rescue disk (Fedora 8) to let the software raid sync up.
>
> When it was running I noticed that one of the disks were listed as dm-5 and ... uh-oh ... there was a disk missing. I figured out that the multipath stuff for some reason had setup two of the disks as /dev/mapper/mpath0 and now md was syncing to this device.
>
> Much later I figured out that dmraid -b reported two of the disks as being the same:
>
> /dev/sda: 976541696 total, W553841781E0A2001842
> /dev/sdb: 976773168 total, V600VXZG
> /dev/sdc: 586114704 total, U1757241
> /dev/sdd: 976773168 total, U1907712
> /dev/sde: 976773168 total, U2133609
> /dev/sdf: 976773168 total, D2994402
> /dev/sdg: 625140335 total, U2130349
> /dev/sdh: 976773168 total, U1541228
> /dev/sdi: 976771055 total, W5267124
> /dev/sdj: 976773168 total, U1409513
> /dev/sdk: 976773168 total, U1409513
>
> Any idea how this could happen? All 11 disks are on a 3ware 9650 controller (the first one is a single 3ware device, the rest are JBOD). I tried rebooting and booting on a Fedora 7 DVD with the same result.
>
> [ all this of course seems to have messed up my raid10 badly -- more on that tomorrow ]

 - ask

--
http://develooper.com/ - http://askask.com/
[PATCH 004 of 4] md: Fix an occasional deadlock in raid5 - FIX
(This should be merged with fix-occasional-deadlock-in-raid5.patch)

As we don't call handle_stripe in make_request any more, we need to clear STRIPE_DELAYED (previously done by handle_stripe) to ensure that we test whether the stripe still needs to be delayed or not.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid5.c |    1 +
 1 file changed, 1 insertion(+)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c	2008-01-18 14:58:55.000000000 +1100
+++ ./drivers/md/raid5.c	2008-01-18 14:59:53.000000000 +1100
@@ -3549,6 +3549,7 @@ static int make_request(struct request_q
 			}
 			finish_wait(&conf->wait_for_overlap, &w);
 			set_bit(STRIPE_HANDLE, &sh->state);
+			clear_bit(STRIPE_DELAYED, &sh->state);
 			release_stripe(sh);
 		} else {
 			/* cannot get stripe for read-ahead, just give-up */
[PATCH 003 of 4] md: Change ITERATE_RDEV_GENERIC to rdev_for_each_list, and remove ITERATE_RDEV_PENDING.
Finish the ITERATE_ to for_each conversion.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c           |    8 ++++----
 ./include/linux/raid/md_k.h |   14 ++++----------
 2 files changed, 8 insertions(+), 14 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-01-18 11:19:09.000000000 +1100
+++ ./drivers/md/md.c	2008-01-18 11:19:24.000000000 +1100
@@ -3766,7 +3766,7 @@ static void autorun_devices(int part)
 		printk(KERN_INFO "md: considering %s ...\n",
 			bdevname(rdev0->bdev,b));
 		INIT_LIST_HEAD(&candidates);
-		ITERATE_RDEV_PENDING(rdev,tmp)
+		rdev_for_each_list(rdev, tmp, pending_raid_disks)
 			if (super_90_load(rdev, rdev0, 0) >= 0) {
 				printk(KERN_INFO "md:  adding %s ...\n",
 					bdevname(rdev->bdev,b));
@@ -3810,7 +3810,7 @@ static void autorun_devices(int part)
 		} else {
 			printk(KERN_INFO "md: created %s\n", mdname(mddev));
 			mddev->persistent = 1;
-			ITERATE_RDEV_GENERIC(candidates,rdev,tmp) {
+			rdev_for_each_list(rdev, tmp, candidates) {
 				list_del_init(&rdev->same_set);
 				if (bind_rdev_to_array(rdev, mddev))
 					export_rdev(rdev);
@@ -3821,7 +3821,7 @@ static void autorun_devices(int part)
 		}
 		/* on success, candidates will be empty, on error
 		 * it won't... */
-		ITERATE_RDEV_GENERIC(candidates,rdev,tmp)
+		rdev_for_each_list(rdev, tmp, candidates)
 			export_rdev(rdev);
 		mddev_put(mddev);
 	}
@@ -4936,7 +4936,7 @@ static void status_unused(struct seq_fil
 
 	seq_printf(seq, "unused devices: ");
 
-	ITERATE_RDEV_PENDING(rdev,tmp) {
+	rdev_for_each_list(rdev, tmp, pending_raid_disks) {
 		char b[BDEVNAME_SIZE];
 		i++;
 		seq_printf(seq, "%s ",

diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h	2008-01-18 11:19:09.000000000 +1100
+++ ./include/linux/raid/md_k.h	2008-01-18 11:19:24.000000000 +1100
@@ -313,23 +313,17 @@ static inline char * mdname (mddev_t * m
  * iterates through some rdev ringlist. It's safe to remove the
  * current 'rdev'. Dont touch 'tmp' though.
  */
-#define ITERATE_RDEV_GENERIC(head,rdev,tmp)				\
+#define rdev_for_each_list(rdev, tmp, list)				\
 									\
-	for ((tmp) = (head).next;					\
+	for ((tmp) = (list).next;					\
 		(rdev) = (list_entry((tmp), mdk_rdev_t, same_set)),	\
-			(tmp) = (tmp)->next, (tmp)->prev != &(head)	\
+			(tmp) = (tmp)->next, (tmp)->prev != &(list)	\
 		; )
 /*
  * iterates through the 'same array disks' ringlist
  */
 #define rdev_for_each(rdev, tmp, mddev)				\
-	ITERATE_RDEV_GENERIC((mddev)->disks,rdev,tmp)
-
-/*
- * Iterates through 'pending RAID disks'
- */
-#define ITERATE_RDEV_PENDING(rdev,tmp)					\
-	ITERATE_RDEV_GENERIC(pending_raid_disks,rdev,tmp)
+	rdev_for_each_list(rdev, tmp, (mddev)->disks)
 
 typedef struct mdk_thread_s {
 	void			(*run) (mddev_t *mddev);
[PATCH 002 of 4] md: Allow devices to be shared between md arrays.
Currently, a given device is claimed by a particular array so that it cannot be used by other arrays. This is not ideal for DDF and other metadata schemes which have their own partitioning concept.

So for externally managed metadata, just claim the device for md in general, require that "offset" and "size" are set properly for each device, and make sure that if a device is included in different arrays then the active sections do not overlap.

This involves adding another flag to the rdev, which makes it awkward to set "->flags = 0" to clear certain flags. So now clear flags explicitly by name when we want to clear things.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c           |   88 +++++++++++++++++++++++++++++++++++++-----
 ./include/linux/raid/md_k.h |    2 +
 2 files changed, 80 insertions(+), 10 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-01-18 11:03:15.000000000 +1100
+++ ./drivers/md/md.c	2008-01-18 11:18:04.000000000 +1100
@@ -774,7 +774,11 @@ static int super_90_validate(mddev_t *md
 	__u64 ev1 = md_event(sb);
 
 	rdev->raid_disk = -1;
-	rdev->flags = 0;
+	clear_bit(Faulty, &rdev->flags);
+	clear_bit(In_sync, &rdev->flags);
+	clear_bit(WriteMostly, &rdev->flags);
+	clear_bit(BarriersNotsupp, &rdev->flags);
+
 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 0;
 		mddev->minor_version = sb->minor_version;
@@ -1154,7 +1158,11 @@ static int super_1_validate(mddev_t *mdd
 	__u64 ev1 = le64_to_cpu(sb->events);
 
 	rdev->raid_disk = -1;
-	rdev->flags = 0;
+	clear_bit(Faulty, &rdev->flags);
+	clear_bit(In_sync, &rdev->flags);
+	clear_bit(WriteMostly, &rdev->flags);
+	clear_bit(BarriersNotsupp, &rdev->flags);
+
 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 1;
 		mddev->patch_version = 0;
@@ -1402,7 +1410,7 @@ static int bind_rdev_to_array(mdk_rdev_t
 		goto fail;
 	}
 	list_add(&rdev->same_set, &mddev->disks);
-	bd_claim_by_disk(rdev->bdev, rdev, mddev->gendisk);
+	bd_claim_by_disk(rdev->bdev, rdev->bdev->bd_holder, mddev->gendisk);
 	return 0;
 
  fail:
@@ -1442,7 +1450,7 @@ static void unbind_rdev_from_array(mdk_r
  * otherwise reused by a RAID array (or any other kernel
  * subsystem), by bd_claiming the device.
 */
-static int lock_rdev(mdk_rdev_t *rdev, dev_t dev)
+static int lock_rdev(mdk_rdev_t *rdev, dev_t dev, int shared)
 {
 	int err = 0;
 	struct block_device *bdev;
@@ -1454,13 +1462,15 @@ static int lock_rdev(mdk_rdev_t *rdev, d
 			__bdevname(dev, b));
 		return PTR_ERR(bdev);
 	}
-	err = bd_claim(bdev, rdev);
+	err = bd_claim(bdev, shared ? (mdk_rdev_t *)lock_rdev : rdev);
 	if (err) {
 		printk(KERN_ERR "md: could not bd_claim %s.\n",
 			bdevname(bdev, b));
 		blkdev_put(bdev);
 		return err;
 	}
+	if (!shared)
+		set_bit(AllReserved, &rdev->flags);
 	rdev->bdev = bdev;
 	return err;
 }
@@ -1925,7 +1935,8 @@ slot_store(mdk_rdev_t *rdev, const char
 			return -ENOSPC;
 		rdev->raid_disk = slot;
 		/* assume it is working */
-		rdev->flags = 0;
+		clear_bit(Faulty, &rdev->flags);
+		clear_bit(WriteMostly, &rdev->flags);
 		set_bit(In_sync, &rdev->flags);
 	}
 	return len;
@@ -1950,6 +1961,10 @@ offset_store(mdk_rdev_t *rdev, const cha
 		return -EINVAL;
 	if (rdev->mddev->pers)
 		return -EBUSY;
+	if (rdev->size && rdev->mddev->external)
+		/* Must set offset before size, so overlap checks
+		 * can be sane */
+		return -EBUSY;
 	rdev->data_offset = offset;
 	return len;
 }
@@ -1963,16 +1978,69 @@ rdev_size_show(mdk_rdev_t *rdev, char *p
 	return sprintf(page, "%llu\n", (unsigned long long)rdev->size);
 }
 
+static int overlaps(sector_t s1, sector_t l1, sector_t s2, sector_t l2)
+{
+	/* check if two start/length pairs overlap */
+	if (s1+l1 <= s2)
+		return 0;
+	if (s2+l2 <= s1)
+		return 0;
+	return 1;
+}
+
 static ssize_t
 rdev_size_store(mdk_rdev_t *rdev, const char *buf, size_t len)
 {
 	char *e;
 	unsigned long long size = simple_strtoull(buf, &e, 10);
+	unsigned long long oldsize = rdev->size;
 	if (e==buf || (*e && *e != '\n'))
 		return -EINVAL;
 	if (rdev->mddev->pers)
 		return -EBUSY;
 	rdev->size = size;
+	if (size > oldsize && rdev->mddev->external) {
+		/* need to check that all other rdevs with the same ->bdev
+		 * do not overlap.  We need to unlock the mddev to avoid
+		 * a deadlock.  We have already changed rdev->size, and if
[PATCH 000 of 4] md: assorted md patches - please read carefully.
Following are 4 patches for md.

The first two replace md-allow-devices-to-be-shared-between-md-arrays.patch, which was recently removed. They should go at the same place in the series, between md-allow-a-maximum-extent-to-be-set-for-resyncing.patch and md-lock-address-when-changing-attributes-of-component-devices.patch.

The third is a replacement for md-change-iterate_rdev_generic-to-rdev_for_each_list-and-remove-iterate_rdev_pending.patch, which conflicts with the above change.

The last is a fix for md-fix-an-occasional-deadlock-in-raid5.patch which makes me a lot happier about that patch. It introduced a performance regression and I now understand why. I'm now happy for that patch, with this fix, to go into 2.6.24 if that is convenient (if not, 2.6.24.1 will do).

Thanks,
NeilBrown

 [PATCH 001 of 4] md: Set and test the ->persistent flag for md devices more consistently.
 [PATCH 002 of 4] md: Allow devices to be shared between md arrays.
 [PATCH 003 of 4] md: Change ITERATE_RDEV_GENERIC to rdev_for_each_list, and remove ITERATE_RDEV_PENDING.
 [PATCH 004 of 4] md: Fix an occasional deadlock in raid5 - FIX
[PATCH 001 of 4] md: Set and test the ->persistent flag for md devices more consistently.
If you try to start an array for which the number of raid disks is listed as zero, md will currently try to read metadata off any devices that have been given. This was done because the value of raid_disks is used to signal whether array details have been provided by userspace (raid_disks > 0) or must be read from the devices (raid_disks == 0).

However, for an array without persistent metadata (or with externally managed metadata) this is the wrong thing to do. So we add a test in do_md_run to give an error if raid_disks is zero for non-persistent arrays.

This requires that mddev->persistent is set correctly at this point, which it currently isn't for in-kernel autodetected arrays. So set ->persistent for autodetect arrays, and remove the setting in super_*_validate, which is now redundant.

Also clear ->persistent when stopping an array so it is consistently zero when starting an array.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-01-18 10:46:49.000000000 +1100
+++ ./drivers/md/md.c	2008-01-18 11:03:15.000000000 +1100
@@ -779,7 +779,6 @@ static int super_90_validate(mddev_t *md
 		mddev->major_version = 0;
 		mddev->minor_version = sb->minor_version;
 		mddev->patch_version = sb->patch_version;
-		mddev->persistent = 1;
 		mddev->external = 0;
 		mddev->chunk_size = sb->chunk_size;
 		mddev->ctime = sb->ctime;
@@ -1159,7 +1158,6 @@ static int super_1_validate(mddev_t *mdd
 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 1;
 		mddev->patch_version = 0;
-		mddev->persistent = 1;
 		mddev->external = 0;
 		mddev->chunk_size = le32_to_cpu(sb->chunksize) << 9;
 		mddev->ctime = le64_to_cpu(sb->ctime) & ((1ULL << 32)-1);
@@ -3219,8 +3217,11 @@ static int do_md_run(mddev_t * mddev)
 
 	/*
 	 * Analyze all RAID superblock(s)
 	 */
-	if (!mddev->raid_disks)
+	if (!mddev->raid_disks) {
+		if (!mddev->persistent)
+			return -EINVAL;
 		analyze_sbs(mddev);
+	}
 
 	chunk_size = mddev->chunk_size;
@@ -3627,6 +3628,7 @@ static int do_md_stop(mddev_t * mddev, i
 		mddev->resync_max = MaxSector;
 		mddev->reshape_position = MaxSector;
 		mddev->external = 0;
+		mddev->persistent = 0;
 	} else if (mddev->pers)
 		printk(KERN_INFO "md: %s switched to read-only mode.\n",
@@ -3735,6 +3737,7 @@ static void autorun_devices(int part)
 			mddev_unlock(mddev);
 		} else {
 			printk(KERN_INFO "md: created %s\n", mdname(mddev));
+			mddev->persistent = 1;
 			ITERATE_RDEV_GENERIC(candidates,rdev,tmp) {
 				list_del_init(&rdev->same_set);
 				if (bind_rdev_to_array(rdev, mddev))
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Fri, 18 Jan 2008, Bill Davidsen wrote:

> Justin Piszcz wrote:
>> On Thu, 17 Jan 2008, Al Boldi wrote:
>>> Justin Piszcz wrote:
>>>> On Wed, 16 Jan 2008, Al Boldi wrote:
>>>>> Also, can you retest using dd with different block-sizes?
>>>>
>>>> I can do this, moment.. I know about oflag=direct but I choose to use dd with sync and measure the total time it takes.
>>>>
>>>> /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync'
>>>>
>>>> So I was asked on the mailing list to test dd with various chunk sizes; here is the length of time it took to write 10 GiB and sync for each chunk size:
>>>>
>>>> 4=chunk.txt:0:25.46
>>>> 8=chunk.txt:0:25.63
>>>> 16=chunk.txt:0:25.26
>>>> 32=chunk.txt:0:25.08
>>>> 64=chunk.txt:0:25.55
>>>> 128=chunk.txt:0:25.26
>>>> 256=chunk.txt:0:24.72
>>>> 512=chunk.txt:0:24.71
>>>> 1024=chunk.txt:0:25.40
>>>> 2048=chunk.txt:0:25.71
>>>> 4096=chunk.txt:0:27.18
>>>> 8192=chunk.txt:0:29.00
>>>> 16384=chunk.txt:0:31.43
>>>> 32768=chunk.txt:0:50.11
>>>> 65536=chunk.txt:2:20.80
>>>
>>> What do you get with bs=512,1k,2k,4k,8k,16k... Thanks!
>>>
>>> --
>>> Al
>>
>> root      4621  0.0  0.0  12404   760 pts/2   D+   17:53   0:00 mdadm -S /dev/md3
>> root      4664  0.0  0.0   4264   728 pts/5   S+   17:54   0:00 grep D
>>
>> Tried to stop it when it was re-syncing, DEADLOCK :(
>>
>> [  305.464904] md: md3 still in use.
>> [  314.595281] md: md_do_sync() got signal ... exiting
>>
>> Anyhow, done testing, time to move data back on if I can kill the resync process w/out deadlock.
>
> So does that indicate that there is still a deadlock issue, or that you don't have the latest patches installed?
>
> --
> Bill Davidsen <[EMAIL PROTECTED]>
>   "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark

I was trying to stop the raid when it was building, vanilla 2.6.23.14.

Justin.
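For reference, a minimal sketch of the kind of loop behind the numbers quoted above, using the same /usr/bin/time + dd + sync recipe; the /r1 mount point comes from the quoted command, everything else (output file names, block-size list) is illustrative:

total=$((10*1024*1024*1024))   # write 10 GiB per run, as in the tests above
for bs in 512 1024 2048 4096 8192 16384 65536 1048576; do
    count=$((total / bs))
    /usr/bin/time -f %E -o ~/bs-$bs.txt \
        bash -c "dd if=/dev/zero of=/r1/bigfile bs=$bs count=$count; sync"
    rm -f /r1/bigfile          # start each run from a clean slate
done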
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
> Also, don't use ext*, XFS can be up to 2-3x faster (in many of the benchmarks).

I'm going to swap file systems and give it a shot right now! :)

How is the stability of XFS? I've heard recovery is easier with ext2/3 due to more people using it, more tools available, etc.

Greg
Re: Raid over 48 disks ... for real now
Quoting Norman Elton [EMAIL PROTECTED]:

> I posed the question a few weeks ago about how to best accommodate software RAID over an array of 48 disks (a Sun X4500 server, a.k.a. Thumper). I appreciate all the suggestions.
>
> Well, the hardware is here. It is indeed six Marvell 88SX6081 SATA controllers, each with eight 1TB drives, for a total raw storage of 48TB. I must admit, it's quite impressive. And loud. More information about the hardware is available online...
>
> http://www.sun.com/servers/x64/x4500/arch-wp.pdf
>
> It came loaded with Solaris, configured with ZFS. Things seemed to work fine. I did not do any benchmarks, but I can revert to that configuration if necessary.
>
> Now I've loaded RHEL onto the box. For a first shot, I've created one RAID-5 array (+ 1 spare) on each of the controllers, then used LVM to create a VolGroup across the arrays. So now I'm trying to figure out what to do with this space. So far, I've tested mke2fs on a 1TB and a 5TB LogVol. I wish RHEL would support XFS/ZFS, but for now, I'm stuck with ext3.
>
> Am I better off sticking with relatively small partitions (2-5 TB), or should I crank up the block size and go for one big partition?

Impressive system. I'm curious what the storage drives look like and how they attach to the server with that many disks.

Sounds like you have some time to play around before shoving it into production. I wonder how long it would take to run an fsck on one large filesystem?

Cheers,

Mike
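For readers following along, a minimal sketch of the layout Norman describes (one RAID-5 array plus a spare per controller, glued together with LVM); the device names, md numbers and volume group name are illustrative, not taken from his setup:

# One RAID-5 set per controller: 7 active disks + 1 spare out of that
# controller's eight drives.  Repeat for each of the six controllers.
mdadm --create /dev/md0 --level=5 --raid-devices=7 --spare-devices=1 \
      /dev/sd[b-i]

# Then pool the six arrays into one LVM volume group and carve out
# logical volumes of whatever size you want to format.
pvcreate /dev/md[0-5]
vgcreate thumper_vg /dev/md[0-5]
lvcreate -L 5T -n test5t thumper_vg
mke2fs -j /dev/thumper_vg/test5t      # ext3, as in the original post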
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Fri, 18 Jan 2008, Greg Cormier wrote:

>> Also, don't use ext*, XFS can be up to 2-3x faster (in many of the benchmarks).
>
> I'm going to swap file systems and give it a shot right now! :)
>
> How is stability of XFS? I heard recovery is easier with ext2/3 due to more people using it, more tools available, etc?
>
> Greg

Recovery is actually easier with XFS because the repair code is built into the kernel (you don't need a separate utility to fix it); however, there is xfs_repair for the cases the in-kernel code cannot fix. I have been using it for 4-5 years now.

Also, with CoRaids (ATA over Ethernet) many of them are above 8TB, and ext3 only works up to 8TB, so it's not even an option any longer.

Justin.
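To make the split between the in-kernel recovery and the userspace tool concrete, a small illustration; /dev/md3 and /r1 here are just placeholder names borrowed from elsewhere in the thread:

# After a crash, XFS normally just replays its journal when you mount it:
mount /dev/md3 /r1

# Only if the mount fails do you reach for xfs_repair, on the unmounted
# device.  -n is "no modify": report what would be fixed without writing.
xfs_repair -n /dev/md3
xfs_repair /dev/md3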
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Fri, 18 Jan 2008, Greg Cormier wrote:

> Justin, thanks for the script. Here's my results. I ran it a few times with different tests, hence the small number of results you see here; I slowly trimmed out the obviously not-ideal sizes.

Nice, we all love benchmarks!! :)

> System
> ---
> Athlon64 3500
> 2GB RAM
> 4x500GB WD Raid editions, raid 5. SDE is the old 4-platter version (5000YS), the others are the 3-platter version. Faster :-)

Ok.

> /dev/sdb:
>  Timing buffered disk reads:  240 MB in  3.00 seconds =  79.91 MB/sec
> /dev/sdc:
>  Timing buffered disk reads:  248 MB in  3.01 seconds =  82.36 MB/sec
> /dev/sdd:
>  Timing buffered disk reads:  248 MB in  3.02 seconds =  82.22 MB/sec
> /dev/sde: (older model, 4 platters instead of 3)
>  Timing buffered disk reads:  210 MB in  3.01 seconds =  69.87 MB/sec
> /dev/md3:
>  Timing buffered disk reads:  628 MB in  3.00 seconds = 209.09 MB/sec
>
> Testing
> ---
> Test was: dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync
>
> 64-chunka.txt:2:00.63
> 128-chunka.txt:2:00.20
> 256-chunka.txt:2:01.67
> 512-chunka.txt:2:19.90
> 1024-chunka.txt:2:59.32

For your configuration, a 64-256k chunk seems optimal for this, hypothetical, benchmark :)

> Test was: Unraring multipart RARs, 1.2 gigabytes. Source and dest drive were the raid array.
>
> 64-chunkc.txt:1:04.20
> 128-chunkc.txt:0:49.37
> 256-chunkc.txt:0:48.88
> 512-chunkc.txt:0:41.20
> 1024-chunkc.txt:0:40.82

1 meg looks like it's the best, which is what I use today; a 1 MiB chunk offers the best performance by far, at least with all of my testing (with big files) such as the tests you performed.

> So, there's a toss-up between 256 and 512.

Yeah, for dd performance, not real life.

> If I'm interpreting correctly here, raw throughput is better with 256, but 512 seems to work better with real-world stuff?

Look above, 1 MiB got you the fastest unrar time.

> I'll try to think up another test or two perhaps, and remove 64 as one of the possible options to save time (mke2fs takes a while on 1.5TB).

Also, don't use ext*, XFS can be up to 2-3x faster (in many of the benchmarks).

> Next step will be playing with read-aheads and stripe cache sizes I guess! I'm open to any comments/suggestions you guys have!
>
> Greg
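For the read-ahead and stripe-cache experiments Greg mentions, the usual knobs look like this; a sketch only, assuming the array is /dev/md3 as elsewhere in the thread, and the values are just starting points rather than recommendations:

# Device read-ahead, in 512-byte sectors (16384 sectors = 8 MiB here).
blockdev --setra 16384 /dev/md3

# RAID-5/6 stripe cache size, in pages per device.
echo 4096 > /sys/block/md3/md/stripe_cache_size

# Check the current values.
blockdev --getra /dev/md3
cat /sys/block/md3/md/stripe_cache_size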
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
Justin Piszcz wrote:
> On Thu, 17 Jan 2008, Al Boldi wrote:
>> Justin Piszcz wrote:
>>> On Wed, 16 Jan 2008, Al Boldi wrote:
>>>> Also, can you retest using dd with different block-sizes?
>>>
>>> I can do this, moment.. I know about oflag=direct but I choose to use dd with sync and measure the total time it takes.
>>>
>>> /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync'
>>>
>>> So I was asked on the mailing list to test dd with various chunk sizes; here is the length of time it took to write 10 GiB and sync for each chunk size:
>>>
>>> 4=chunk.txt:0:25.46
>>> 8=chunk.txt:0:25.63
>>> 16=chunk.txt:0:25.26
>>> 32=chunk.txt:0:25.08
>>> 64=chunk.txt:0:25.55
>>> 128=chunk.txt:0:25.26
>>> 256=chunk.txt:0:24.72
>>> 512=chunk.txt:0:24.71
>>> 1024=chunk.txt:0:25.40
>>> 2048=chunk.txt:0:25.71
>>> 4096=chunk.txt:0:27.18
>>> 8192=chunk.txt:0:29.00
>>> 16384=chunk.txt:0:31.43
>>> 32768=chunk.txt:0:50.11
>>> 65536=chunk.txt:2:20.80
>>
>> What do you get with bs=512,1k,2k,4k,8k,16k... Thanks!
>>
>> --
>> Al
>
> root      4621  0.0  0.0  12404   760 pts/2   D+   17:53   0:00 mdadm -S /dev/md3
> root      4664  0.0  0.0   4264   728 pts/5   S+   17:54   0:00 grep D
>
> Tried to stop it when it was re-syncing, DEADLOCK :(
>
> [  305.464904] md: md3 still in use.
> [  314.595281] md: md_do_sync() got signal ... exiting
>
> Anyhow, done testing, time to move data back on if I can kill the resync process w/out deadlock.

So does that indicate that there is still a deadlock issue, or that you don't have the latest patches installed?

--
Bill Davidsen <[EMAIL PROTECTED]>
  "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark
Re: 3ware and erroneous multipathing - duplicate serial numbers (was: 3ware and dmraid)
On Jan 18, 2008, at 4:33 AM, Heinz Mauelshagen wrote:

>> Much later I figured out that dmraid -b reported two of the disks as being the same:
>
> Looks like the md sync duplicated the metadata and dmraid just spots that duplication.
>
> You gotta remove one of the duplicates to clean this up but check first which to pick in case the sync was partial only.

The event counter is the same on both; is that what I should look for?

Is there a way to reset the dmraid metadata? I'm not actually using dmraid, I use regular software raid, so I think I just need to reset the dmraid data...

 - ask

--
http://develooper.com/ - http://askask.com/
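For anyone in the same spot, a rough sketch of the checks that come before wiping anything. The device names are the two that showed the duplicate serial in the earlier dmraid -b output; the dmraid erase option (-E/--erase_metadata, used with -r) is from its man page, so verify it on your version before pointing it at a disk:

# Compare the md superblocks (event counts, update times) on the two
# devices that report the same serial, to decide which copy is stale.
mdadm --examine /dev/sdj
mdadm --examine /dev/sdk

# See what metadata dmraid has actually discovered on each device...
dmraid -r

# ...and only then erase the metadata on the device you are sure is bogus.
dmraid -r -E /dev/sdk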
Re: Raid over 48 disks ... for real now
> I wonder how long it would take to run an fsck on one large filesystem?

:) I would imagine you'd have time to order a new system, build it, and restore the backups before the fsck was done!
Re: 3ware and erroneous multipathing - duplicate serial numbers (was: 3ware and dmraid)
On Fri, Jan 18, 2008 at 03:23:24AM -0800, Ask Bjørn Hansen wrote:
> On Jan 18, 2008, at 3:17 AM, Ask Bjørn Hansen wrote:
>
> [ Uh, I just realized that I forgot to update the subject line as I figured out what was going on; it's obviously not a software raid problem but a multipath problem ]
>
>> One of my boxes crashed (with a hardware error, I think - CPU and motherboard replacements are on their way). I booted it up on a rescue disk (Fedora 8) to let the software raid sync up.
>>
>> When it was running I noticed that one of the disks were listed as dm-5 and ... uh-oh ... there was a disk missing. I figured out that the multipath stuff for some reason had setup two of the disks as /dev/mapper/mpath0 and now md was syncing to this device.
>>
>> Much later I figured out that dmraid -b reported two of the disks as being the same:

Looks like the md sync duplicated the metadata and dmraid just spots that duplication.

You gotta remove one of the duplicates to clean this up but check first which to pick in case the sync was partial only.

Regards,
Heinz    -- The LVM Guy --

>> /dev/sda: 976541696 total, W553841781E0A2001842
>> /dev/sdb: 976773168 total, V600VXZG
>> /dev/sdc: 586114704 total, U1757241
>> /dev/sdd: 976773168 total, U1907712
>> /dev/sde: 976773168 total, U2133609
>> /dev/sdf: 976773168 total, D2994402
>> /dev/sdg: 625140335 total, U2130349
>> /dev/sdh: 976773168 total, U1541228
>> /dev/sdi: 976771055 total, W5267124
>> /dev/sdj: 976773168 total, U1409513
>> /dev/sdk: 976773168 total, U1409513
>>
>> Any idea how this could happen? All 11 disks are on a 3ware 9650 controller (the first one is a single 3ware device, the rest are JBOD). I tried rebooting and booting on a Fedora 7 DVD with the same result.
>>
>> [ all this of course seems to have messed up my raid10 badly -- more on that tomorrow ]
>>
>>  - ask

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen                                 Red Hat GmbH
Consulting Development Engineer                   Am Sonnenhang 11
Storage Development                               56242 Marienrachdorf
                                                  Germany
[EMAIL PROTECTED]                                 PHONE +49 171 7803392
                                                  FAX   +49 2626 924446

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Re: Raid over 48 disks ... for real now
It is quite a box. There's a picture of the box with the cover removed on Sun's website:

http://www.sun.com/images/k3/k3_sunfirex4500_4.jpg

From the X4500 homepage, there's a gallery of additional pictures. The drives drop in from the top. Massive fans channel air in the small gaps between the drives. It doesn't look like there's much room between the disks, but a lot of cold air gets sucked in the front, and a lot of hot air comes out the back. So it must be doing its job :).

I have not tried an fsck on it yet. I'll probably set up a lot of 2TB partitions rather than a single large partition, then write the software to handle storing data across many partitions.

Norman

On 1/18/08, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Quoting Norman Elton [EMAIL PROTECTED]:
>
>> I posed the question a few weeks ago about how to best accommodate software RAID over an array of 48 disks (a Sun X4500 server, a.k.a. Thumper). I appreciate all the suggestions.
>>
>> Well, the hardware is here. It is indeed six Marvell 88SX6081 SATA controllers, each with eight 1TB drives, for a total raw storage of 48TB. I must admit, it's quite impressive. And loud. More information about the hardware is available online...
>>
>> http://www.sun.com/servers/x64/x4500/arch-wp.pdf
>>
>> It came loaded with Solaris, configured with ZFS. Things seemed to work fine. I did not do any benchmarks, but I can revert to that configuration if necessary.
>>
>> Now I've loaded RHEL onto the box. For a first-shot, I've created one RAID-5 array (+ 1 spare) on each of the controllers, then used LVM to create a VolGroup across the arrays. So now I'm trying to figure out what to do with this space. So far, I've tested mke2fs on a 1TB and a 5TB LogVol. I wish RHEL would support XFS/ZFS, but for now, I'm stuck with ext3.
>>
>> Am I better off sticking with relatively small partitions (2-5 TB), or should I crank up the block size and go for one big partition?
>
> Impressive system. I'm curious to what the storage drives look like and how they attach to the server with that many disks?
>
> Sounds like you have some time to play around before shoving it into production. I wonder how long it would take to run an fsck on one large filesystem?
>
> Cheers,
>
> Mike
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
Justin, thanks for the script. Here's my results. I ran it a few times with different tests, hence the small number of results you see here; I slowly trimmed out the obviously not-ideal sizes.

System
---
Athlon64 3500
2GB RAM
4x500GB WD Raid editions, raid 5. SDE is the old 4-platter version (5000YS), the others are the 3-platter version. Faster :-)

/dev/sdb:
 Timing buffered disk reads:  240 MB in  3.00 seconds =  79.91 MB/sec
/dev/sdc:
 Timing buffered disk reads:  248 MB in  3.01 seconds =  82.36 MB/sec
/dev/sdd:
 Timing buffered disk reads:  248 MB in  3.02 seconds =  82.22 MB/sec
/dev/sde: (older model, 4 platters instead of 3)
 Timing buffered disk reads:  210 MB in  3.01 seconds =  69.87 MB/sec
/dev/md3:
 Timing buffered disk reads:  628 MB in  3.00 seconds = 209.09 MB/sec

Testing
---
Test was: dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync

64-chunka.txt:2:00.63
128-chunka.txt:2:00.20
256-chunka.txt:2:01.67
512-chunka.txt:2:19.90
1024-chunka.txt:2:59.32

Test was: Unraring multipart RARs, 1.2 gigabytes. Source and dest drive were the raid array.

64-chunkc.txt:1:04.20
128-chunkc.txt:0:49.37
256-chunkc.txt:0:48.88
512-chunkc.txt:0:41.20
1024-chunkc.txt:0:40.82

So, there's a toss-up between 256 and 512. If I'm interpreting correctly here, raw throughput is better with 256, but 512 seems to work better with real-world stuff?

I'll try to think up another test or two perhaps, and remove 64 as one of the possible options to save time (mke2fs takes a while on 1.5TB).

Next step will be playing with read-aheads and stripe cache sizes I guess! I'm open to any comments/suggestions you guys have!

Greg
Re: Raid over 48 disks ... for real now
On Thu, 17 Jan 2008, Janek Kozicki wrote:

>> I wish RHEL would support XFS/ZFS, but for now, I'm stuck with ext3.
>
> There is ext4 (or ext4dev) - it's an ext3 modified to support a 1024 PB size (1048576 TB). You could check if it's feasible.
>
> Personally I'd always stick with ext2/ext3/ext4 since it is most widely used and thus has the best recovery tools.

Something else to keep in mind... XFS fs repair tools require large amounts of memory. If you were to create one or a few really huge filesystems on this array, you might end up with filesystems which can't be repaired because you don't have, or even can't get, a machine with enough RAM for the job... not to mention the amount of time it would take.

--
 Jon Lewis                   |  I route
 Senior Network Engineer     |  therefore you are
 Atlantic Net                |
_________ http://www.lewis.org/~jlewis/pgp for PGP public key _________