3ware and dmraid - duplicate serial numbers (!)

2008-01-18 Thread Ask Bjørn Hansen

Hi everyone,

One of my boxes crashed (with a hardware error, I think - CPU and  
motherboard replacements are on their way).  I booted it up on a  
rescue disk (Fedora 8) to let the software raid sync up.


When it was running I noticed that one of the disks was listed as
dm-5 and ... uh-oh ... there was a disk missing.  I figured out
that the multipath stuff for some reason had set up two of the disks
as /dev/mapper/mpath0, and md was now syncing to this device.


Much later I figured out that dmraid -b reported two of the disks as  
being the same:


/dev/sda:976541696 total, W553841781E0A2001842
/dev/sdb:976773168 total, V600VXZG
/dev/sdc:586114704 total, U1757241
/dev/sdd:976773168 total, U1907712
/dev/sde:976773168 total, U2133609
/dev/sdf:976773168 total, D2994402
/dev/sdg:625140335 total, U2130349
/dev/sdh:976773168 total, U1541228
/dev/sdi:976771055 total, W5267124
/dev/sdj:976773168 total, U1409513
/dev/sdk:976773168 total, U1409513

Any idea how this could happen?   All 11 disks are on a 3ware 9650  
controller (the first one is a single 3ware device, the rest are  
JBOD).  I tried rebooting and booting on a Fedora 7 DVD with the same  
result.
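
(For anyone hitting the same thing, a rough sketch of the commands that show
what device-mapper has claimed and what serial each disk reports; the device
names are just examples from the listing above, and the 3ware-specific
smartctl form is an assumption that may need adjusting for your setup.)

# what is md actually syncing to, and what has device-mapper set up?
cat /proc/mdstat
dmsetup ls
dmsetup table mpath0        # which underlying disks back mpath0?
multipath -ll

# cross-check the serial numbers reported for the two "twins"
dmraid -b
smartctl -i /dev/sdj        # behind the 3ware it may need: smartctl -i -d 3ware,9 /dev/twa0
smartctl -i /dev/sdk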


[ all this of course seems to have messed up my raid10 badly -- more  
on that tomorrow ]



 - ask

--
http://develooper.com/ - http://askask.com/




Re: 3ware and erroneous multipathing - duplicate serial numbers (was: 3ware and dmraid)

2008-01-18 Thread Ask Bjørn Hansen


On Jan 18, 2008, at 3:17 AM, Ask Bjørn Hansen wrote:

[ Uh, I just realized that I forgot to update the subject line as I  
figured out what was going on; it's obviously not a software raid  
problem but a multipath problem ]


One of my boxes crashed (with a hardware error, I think - CPU and  
motherboard replacements are on their way).  I booted it up on a  
rescue disk (Fedora 8) to let the software raid sync up.


When it was running I noticed that one of the disks was listed as
dm-5 and ... uh-oh ... there was a disk missing.  I figured out
that the multipath stuff for some reason had set up two of the disks
as /dev/mapper/mpath0, and md was now syncing to this device.


Much later I figured out that dmraid -b reported two of the disks as  
being the same:


/dev/sda:976541696 total, W553841781E0A2001842
/dev/sdb:976773168 total, V600VXZG
/dev/sdc:586114704 total, U1757241
/dev/sdd:976773168 total, U1907712
/dev/sde:976773168 total, U2133609
/dev/sdf:976773168 total, D2994402
/dev/sdg:625140335 total, U2130349
/dev/sdh:976773168 total, U1541228
/dev/sdi:976771055 total, W5267124
/dev/sdj:976773168 total, U1409513
/dev/sdk:976773168 total, U1409513

Any idea how this could happen?   All 11 disks are on a 3ware 9650  
controller (the first one is a single 3ware device, the rest are  
JBOD).  I tried rebooting and booting on a Fedora 7 DVD with the same  
result.


[ all this of course seems to have messed up my raid10 badly -- more  
on that tomorrow ]



- ask

--
http://develooper.com/ - http://askask.com/



[PATCH 004 of 4] md: Fix an occasional deadlock in raid5 - FIX

2008-01-18 Thread NeilBrown

(This should be merged with fix-occasional-deadlock-in-raid5.patch)

As we don't call stripe_handle in make_request any more, we need to
clear STRIPE_DELAYED (previously done by stripe_handle) to ensure
that we test whether the stripe still needs to be delayed or not.

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid5.c |1 +
 1 file changed, 1 insertion(+)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c	2008-01-18 14:58:55.0 +1100
+++ ./drivers/md/raid5.c	2008-01-18 14:59:53.0 +1100
@@ -3549,6 +3549,7 @@ static int make_request(struct request_q
 			}
 			finish_wait(&conf->wait_for_overlap, &w);
 			set_bit(STRIPE_HANDLE, &sh->state);
+			clear_bit(STRIPE_DELAYED, &sh->state);
 			release_stripe(sh);
 		} else {
 			/* cannot get stripe for read-ahead, just give-up */


[PATCH 003 of 4] md: Change ITERATE_RDEV_GENERIC to rdev_for_each_list, and remove ITERATE_RDEV_PENDING.

2008-01-18 Thread NeilBrown

Finish ITERATE_ to for_each conversion.

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c           |    8 ++++----
 ./include/linux/raid/md_k.h |   14 ++++----------
 2 files changed, 8 insertions(+), 14 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-01-18 11:19:09.0 +1100
+++ ./drivers/md/md.c	2008-01-18 11:19:24.0 +1100
@@ -3766,7 +3766,7 @@ static void autorun_devices(int part)
 		printk(KERN_INFO "md: considering %s ...\n",
 			bdevname(rdev0->bdev,b));
 		INIT_LIST_HEAD(&candidates);
-		ITERATE_RDEV_PENDING(rdev,tmp)
+		rdev_for_each_list(rdev, tmp, pending_raid_disks)
 			if (super_90_load(rdev, rdev0, 0) >= 0) {
 				printk(KERN_INFO "md:  adding %s ...\n",
 					bdevname(rdev->bdev,b));
@@ -3810,7 +3810,7 @@ static void autorun_devices(int part)
 		} else {
 			printk(KERN_INFO "md: created %s\n", mdname(mddev));
 			mddev->persistent = 1;
-			ITERATE_RDEV_GENERIC(candidates,rdev,tmp) {
+			rdev_for_each_list(rdev, tmp, candidates) {
 				list_del_init(&rdev->same_set);
 				if (bind_rdev_to_array(rdev, mddev))
 					export_rdev(rdev);
@@ -3821,7 +3821,7 @@ static void autorun_devices(int part)
 		/* on success, candidates will be empty, on error
 		 * it won't...
 		 */
-		ITERATE_RDEV_GENERIC(candidates,rdev,tmp)
+		rdev_for_each_list(rdev, tmp, candidates)
 			export_rdev(rdev);
 		mddev_put(mddev);
 	}
@@ -4936,7 +4936,7 @@ static void status_unused(struct seq_fil
 
 	seq_printf(seq, "unused devices: ");
 
-	ITERATE_RDEV_PENDING(rdev,tmp) {
+	rdev_for_each_list(rdev, tmp, pending_raid_disks) {
 		char b[BDEVNAME_SIZE];
 		i++;
 		seq_printf(seq, "%s ",

diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h	2008-01-18 11:19:09.0 +1100
+++ ./include/linux/raid/md_k.h	2008-01-18 11:19:24.0 +1100
@@ -313,23 +313,17 @@ static inline char * mdname (mddev_t * m
  * iterates through some rdev ringlist. It's safe to remove the
  * current 'rdev'. Dont touch 'tmp' though.
  */
-#define ITERATE_RDEV_GENERIC(head,rdev,tmp)				\
+#define rdev_for_each_list(rdev, tmp, list)				\
 									\
-	for ((tmp) = (head).next;					\
+	for ((tmp) = (list).next;					\
 		(rdev) = (list_entry((tmp), mdk_rdev_t, same_set)),	\
-		(tmp) = (tmp)->next, (tmp)->prev != &(head)		\
+		(tmp) = (tmp)->next, (tmp)->prev != &(list)		\
 		; )
 /*
  * iterates through the 'same array disks' ringlist
  */
 #define rdev_for_each(rdev, tmp, mddev)				\
-	ITERATE_RDEV_GENERIC((mddev)->disks,rdev,tmp)
-
-/*
- * Iterates through 'pending RAID disks'
- */
-#define ITERATE_RDEV_PENDING(rdev,tmp)					\
-	ITERATE_RDEV_GENERIC(pending_raid_disks,rdev,tmp)
+	rdev_for_each_list(rdev, tmp, (mddev)->disks)
 
 typedef struct mdk_thread_s {
 	void			(*run) (mddev_t *mddev);


[PATCH 002 of 4] md: Allow devices to be shared between md arrays.

2008-01-18 Thread NeilBrown

Currently, a given device is claimed by a particular array so
that it cannot be used by other arrays.

This is not ideal for DDF and other metadata schemes which have
their own partitioning concept.

So for externally managed metadata, just claim the device for
md in general, require that offset and size are set
properly for each device, and make sure that if a device is
included in different arrays then the active sections do
not overlap.

This involves adding another flag to the rdev, which makes it awkward
to set ->flags = 0 to clear certain flags.  So now clear flags
explicitly by name when we want to clear things.

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c   |   88 +++-
 ./include/linux/raid/md_k.h |2 +
 2 files changed, 80 insertions(+), 10 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-01-18 11:03:15.0 +1100
+++ ./drivers/md/md.c	2008-01-18 11:18:04.0 +1100
@@ -774,7 +774,11 @@ static int super_90_validate(mddev_t *md
 	__u64 ev1 = md_event(sb);
 
 	rdev->raid_disk = -1;
-	rdev->flags = 0;
+	clear_bit(Faulty, &rdev->flags);
+	clear_bit(In_sync, &rdev->flags);
+	clear_bit(WriteMostly, &rdev->flags);
+	clear_bit(BarriersNotsupp, &rdev->flags);
+
 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 0;
 		mddev->minor_version = sb->minor_version;
@@ -1154,7 +1158,11 @@ static int super_1_validate(mddev_t *mdd
 	__u64 ev1 = le64_to_cpu(sb->events);
 
 	rdev->raid_disk = -1;
-	rdev->flags = 0;
+	clear_bit(Faulty, &rdev->flags);
+	clear_bit(In_sync, &rdev->flags);
+	clear_bit(WriteMostly, &rdev->flags);
+	clear_bit(BarriersNotsupp, &rdev->flags);
+
 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 1;
 		mddev->patch_version = 0;
@@ -1402,7 +1410,7 @@ static int bind_rdev_to_array(mdk_rdev_t
 		goto fail;
 	}
 	list_add(&rdev->same_set, &mddev->disks);
-	bd_claim_by_disk(rdev->bdev, rdev, mddev->gendisk);
+	bd_claim_by_disk(rdev->bdev, rdev->bdev->bd_holder, mddev->gendisk);
 	return 0;
 
  fail:
@@ -1442,7 +1450,7 @@ static void unbind_rdev_from_array(mdk_r
  * otherwise reused by a RAID array (or any other kernel
  * subsystem), by bd_claiming the device.
  */
-static int lock_rdev(mdk_rdev_t *rdev, dev_t dev)
+static int lock_rdev(mdk_rdev_t *rdev, dev_t dev, int shared)
 {
 	int err = 0;
 	struct block_device *bdev;
@@ -1454,13 +1462,15 @@ static int lock_rdev(mdk_rdev_t *rdev, d
 			__bdevname(dev, b));
 		return PTR_ERR(bdev);
 	}
-	err = bd_claim(bdev, rdev);
+	err = bd_claim(bdev, shared ? (mdk_rdev_t *)lock_rdev : rdev);
 	if (err) {
 		printk(KERN_ERR "md: could not bd_claim %s.\n",
 			bdevname(bdev, b));
 		blkdev_put(bdev);
 		return err;
 	}
+	if (!shared)
+		set_bit(AllReserved, &rdev->flags);
 	rdev->bdev = bdev;
 	return err;
 }
@@ -1925,7 +1935,8 @@ slot_store(mdk_rdev_t *rdev, const char 
 			return -ENOSPC;
 		rdev->raid_disk = slot;
 		/* assume it is working */
-		rdev->flags = 0;
+		clear_bit(Faulty, &rdev->flags);
+		clear_bit(WriteMostly, &rdev->flags);
 		set_bit(In_sync, &rdev->flags);
 	}
 	return len;
@@ -1950,6 +1961,10 @@ offset_store(mdk_rdev_t *rdev, const cha
 		return -EINVAL;
 	if (rdev->mddev->pers)
 		return -EBUSY;
+	if (rdev->size && rdev->mddev->external)
+		/* Must set offset before size, so overlap checks
+		 * can be sane */
+		return -EBUSY;
 	rdev->data_offset = offset;
 	return len;
 }
@@ -1963,16 +1978,69 @@ rdev_size_show(mdk_rdev_t *rdev, char *p
 	return sprintf(page, "%llu\n", (unsigned long long)rdev->size);
 }
 
+static int overlaps(sector_t s1, sector_t l1, sector_t s2, sector_t l2)
+{
+	/* check if two start/length pairs overlap */
+	if (s1+l1 <= s2)
+		return 0;
+	if (s2+l2 <= s1)
+		return 0;
+	return 1;
+}
+
 static ssize_t
 rdev_size_store(mdk_rdev_t *rdev, const char *buf, size_t len)
 {
 	char *e;
 	unsigned long long size = simple_strtoull(buf, &e, 10);
+	unsigned long long oldsize = rdev->size;
 	if (e==buf || (*e && *e != '\n'))
 		return -EINVAL;
 	if (rdev->mddev->pers)
 		return -EBUSY;
 	rdev->size = size;
+	if (size > oldsize && rdev->mddev->external) {
+		/* need to check that all other rdevs with the same ->bdev
+		 * do not overlap.  We need to unlock the mddev to avoid
+		 * a deadlock.  We have already changed rdev->size, and if

[PATCH 000 of 4] md: assorted md patches - please read carefully.

2008-01-18 Thread NeilBrown

Following are 4 patches for md.

The first two replace
   md-allow-devices-to-be-shared-between-md-arrays.patch
which was recently removed.  They should go at the same place in the
series, between
md-allow-a-maximum-extent-to-be-set-for-resyncing.patch
and
md-lock-address-when-changing-attributes-of-component-devices.patch

The third is a replacement for

md-change-iterate_rdev_generic-to-rdev_for_each_list-and-remove-iterate_rdev_pending.patch

which conflicts with the above change.

The last is a fix for
md-fix-an-occasional-deadlock-in-raid5.patch

which makes me a lot happier about this patch.  It introduced a
performance regression and I now understand why.  I'm now happy for
that patch with this fix to go into 2.6.24 if that is convenient (If
not, 2.6.24.1 will do).

Thanks,
NeilBrown


 [PATCH 001 of 4] md: Set and test the ->persistent flag for md devices more
consistently.
 [PATCH 002 of 4] md: Allow devices to be shared between md arrays.
 [PATCH 003 of 4] md: Change ITERATE_RDEV_GENERIC to rdev_for_each_list, and 
remove ITERATE_RDEV_PENDING.
 [PATCH 004 of 4] md: Fix an occasional deadlock in raid5 - FIX


[PATCH 001 of 4] md: Set and test the ->persistent flag for md devices more consistently.

2008-01-18 Thread NeilBrown

If you try to start an array for which the number of raid disks is
listed as zero, md will currently try to read metadata off any devices
that have been given.  This was done because the value of raid_disks
is used to signal whether array details have been provided by
userspace (raid_disks > 0) or must be read from the devices
(raid_disks == 0).

However for an array without persistent metadata (or with externally
managed metadata) this is the wrong thing to do.  So we add a test in
do_md_run to give an error if raid_disks is zero for non-persistent
arrays.

This requires that mddev->persistent is set correctly at this point,
which it currently isn't for in-kernel autodetected arrays.

So set ->persistent for autodetect arrays, and remove the setting in
super_*_validate which is now redundant.

Also clear ->persistent when stopping an array so it is consistently
zero when starting an array.

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-01-18 10:46:49.0 +1100
+++ ./drivers/md/md.c	2008-01-18 11:03:15.0 +1100
@@ -779,7 +779,6 @@ static int super_90_validate(mddev_t *md
 		mddev->major_version = 0;
 		mddev->minor_version = sb->minor_version;
 		mddev->patch_version = sb->patch_version;
-		mddev->persistent = 1;
 		mddev->external = 0;
 		mddev->chunk_size = sb->chunk_size;
 		mddev->ctime = sb->ctime;
@@ -1159,7 +1158,6 @@ static int super_1_validate(mddev_t *mdd
 	if (mddev->raid_disks == 0) {
 		mddev->major_version = 1;
 		mddev->patch_version = 0;
-		mddev->persistent = 1;
 		mddev->external = 0;
 		mddev->chunk_size = le32_to_cpu(sb->chunksize) << 9;
 		mddev->ctime = le64_to_cpu(sb->ctime) & ((1ULL << 32)-1);
@@ -3219,8 +3217,11 @@ static int do_md_run(mddev_t * mddev)
 	/*
 	 * Analyze all RAID superblock(s)
 	 */
-	if (!mddev->raid_disks)
+	if (!mddev->raid_disks) {
+		if (!mddev->persistent)
+			return -EINVAL;
 		analyze_sbs(mddev);
+	}
 
 	chunk_size = mddev->chunk_size;
 
@@ -3627,6 +3628,7 @@ static int do_md_stop(mddev_t * mddev, i
 		mddev->resync_max = MaxSector;
 		mddev->reshape_position = MaxSector;
 		mddev->external = 0;
+		mddev->persistent = 0;
 
 	} else if (mddev->pers)
 		printk(KERN_INFO "md: %s switched to read-only mode.\n",
@@ -3735,6 +3737,7 @@ static void autorun_devices(int part)
 			mddev_unlock(mddev);
 		} else {
 			printk(KERN_INFO "md: created %s\n", mdname(mddev));
+			mddev->persistent = 1;
 			ITERATE_RDEV_GENERIC(candidates,rdev,tmp) {
 				list_del_init(&rdev->same_set);
 				if (bind_rdev_to_array(rdev, mddev))


Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again

2008-01-18 Thread Justin Piszcz



On Fri, 18 Jan 2008, Bill Davidsen wrote:


Justin Piszcz wrote:



On Thu, 17 Jan 2008, Al Boldi wrote:


Justin Piszcz wrote:

On Wed, 16 Jan 2008, Al Boldi wrote:

Also, can you retest using dd with different block-sizes?


I can do this, moment..


I know about oflag=direct but I choose to use dd with sync and measure
the total time it takes.
/usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync'

So I was asked on the mailing list to test dd with various chunk sizes;
here is the length of time it took to write 10 GiB and sync for each chunk size:

4=chunk.txt:0:25.46
8=chunk.txt:0:25.63
16=chunk.txt:0:25.26
32=chunk.txt:0:25.08
64=chunk.txt:0:25.55
128=chunk.txt:0:25.26
256=chunk.txt:0:24.72
512=chunk.txt:0:24.71
1024=chunk.txt:0:25.40
2048=chunk.txt:0:25.71
4096=chunk.txt:0:27.18
8192=chunk.txt:0:29.00
16384=chunk.txt:0:31.43
32768=chunk.txt:0:50.11
65536=chunk.txt:2:20.80


What do you get with bs=512,1k,2k,4k,8k,16k...


Thanks!

--
Al




root      4621  0.0  0.0  12404   760 pts/2    D+   17:53   0:00 mdadm -S /dev/md3

root      4664  0.0  0.0   4264   728 pts/5    S+   17:54   0:00 grep D

Tried to stop it when it was re-syncing, DEADLOCK :(

[  305.464904] md: md3 still in use.
[  314.595281] md: md_do_sync() got signal ... exiting

Anyhow, done testing, time to move data back on if I can kill the resync 
process w/out deadlock.


So does that indicate that there is still a deadlock issue, or that you don't 
have the latest patches installed?


--
Bill Davidsen [EMAIL PROTECTED]
Woe unto the statesman who makes war without a reason that will still
be valid when the war is over... Otto von Bismark 



I was trying to stop the raid when it was building, vanilla 2.6.23.14.

Justin.


Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again

2008-01-18 Thread Greg Cormier
 Also, don't use ext*, XFS can be up to 2-3x faster (in many of the
 benchmarks).

I'm going to swap file systems and give it a shot right now! :)

How is the stability of XFS? I've heard recovery is easier with ext2/3 due to
more people using it, more tools available, etc.

Greg


Re: Raid over 48 disks ... for real now

2008-01-18 Thread michael

Quoting Norman Elton [EMAIL PROTECTED]:


I posed the question a few weeks ago about how to best accommodate
software RAID over an array of 48 disks (a Sun X4500 server, a.k.a.
Thumper). I appreciate all the suggestions.

Well, the hardware is here. It is indeed six Marvell 88SX6081 SATA
controllers, each with eight 1TB drives, for a total raw storage of
48TB. I must admit, it's quite impressive. And loud. More information
about the hardware is available online...

http://www.sun.com/servers/x64/x4500/arch-wp.pdf

It came loaded with Solaris, configured with ZFS. Things seemed to
work fine. I did not do any benchmarks, but I can revert to that
configuration if necessary.

Now I've loaded RHEL onto the box. For a first-shot, I've created one
RAID-5 array (+ 1 spare) on each of the controllers, then used LVM to
create a VolGroup across the arrays.

So now I'm trying to figure out what to do with this space. So far,
I've tested mke2fs on a 1TB and a 5TB LogVol.

I wish RHEL would support XFS/ZFS, but for now, I'm stuck with ext3.
Am I better off sticking with relatively small partitions (2-5 TB), or
should I crank up the block size and go for one big partition?
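
(For concreteness, the per-controller layout described above would look
roughly like this; the device names, array numbering, and volume group name
are illustrative, not taken from the actual box.)

# one RAID-5 set per Marvell controller: 7 active drives + 1 spare out of 8
mdadm --create /dev/md0 --level=5 --raid-devices=7 --spare-devices=1 /dev/sd[a-h]
# ... repeat for /dev/md1 .. /dev/md5 with the other controllers' disks ...

# then one LVM volume group spanning the six arrays
pvcreate /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
vgcreate thumper /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5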


Impressive system. I'm curious as to what the storage drives look like
and how they attach to the server with that many disks.
Sounds like you have some time to play around before shoving it into  
production.

I wonder how long it would take to run an fsck on one large filesystem?

Cheers,
Mike


Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again

2008-01-18 Thread Justin Piszcz



On Fri, 18 Jan 2008, Greg Cormier wrote:


Also, don't use ext*, XFS can be up to 2-3x faster (in many of the
benchmarks).


I'm going to swap file systems and give it a shot right now! :)

How is the stability of XFS? I've heard recovery is easier with ext2/3 due to
more people using it, more tools available, etc.

Greg



Recovery is actually easier with XFS because the filesystem repair code is
built into the kernel (you don't need a utility to fix it) -- however, there
is xfs_repair if the in-kernel part could not fix it.
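
(If the userspace tool is ever needed, the usual sequence is a read-only check
first; /dev/md3 and /r1 below are just the names used elsewhere in this thread.)

umount /r1               # xfs_repair wants the filesystem unmounted
xfs_repair -n /dev/md3   # -n = no-modify mode, only report what it would fix
xfs_repair /dev/md3      # actual repair pass
mount /dev/md3 /r1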


I have been using it for 4-5 years now.

Also, with CoRaids (ATA over Ethernet) many of them are above 8TB, and ext3
only works up to 8TB, so it's not even an option any longer.


Justin.


Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again

2008-01-18 Thread Justin Piszcz



On Fri, 18 Jan 2008, Greg Cormier wrote:


Justin, thanks for the script. Here's my results. I ran it a few times
with different tests, hence the small number of results you see here,
I slowly trimmed out the obvious not-ideal sizes.

Nice, we all love benchmarks!! :)



System
---
Athlon64 3500
2GB RAM
4x500GB WD Raid editions, raid 5. SDE is the old 4-platter version
(5000YS), the others are the 3 platter version. Faster :-)

Ok.



/dev/sdb:
Timing buffered disk reads:  240 MB in  3.00 seconds =  79.91 MB/sec
/dev/sdc:
Timing buffered disk reads:  248 MB in  3.01 seconds =  82.36 MB/sec
/dev/sdd:
Timing buffered disk reads:  248 MB in  3.02 seconds =  82.22 MB/sec
/dev/sde:  (older model, 4 platters instead of 3)
Timing buffered disk reads:  210 MB in  3.01 seconds =  69.87 MB/sec
/dev/md3:
Timing buffered disk reads:  628 MB in  3.00 seconds = 209.09 MB/sec


Testing
---
Test was : dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync
64-chunka.txt:2:00.63
128-chunka.txt:2:00.20
256-chunka.txt:2:01.67
512-chunka.txt:2:19.90
1024-chunka.txt:2:59.32

For your configuration, a 64-256k chunk seems optimal for this (hypothetical)
benchmark :)





Test was : Unraring multipart RAR's, 1.2 gigabytes. Source and dest
drive were the raid array.
64-chunkc.txt:1:04.20
128-chunkc.txt:0:49.37
256-chunkc.txt:0:48.88
512-chunkc.txt:0:41.20
1024-chunkc.txt:0:40.82

1 meg looks like it's the best, which is what I use today; a 1 MiB chunk offers
the best performance by far, at least in all of my testing (with big files),
such as the tests you performed.





So, there's a toss up between 256 and 512.

Yeah for DD performance, not real-life.


If I'm interpreting
correctly here, raw throughput is better with 256, but 512 seems to
work better with real-world stuff? 

Look above, 1 MiB got you the fastest unrar time.


I'll try to think up another test
or two perhaps, and removing 64 as one of the possible options to save
time (mke2fs takes a while on 1.5TB)
Also, don't use ext*, XFS can be up to 2-3x faster (in many of the 
benchmarks).




Next step will be playing with read aheads and stripe cache sizes I
guess! I'm open to any comments/suggestions you guys have!

Greg




Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again

2008-01-18 Thread Bill Davidsen

Justin Piszcz wrote:



On Thu, 17 Jan 2008, Al Boldi wrote:


Justin Piszcz wrote:

On Wed, 16 Jan 2008, Al Boldi wrote:

Also, can you retest using dd with different block-sizes?


I can do this, moment..


I know about oflag=direct but I choose to use dd with sync and measure
the total time it takes.
/usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync'

So I was asked on the mailing list to test dd with various chunk sizes;
here is the length of time it took to write 10 GiB and sync for each chunk size:

4=chunk.txt:0:25.46
8=chunk.txt:0:25.63
16=chunk.txt:0:25.26
32=chunk.txt:0:25.08
64=chunk.txt:0:25.55
128=chunk.txt:0:25.26
256=chunk.txt:0:24.72
512=chunk.txt:0:24.71
1024=chunk.txt:0:25.40
2048=chunk.txt:0:25.71
4096=chunk.txt:0:27.18
8192=chunk.txt:0:29.00
16384=chunk.txt:0:31.43
32768=chunk.txt:0:50.11
65536=chunk.txt:2:20.80


What do you get with bs=512,1k,2k,4k,8k,16k...


Thanks!

--
Al




root      4621  0.0  0.0  12404   760 pts/2    D+   17:53   0:00 mdadm -S /dev/md3

root      4664  0.0  0.0   4264   728 pts/5    S+   17:54   0:00 grep D

Tried to stop it when it was re-syncing, DEADLOCK :(

[  305.464904] md: md3 still in use.
[  314.595281] md: md_do_sync() got signal ... exiting

Anyhow, done testing, time to move data back on if I can kill the 
resync process w/out deadlock.


So does that indicate that there is still a deadlock issue, or that you 
don't have the latest patches installed?


--
Bill Davidsen [EMAIL PROTECTED]
 Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over... Otto von Bismark 





Re: 3ware and erroneous multipathing - duplicate serial numbers (was: 3ware and dmraid)

2008-01-18 Thread Ask Bjørn Hansen


On Jan 18, 2008, at 4:33 AM, Heinz Mauelshagen wrote:

Much later I figured out that dmraid -b reported two of the disks as
being the same:

Looks like the md sync duplicated the metadata and dmraid just spots
that duplication. You gotta remove one of the duplicates to clean this up,
but check first which to pick in case the sync was partial only.



The event counter is the same on both; is that what I should look for?

Is there a way to reset the dmraid metadata?   I'm not actually using  
dmraid, I use regular software raid so I think I just need to reset  
the dmraid data...
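
(For reference, a rough sketch of one way to clean this up, assuming the dmraid
metadata really is a stray copy and nothing depends on the mpath0 map; the
dmraid erase option is quoted from memory, so check the man page before running
anything destructive, and the device/map names are illustrative.)

# tear down the bogus multipath map so md can see both disks directly again
multipath -f mpath0              # or: dmsetup remove mpath0

# list the on-disk metadata dmraid found, then erase it from the stray disk
dmraid -r
dmraid -rE /dev/sdX              # -E / --erase_metadata -- verify against man dmraid first

# keep multipathd from grabbing plain local disks in the future
cat >> /etc/multipath.conf <<'EOF'
blacklist {
        devnode "^sd[a-z]"
}
EOF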



 - ask

--
http://develooper.com/ - http://askask.com/




Re: Raid over 48 disks ... for real now

2008-01-18 Thread Greg Cormier
 I wonder how long it would take to run an fsck on one large filesystem?

:)

I would imagine you'd have time to order a new system, build it, and
restore the backups before the fsck was done!


Re: 3ware and erroneous multipathing - duplicate serial numbers (was: 3ware and dmraid)

2008-01-18 Thread Heinz Mauelshagen
On Fri, Jan 18, 2008 at 03:23:24AM -0800, Ask Bjørn Hansen wrote:

 On Jan 18, 2008, at 3:17 AM, Ask Bjørn Hansen wrote:

 [ Uh, I just realized that I forgot to update the subject line as I figured 
 out what was going on; it's obviously not a software raid problem but a 
 multipath problem ]

 One of my boxes crashed (with a hardware error, I think - CPU and 
 motherboard replacements are on their way).  I booted it up on a rescue 
 disk (Fedora 8) to let the software raid sync up.

 When it was running I noticed that one of the disks were listed as dm-5 
 and ... uh-oh ... there was a disk missing.   I figured out that the 
 multipath stuff for some reason had setup two of the disks as 
 /dev/mapper/mpath0 and now md was syncing to this device.

 Much later I figured out that dmraid -b reported two of the disks as 
 being the same:

Looks like the md sync duplicated the metadata and dmraid just spots
that duplication. You gotta remove one of the duplicates to clean this up
but check first which to pick in case the sync was partial only.

Regards,
Heinz    -- The LVM Guy --



 /dev/sda:976541696 total, W553841781E0A2001842
 /dev/sdb:976773168 total, V600VXZG
 /dev/sdc:586114704 total, U1757241
 /dev/sdd:976773168 total, U1907712
 /dev/sde:976773168 total, U2133609
 /dev/sdf:976773168 total, D2994402
 /dev/sdg:625140335 total, U2130349
 /dev/sdh:976773168 total, U1541228
 /dev/sdi:976771055 total, W5267124
 /dev/sdj:976773168 total, U1409513
 /dev/sdk:976773168 total, U1409513

 Any idea how this could happen?   All 11 disks are on a 3ware 9650 
 controller (the first one is a single 3ware device, the rest are JBOD).  
 I tried rebooting and booting on a Fedora 7 DVD with the same result.

 [ all this of course seems to have messed up my raid10 badly -- more on 
 that tomorrow ]


 - ask

 -- 
 http://develooper.com/ - http://askask.com/


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Heinz Mauelshagen Red Hat GmbH
Consulting Development Engineer   Am Sonnenhang 11
Storage Development   56242 Marienrachdorf
  Germany
[EMAIL PROTECTED]PHONE +49  171 7803392
  FAX   +49 2626 924446
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-


Re: Raid over 48 disks ... for real now

2008-01-18 Thread Norman Elton
It is quite a box. There's a picture of the box with the cover removed
on Sun's website:

http://www.sun.com/images/k3/k3_sunfirex4500_4.jpg

From the X4500 homepage, there's a gallery of additional pictures. The
drives drop in from the top. Massive fans channel air in the small
gaps between the drives. It doesn't look like there's much room
between the disks, but a lot of cold air gets sucked in the front, and
a lot of hot air comes out the back. So it must be doing its job :).

I have not tried an fsck on it yet. I'll probably set up a lot of 2TB
partitions rather than a single large partition, then write the
software to handle storing data across many partitions.
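
(Roughly what that carving-up would look like once the LVM layout is in place;
the volume group and LV names are made up.)

# carve the volume group into many 2TB logical volumes, ext3 on each
for i in $(seq 1 20); do
        lvcreate -L 2T -n data$i thumper
        mke2fs -j -m 0 /dev/thumper/data$i
done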

Norman

On 1/18/08, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 Quoting Norman Elton [EMAIL PROTECTED]:

  I posed the question a few weeks ago about how to best accommodate
  software RAID over an array of 48 disks (a Sun X4500 server, a.k.a.
  Thumper). I appreciate all the suggestions.
 
  Well, the hardware is here. It is indeed six Marvell 88SX6081 SATA
  controllers, each with eight 1TB drives, for a total raw storage of
  48TB. I must admit, it's quite impressive. And loud. More information
  about the hardware is available online...
 
  http://www.sun.com/servers/x64/x4500/arch-wp.pdf
 
  It came loaded with Solaris, configured with ZFS. Things seemed to
  work fine. I did not do any benchmarks, but I can revert to that
  configuration if necessary.
 
  Now I've loaded RHEL onto the box. For a first-shot, I've created one
  RAID-5 array (+ 1 spare) on each of the controllers, then used LVM to
  create a VolGroup across the arrays.
 
  So now I'm trying to figure out what to do with this space. So far,
  I've tested mke2fs on a 1TB and a 5TB LogVol.
 
  I wish RHEL would support XFS/ZFS, but for now, I'm stuck with ext3.
  Am I better off sticking with relatively small partitions (2-5 TB), or
  should I crank up the block size and go for one big partition?

 Impressive system. I'm curious to what the storage drives look like
 and how they attach to the server with that many disks?
 Sounds like you have some time to play around before shoving it into
 production.
 I wonder how long it would take to run an fsck on one large filesystem?

 Cheers,
 Mike



Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again

2008-01-18 Thread Greg Cormier
Justin, thanks for the script. Here's my results. I ran it a few times
with different tests, hence the small number of results you see here,
I slowly trimmed out the obvious not-ideal sizes.

System
---
Athlon64 3500
2GB RAM
4x500GB WD Raid editions, raid 5. SDE is the old 4-platter version
(5000YS), the others are the 3 platter version. Faster :-)

/dev/sdb:
 Timing buffered disk reads:  240 MB in  3.00 seconds =  79.91 MB/sec
/dev/sdc:
 Timing buffered disk reads:  248 MB in  3.01 seconds =  82.36 MB/sec
/dev/sdd:
 Timing buffered disk reads:  248 MB in  3.02 seconds =  82.22 MB/sec
/dev/sde:  (older model, 4 platters instead of 3)
 Timing buffered disk reads:  210 MB in  3.01 seconds =  69.87 MB/sec
/dev/md3:
 Timing buffered disk reads:  628 MB in  3.00 seconds = 209.09 MB/sec


Testing
---
Test was : dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync
64-chunka.txt:2:00.63
128-chunka.txt:2:00.20
256-chunka.txt:2:01.67
512-chunka.txt:2:19.90
1024-chunka.txt:2:59.32
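
(The script itself isn't quoted in the thread; presumably each pass looked
something like the sketch below. The chunk sizes, the 10 GiB dd, and the output
file naming come from the results above; the exact devices, partitions,
filesystem, and mount point are assumptions.)

for c in 64 128 256 512 1024; do
        mdadm --stop /dev/md3
        mdadm --create /dev/md3 --level=5 --raid-devices=4 --chunk=$c \
              /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
        # wait for the initial build to finish (watch /proc/mdstat) before timing
        mke2fs -j /dev/md3
        mount /dev/md3 /r1
        /usr/bin/time -f %E -o ~/$c-chunka.txt \
              bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync'
        umount /r1
done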


Test was : Unraring multipart RAR's, 1.2 gigabytes. Source and dest
drive were the raid array.
64-chunkc.txt:1:04.20
128-chunkc.txt:0:49.37
256-chunkc.txt:0:48.88
512-chunkc.txt:0:41.20
1024-chunkc.txt:0:40.82



So, there's a toss-up between 256 and 512. If I'm interpreting
correctly here, raw throughput is better with 256, but 512 seems to
work better with real-world stuff? I'll try to think up another test
or two perhaps, and remove 64 as one of the possible options to save
time (mke2fs takes a while on 1.5TB).

Next step will be playing with read aheads and stripe cache sizes I
guess! I'm open to any comments/suggestions you guys have!
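
(The usual knobs for that are below; the numbers are just starting points to
sweep, not recommendations, and /dev/md3 is the array name used earlier in the
thread.)

# read-ahead on the md device, in 512-byte sectors
blockdev --setra 16384 /dev/md3
blockdev --getra /dev/md3

# raid5 stripe cache, in pages per device; larger usually helps sequential writes
echo 8192 > /sys/block/md3/md/stripe_cache_size
cat /sys/block/md3/md/stripe_cache_size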

Greg


Re: Raid over 48 disks ... for real now

2008-01-18 Thread Jon Lewis

On Thu, 17 Jan 2008, Janek Kozicki wrote:


I wish RHEL would support XFS/ZFS, but for now, I'm stuck with ext3.


there is ext4 (or ext4dev) - it's ext3 modified to support a 1024 PB
(1048576 TB) filesystem size. You could check whether it's feasible.
Personally I'd always stick with ext2/ext3/ext4, since it is most widely
used and thus has the best recovery tools.


Something else to keep in mind: XFS repair tools require large amounts
of memory.  If you were to create one or a few really huge fs's on this
array, you might end up with fs's which can't be repaired because you
don't have, or can't even get, a machine with enough RAM for the job... not
to mention the amount of time it would take.


--
 Jon Lewis   |  I route
 Senior Network Engineer |  therefore you are
 Atlantic Net|
_ http://www.lewis.org/~jlewis/pgp for PGP public key_