Re: LVM performance (was: Re: RAID5 to RAID6 reshape?)

2008-02-19 Thread Jon Nelson
On Feb 19, 2008 1:41 PM, Oliver Martin
<[EMAIL PROTECTED]> wrote:
> Janek Kozicki schrieb:
> > hold on. This might be related to raid chunk positioning with respect
> > to LVM chunk positioning. If they interfere there indeed may be some
> > performance drop. Best to make sure that those chunks are aligned together.
>
> Interesting. I'm seeing a 20% performance drop too, with default RAID
> and LVM chunk sizes of 64K and 4M, respectively. Since 64K divides 4M
> evenly, I'd think there shouldn't be such a big performance penalty.
> It's not like I care that much, I only have 100 Mbps ethernet anyway.
> I'm just wondering...
>
> $ hdparm -t /dev/md0
>
> /dev/md0:
>   Timing buffered disk reads:  148 MB in  3.01 seconds =  49.13 MB/sec
>
> $ hdparm -t /dev/dm-0
>
> /dev/dm-0:
>   Timing buffered disk reads:  116 MB in  3.04 seconds =  38.20 MB/sec

I'm getting better performance on a LV than on the underlying MD:

# hdparm -t /dev/md0

/dev/md0:
 Timing buffered disk reads:  408 MB in  3.01 seconds = 135.63 MB/sec
# hdparm -t /dev/raid/multimedia

/dev/raid/multimedia:
 Timing buffered disk reads:  434 MB in  3.01 seconds = 144.04 MB/sec
#

md0 is a 3-disk raid5 (64k chunk, alg. 2, with a bitmap), built from
7200rpm sata drives from several manufacturers.
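
(A guess as to why the numbers differ in either direction: hdparm -t is
sensitive to the block device's read-ahead setting, and the dm device on
top of md often has a different one. Worth checking before reading too
much into the comparison; a sketch, device names as above:)

# Read-ahead is reported in 512-byte sectors, per block device:
blockdev --getra /dev/md0
blockdev --getra /dev/dm-0

# Make them match and re-run hdparm -t; 1024 is only an example value.
blockdev --setra 1024 /dev/dm-0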



-- 
Jon


Re: raid10 on three discs - few questions.

2008-02-06 Thread Jon Nelson
On Feb 6, 2008 12:43 PM, Bill Davidsen <[EMAIL PROTECTED]> wrote:

> Can you create a raid10 with one drive "missing" and add it later? I
> know, I should try it when I get a machine free... but I'm being lazy today.

Yes you can. With 3 drives, however, performance will be awful (at
least with layout far, 2 copies).

IMO raid10,f2 is a great balance of speed and redundancy.
It's faster than raid5 for reading and about the same for writing; it's
even potentially faster than raid0 for reading.
With 3 disks one should be able to get 3.0 times the speed of one
disk, or slightly more, and each stripe involves only *one* disk
instead of 2 as it does with raid5.
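
(For anyone wanting to try the one-drive-missing case, a sketch; the
partitions here are placeholders:)

# 3-disk raid10 with two 'far' copies, created with one member missing:
mdadm --create /dev/md2 --level=raid10 --raid-devices=3 --layout=f2 \
      /dev/sdb3 /dev/sdc3 missing

# Add the third drive later and let it rebuild:
mdadm /dev/md2 --add /dev/sdd3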

-- 
Jon


Re: raid10 on three discs - few questions.

2008-02-03 Thread Jon Nelson
On Feb 3, 2008 5:29 PM, Janek Kozicki <[EMAIL PROTECTED]> wrote:
> Neil Brown said: (by the date of Mon, 4 Feb 2008 10:11:27 +1100)
>
> wow, thanks for quick reply :)
>
> > > 3. Another thing - would raid10,far=2 work when three drives are used?
> > >Would it increase the read performance?
> >
> > Yes.
>
> is far=2 the most I could do to squeeze every possible MB/sec
> performance in raid10 on three discs ?

In my opinion, yes. Its sequential read performance is at, or better
than, raid0's. Writing is slower, about the speed of a
single disk, give or take.  The other two raid10 layouts (near and
offset) are very close in performance to each other - nearly identical
for reading/writing.

-- 
Jon


stopped array, but /sys/block/mdN still exists.

2008-01-02 Thread Jon Nelson
This isn't a high priority issue or anything, but I'm curious:

I --stop(ped) an array but /sys/block/md2 remained largely populated.
Is that intentional?

-- 
Jon


raid10 performance question

2007-12-23 Thread Jon Nelson
I've found in some tests that raid10,f2 gives me the best I/O of any
raid5 or raid10 format. However, the performance of raid10,o2 and
raid10,n2 in degraded mode is nearly identical to the non-degraded
mode performance (for me, this hovers around 100MB/s).  raid10,f2's
degraded-mode write performance is indistinguishable from its
non-degraded performance. It's the raid10,f2 *read* performance in
degraded mode that is strange - I get almost exactly 50% of the
non-degraded read performance. Why is that?


-- 
Jon


Re: raid10: unfair disk load?

2007-12-23 Thread Jon Nelson
On 12/23/07, maobo <[EMAIL PROTECTED]> wrote:
> Hi,all
>
> Yes, I agree some of you. But in my test both using real life trace and
> Iometer test I found that for absolutely read requests, RAID0 is better than
> RAID10 (with same data disks: 3 disks in RAID0, 6 disks in RAID10). I don't
> know why this happen.
>
> I read the code of RAID10 and RAID0 carefully and experiment with printk to
> track the process flow. The only conclusion I report is the complexity of
> RAID10 to process the read request. While for RAID0 it is so simple that it
> does the read more effectively.
>
> How do you think about this of absolutely read requests?
> Thank you very much!

My own tests on identical hardware (same mobo, disks, partitions,
everything) and same software, with the only difference being how
mdadm is invoked (level and, possibly, layout), show that raid0 is
about 15% faster on reads than the very fast raid10,f2 layout.
raid10,f2 is approx. 50% of the write speed of raid0.

Does this make sense?

-- 
Jon


Re: raid10: unfair disk load?

2007-12-22 Thread Jon Nelson
On 12/22/07, Janek Kozicki <[EMAIL PROTECTED]> wrote:
> Michael Tokarev said: (by the date of Fri, 21 Dec 2007 23:56:09 +0300)
>
> > Janek Kozicki wrote:
> > > what's your kernel version? I recall that recently there have been
> > > some works regarding load balancing.
> >
> > It was in my original email:
> > The kernel is 2.6.23
> >
> > Strange I missed the new raid10 development you
> > mentioned (I follow linux-raid quite closely).
> > What change(s) you're referring to?
>
> oh sorry it was a patch for raid1, not raid10:
>
>   http://www.spinics.net/lists/raid/msg17708.html
>
> I'm wondering if it could be adapted for raid10 ...
>
> Konstantin Sharlaimov said: (by the date of Sat, 03 Nov 2007
> 20:08:42 +1000)
>
> > This patch adds RAID1 read balancing to device mapper. A read operation
> > that is close (in terms of sectors) to a previous read or write goes to
> > the same mirror.

Looking at the raid10 source, it already appears to do some read
balancing.
For raid10,f2 on a 3-drive array I've found really impressive
performance numbers - as good as raid0. Write speeds are a bit lower,
but still rather better than raid5 on the same devices.

-- 
Jon


Re: mdadm --stop goes off and never comes back?

2007-12-22 Thread Jon Nelson
On 12/22/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> On Wednesday December 19, [EMAIL PROTECTED] wrote:
> > On 12/19/07, Jon Nelson <[EMAIL PROTECTED]> wrote:
> > > On 12/19/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> > > > On Tuesday December 18, [EMAIL PROTECTED] wrote:
> > > > >
> > > > > I tried to stop the array:
> > > > >
> > > > > mdadm --stop /dev/md2
> > > > >
> > > > > and mdadm never came back. It's off in the kernel somewhere. :-(
>
> Looking at your stack traces, you have the "mdadm -S" holding
> an md lock and trying to get a sysfs lock as part of tearing down the
> array, and 'hald' is trying to read some attribute in
> /sys/block/md
> and is holding the sysfs lock and trying to get the md lock.
> A classic AB-BA deadlock.
>
> >
> > NOTE: kernel is stock openSUSE 10.3 kernel, x86_64, 2.6.22.13-0.3-default.
> >
>
> It is fixed in mainline with some substantial changes to sysfs.
> I don't imagine they are likely to get back ported to openSUSE, but
> you could try logging a bugzilla if you like.

Nah - I'm eagerly awaiting new kernels anyway as I have some network
cards that work much better (read: they work) with 2.6.24rc3+.

> The 'hald' process is interruptible and killing it would release the
> deadlock.

Cool.

> I suspect you have to be fairly unlucky to lose the race but it is
> obviously quite possible.

Sometimes we are all a little unlucky. In my case, it cost me a reboot
or, in others, nothing at all. Fortunately this was not a production
system with lots of users.

> I don't think there is anything I can do on the md side to avoid the
> bug.

In this situation I don't think such a change would be warranted anyway.
Thanks again for looking at this. I'm a big believer in the 'canary in
a coal mine' mentality - some problems may be indications of much more
serious issues - but in this case it would appear that the issue has
already been taken care of. Happy Holidays.

-- 
Jon


Re: mdadm --stop goes off and never comes back?

2007-12-19 Thread Jon Nelson
On 12/19/07, Jon Nelson <[EMAIL PROTECTED]> wrote:
> On 12/19/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> > On Tuesday December 18, [EMAIL PROTECTED] wrote:
> > > This just happened to me.
> > > Create raid with:
> > >
> > > mdadm --create /dev/md2 --level=raid10 --raid-devices=3
> > > --spare-devices=0 --layout=o2 /dev/sdb3 /dev/sdc3 /dev/sdd3
> > >
> > > cat /proc/mdstat
> > >
> > > md2 : active raid10 sdd3[2] sdc3[1] sdb3[0]
> > >   5855424 blocks 64K chunks 2 offset-copies [3/3] [UUU]
> > >   [==>..]  resync = 14.6% (859968/5855424)
> > > finish=1.3min speed=61426K/sec
> > >
> > > Some log messages:
> > >
> > > Dec 18 15:02:28 turnip kernel: md: md2: raid array is not clean --
> > > starting background reconstruction
> > > Dec 18 15:02:28 turnip kernel: raid10: raid set md2 active with 3 out
> > > of 3 devices
> > > Dec 18 15:02:28 turnip kernel: md: resync of RAID array md2
> > > Dec 18 15:02:28 turnip kernel: md: minimum _guaranteed_  speed: 1000
> > > KB/sec/disk.
> > > Dec 18 15:02:28 turnip kernel: md: using maximum available idle IO
> > > bandwidth (but not more than 20 KB/sec) for resync.
> > > Dec 18 15:02:28 turnip kernel: md: using 128k window, over a total of
> > > 5855424 blocks.
> > > Dec 18 15:03:36 turnip kernel: md: md2: resync done.
> > > Dec 18 15:03:36 turnip kernel: md: checkpointing resync of md2.
> > >
> > > I tried to stop the array:
> > >
> > > mdadm --stop /dev/md2
> > >
> > > and mdadm never came back. It's off in the kernel somewhere. :-(
> > >
> > > kill, of course, has no effect.
> > > The machine still runs fine, the rest of the raids (md0 and md1) work
> > > fine (same disks).
> > >
> > > The output (snipped, only mdadm) of 'echo t > /proc/sysrq-trigger'
> > >
> > > Dec 18 15:09:13 turnip kernel: mdadm S 0001e5359fa38fb0 0
> > > 3943  1 (NOTLB)
> > > Dec 18 15:09:13 turnip kernel:  810033e7ddc8 0086
> > >  0092
> > > Dec 18 15:09:13 turnip kernel:  0fc7 810033e7dd78
> > > 80617800 80617800
> > > Dec 18 15:09:13 turnip kernel:  8061d210 80617800
> > > 80617800 
> > > Dec 18 15:09:13 turnip kernel: Call Trace:
> > > Dec 18 15:09:13 turnip kernel:  []
> > > __mutex_lock_interruptible_slowpath+0x8b/0xca
> > > Dec 18 15:09:13 turnip kernel:  [] do_open+0x222/0x2a5
> > > Dec 18 15:09:13 turnip kernel:  [] 
> > > md_seq_show+0x127/0x6c1
> > > Dec 18 15:09:13 turnip kernel:  [] vma_merge+0x141/0x1ee
> > > Dec 18 15:09:13 turnip kernel:  [] seq_read+0x1bf/0x28b
> > > Dec 18 15:09:13 turnip kernel:  [] vfs_read+0xcb/0x153
> > > Dec 18 15:09:13 turnip kernel:  [] sys_read+0x45/0x6e
> > > Dec 18 15:09:13 turnip kernel:  [] system_call+0x7e/0x83
> > >
> > >
> > >
> > > What happened? Is there any debug info I can provide before I reboot?
> >
> > Don't know very odd.
> >
> > The rest of the 'sysrq' output would possibly help.
>
> Does this help? It's the same syscall and args, I think, as above.
>
> Dec 18 15:09:13 turnip kernel: hald  S 0001e52f4793e397 0
> 3040  1 (NOTLB)
> Dec 18 15:09:13 turnip kernel:  81003aa51e38 0086
>  802
> 68ee6
> Dec 18 15:09:13 turnip kernel:  81002a97e5c0 81003aa51de8
> 80617800 806
> 17800
> Dec 18 15:09:13 turnip kernel:  8061d210 80617800
> 80617800 810
> 0bb48
> Dec 18 15:09:13 turnip kernel: Call Trace:
> Dec 18 15:09:13 turnip kernel:  []
> get_page_from_freelist+0x3c4/0x545
> Dec 18 15:09:13 turnip kernel:  []
> __mutex_lock_interruptible_slowpath+0x8b/
> 0xca
> Dec 18 15:09:13 turnip kernel:  [] md_attr_show+0x2f/0x64
> Dec 18 15:09:13 turnip kernel:  [] 
> sysfs_read_file+0xb3/0x111
> Dec 18 15:09:13 turnip kernel:  [] vfs_read+0xcb/0x153
> Dec 18 15:09:13 turnip kernel:  [] sys_read+0x45/0x6e
> Dec 18 15:09:13 turnip kernel:  [] system_call+0x7e/0x83
> Dec 18 15:09:13 turnip kernel:

NOTE: kernel is stock openSUSE 10.3 kernel, x86_64, 2.6.22.13-0.3-default.


-- 
Jon


Re: mdadm --stop goes off and never comes back?

2007-12-19 Thread Jon Nelson
On 12/19/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> On Tuesday December 18, [EMAIL PROTECTED] wrote:
> > This just happened to me.
> > Create raid with:
> >
> > mdadm --create /dev/md2 --level=raid10 --raid-devices=3
> > --spare-devices=0 --layout=o2 /dev/sdb3 /dev/sdc3 /dev/sdd3
> >
> > cat /proc/mdstat
> >
> > md2 : active raid10 sdd3[2] sdc3[1] sdb3[0]
> >   5855424 blocks 64K chunks 2 offset-copies [3/3] [UUU]
> >   [==>..]  resync = 14.6% (859968/5855424)
> > finish=1.3min speed=61426K/sec
> >
> > Some log messages:
> >
> > Dec 18 15:02:28 turnip kernel: md: md2: raid array is not clean --
> > starting background reconstruction
> > Dec 18 15:02:28 turnip kernel: raid10: raid set md2 active with 3 out
> > of 3 devices
> > Dec 18 15:02:28 turnip kernel: md: resync of RAID array md2
> > Dec 18 15:02:28 turnip kernel: md: minimum _guaranteed_  speed: 1000
> > KB/sec/disk.
> > Dec 18 15:02:28 turnip kernel: md: using maximum available idle IO
> > bandwidth (but not more than 20 KB/sec) for resync.
> > Dec 18 15:02:28 turnip kernel: md: using 128k window, over a total of
> > 5855424 blocks.
> > Dec 18 15:03:36 turnip kernel: md: md2: resync done.
> > Dec 18 15:03:36 turnip kernel: md: checkpointing resync of md2.
> >
> > I tried to stop the array:
> >
> > mdadm --stop /dev/md2
> >
> > and mdadm never came back. It's off in the kernel somewhere. :-(
> >
> > kill, of course, has no effect.
> > The machine still runs fine, the rest of the raids (md0 and md1) work
> > fine (same disks).
> >
> > The output (snipped, only mdadm) of 'echo t > /proc/sysrq-trigger'
> >
> > Dec 18 15:09:13 turnip kernel: mdadm S 0001e5359fa38fb0 0
> > 3943  1 (NOTLB)
> > Dec 18 15:09:13 turnip kernel:  810033e7ddc8 0086
> >  0092
> > Dec 18 15:09:13 turnip kernel:  0fc7 810033e7dd78
> > 80617800 80617800
> > Dec 18 15:09:13 turnip kernel:  8061d210 80617800
> > 80617800 
> > Dec 18 15:09:13 turnip kernel: Call Trace:
> > Dec 18 15:09:13 turnip kernel:  []
> > __mutex_lock_interruptible_slowpath+0x8b/0xca
> > Dec 18 15:09:13 turnip kernel:  [] do_open+0x222/0x2a5
> > Dec 18 15:09:13 turnip kernel:  [] md_seq_show+0x127/0x6c1
> > Dec 18 15:09:13 turnip kernel:  [] vma_merge+0x141/0x1ee
> > Dec 18 15:09:13 turnip kernel:  [] seq_read+0x1bf/0x28b
> > Dec 18 15:09:13 turnip kernel:  [] vfs_read+0xcb/0x153
> > Dec 18 15:09:13 turnip kernel:  [] sys_read+0x45/0x6e
> > Dec 18 15:09:13 turnip kernel:  [] system_call+0x7e/0x83
> >
> >
> >
> > What happened? Is there any debug info I can provide before I reboot?
>
> Don't know very odd.
>
> The rest of the 'sysrq' output would possibly help.

Does this help? It's the same syscall and args, I think, as above.

Dec 18 15:09:13 turnip kernel: hald  S 0001e52f4793e397 0
3040  1 (NOTLB)
Dec 18 15:09:13 turnip kernel:  81003aa51e38 0086
 802
68ee6
Dec 18 15:09:13 turnip kernel:  81002a97e5c0 81003aa51de8
80617800 806
17800
Dec 18 15:09:13 turnip kernel:  8061d210 80617800
80617800 810
0bb48
Dec 18 15:09:13 turnip kernel: Call Trace:
Dec 18 15:09:13 turnip kernel:  []
get_page_from_freelist+0x3c4/0x545
Dec 18 15:09:13 turnip kernel:  []
__mutex_lock_interruptible_slowpath+0x8b/
0xca
Dec 18 15:09:13 turnip kernel:  [] md_attr_show+0x2f/0x64
Dec 18 15:09:13 turnip kernel:  [] sysfs_read_file+0xb3/0x111
Dec 18 15:09:13 turnip kernel:  [] vfs_read+0xcb/0x153
Dec 18 15:09:13 turnip kernel:  [] sys_read+0x45/0x6e
Dec 18 15:09:13 turnip kernel:  [] system_call+0x7e/0x83
Dec 18 15:09:13 turnip kernel:


-- 
Jon


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Jon Nelson
On 12/19/07, Michal Soltys <[EMAIL PROTECTED]> wrote:
> Justin Piszcz wrote:
> >
> > Or is there a better way to do this, does parted handle this situation
> > better?
> >
> > What is the best (and correct) way to calculate stripe-alignment on the
> > RAID5 device itself?
> >
> >
> > Does this also apply to Linux/SW RAID5?  Or are there any caveats that
> > are not taken into account since it is based in SW vs. HW?
> >
> > ---
>
> In case of SW or HW raid, when you place a raid-aware filesystem directly on
> it, I don't see any potential problems.
>
> Also, if md's superblock version/placement actually mattered, it'd be pretty
> strange. The space available for actual use - be it partitions or filesystem
> directly - should always be nicely aligned. I don't know that for sure though.
>
> If you use SW partitionable raid, or HW raid with partitions, then you would
> have to align it on a chunk boundary manually. Any self-respecting OS
> shouldn't complain that a partition doesn't start on a cylinder boundary these
> days. LVM can complicate life a bit too - if you want its volumes to be
> chunk-aligned.

That, for me, is the next question - how can one educate LVM about the
underlying block device so that logical volumes carved out of that
space align properly? Many of us have experienced 30% (or so)
performance losses for the convenience of LVM (and mighty convenient
it is).
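
(One approach, sketched with made-up VG/LV names; --dataalignment needs a
reasonably recent LVM2, and the 128k figure assumes a 3-disk raid5 with
64k chunks, i.e. a 128k data stripe:)

# Start the PV's data area on a full-stripe boundary and keep the extent
# size a multiple of the stripe width.
pvcreate --dataalignment 128k /dev/md0
vgcreate -s 4m vgraid /dev/md0
lvcreate -L 100G -n lvdata vgraid

# Verify where the data area actually begins:
pvs -o +pe_start /dev/md0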


-- 
Jon


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Jon Nelson
On 12/19/07, Bill Davidsen <[EMAIL PROTECTED]> wrote:
> As other posts have detailed, putting the partition on a 64k aligned
> boundary can address the performance problems. However, a poor choice of
> chunk size, cache_buffer size, or just random i/o in small sizes can eat
> up a lot of the benefit.
>
> I don't think you need to give up your partitions to get the benefit of
> alignment.

How might that benefit be realized?
Assume I have 3 disks, /dev/sd{b,c,d} all partitioned identically with
4 partitions, and I want to use /dev/sd{b,c,d}3 for a new SW raid.

What sequence of steps can I take to ensure that my raid is aligned on
a 64K boundary?
What effect do the different superblock formats have, if any, in this situation?
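
(For what it's worth, the simplest recipe I know of is to partition in
sector units and start every partition on a multiple of 128 sectors
(128 x 512B = 64KiB), instead of the traditional sector 63. A sketch:)

# Work in sector units and choose start sectors that are multiples of 128:
fdisk -u /dev/sdb

# Verify the resulting start sectors:
fdisk -lu /dev/sdb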

-- 
Jon


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Jon Nelson
On 12/19/07, Justin Piszcz <[EMAIL PROTECTED]> wrote:
>
>
> On Wed, 19 Dec 2007, Mattias Wadenstein wrote:
> > From that setup it seems simple, scrap the partition table and use the
> > disk device for raid. This is what we do for all data storage disks
> > (hw raid) and sw raid members.
> >
> > /Mattias Wadenstein
> >
>
> Is there any downside to doing that?  I remember when I had to take my

There is one (just pointed out to me yesterday): having the partition
and having it labeled as raid makes identification quite a bit easier
for humans and software, too.

-- 
Jon


Re: Raid over 48 disks

2007-12-18 Thread Jon Nelson
On 12/18/07, Thiemo Nagel <[EMAIL PROTECTED]> wrote:
> >> Performance of the raw device is fair:
> >> # dd if=/dev/md2 of=/dev/zero bs=128k count=64k
> >> 8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s
> >>
> >> Somewhat less through ext3 (created with -E stride=64):
> >> # dd if=largetestfile of=/dev/zero bs=128k count=64k
> >> 8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s
> >
> > Quite slow?
> >
> > 10 disks (raptors) raid 5 on regular sata controllers:
> >
> > # dd if=/dev/md3 of=/dev/zero bs=128k count=64k
> > 8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s
> >
> > # dd if=bigfile of=/dev/zero bs=128k count=64k
> > 3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s
>
> Interesting.  Any ideas what could be the reason?  How much do you get
> from a single drive?  -- The Samsung HD501LJ that I'm using gives
> ~84MB/s when reading from the beginning of the disk.
>
> With RAID 5 I'm getting slightly better results (though I really wonder
> why, since naively I would expect identical read performance) but that
> does only account for a small part of the difference:
>
> chunk size    16k read          64k write
>               RAID 5   RAID 6   RAID 5   RAID 6
> 128k          492      497      268      270
> 256k          615      530      288      270
> 512k          625      607      230      174
> 1024k         650      620      170      75

It strikes me that these numbers are meaningless without knowing if
that is actual data-to-disk or data-to-memcache-and-some-to-disk-too.
Later versions of 'dd' offer 'conv=fdatasync', which is really handy
(it calls fdatasync on the output file, syncing JUST the one file, right
before close). Otherwise, oflag=direct will (try to) bypass the
page/block cache.

I can get really impressive numbers, too (over 200MB/s on a single
disk capable of 70MB/s) when I (mis)use dd without fdatasync, et al.

The variation in reported performance can be really huge without
understanding that you aren't actually testing the DISK I/O but *some*
disk I/O and *some* memory caching.
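
(Concretely, something along these lines; the file name and sizes are
only examples:)

# Write test that includes flushing the one output file before dd exits:
dd if=/dev/zero of=/mnt/test/bigfile bs=128k count=64k conv=fdatasync

# Or bypass the page cache on writes entirely:
dd if=/dev/zero of=/mnt/test/bigfile bs=128k count=64k oflag=direct

# Read test that bypasses the cache:
dd if=/mnt/test/bigfile of=/dev/null bs=128k iflag=direct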




-- 
Jon


mdadm --stop goes off and never comes back?

2007-12-18 Thread Jon Nelson
This just happened to me.
Create raid with:

mdadm --create /dev/md2 --level=raid10 --raid-devices=3
--spare-devices=0 --layout=o2 /dev/sdb3 /dev/sdc3 /dev/sdd3

cat /proc/mdstat

md2 : active raid10 sdd3[2] sdc3[1] sdb3[0]
  5855424 blocks 64K chunks 2 offset-copies [3/3] [UUU]
  [==>..]  resync = 14.6% (859968/5855424)
finish=1.3min speed=61426K/sec

Some log messages:

Dec 18 15:02:28 turnip kernel: md: md2: raid array is not clean --
starting background reconstruction
Dec 18 15:02:28 turnip kernel: raid10: raid set md2 active with 3 out
of 3 devices
Dec 18 15:02:28 turnip kernel: md: resync of RAID array md2
Dec 18 15:02:28 turnip kernel: md: minimum _guaranteed_  speed: 1000
KB/sec/disk.
Dec 18 15:02:28 turnip kernel: md: using maximum available idle IO
bandwidth (but not more than 20 KB/sec) for resync.
Dec 18 15:02:28 turnip kernel: md: using 128k window, over a total of
5855424 blocks.
Dec 18 15:03:36 turnip kernel: md: md2: resync done.
Dec 18 15:03:36 turnip kernel: md: checkpointing resync of md2.

I tried to stop the array:

mdadm --stop /dev/md2

and mdadm never came back. It's off in the kernel somewhere. :-(

kill, of course, has no effect.
The machine still runs fine, the rest of the raids (md0 and md1) work
fine (same disks).

The output (snipped, only mdadm) of 'echo t > /proc/sysrq-trigger'

Dec 18 15:09:13 turnip kernel: mdadm S 0001e5359fa38fb0 0
3943  1 (NOTLB)
Dec 18 15:09:13 turnip kernel:  810033e7ddc8 0086
 0092
Dec 18 15:09:13 turnip kernel:  0fc7 810033e7dd78
80617800 80617800
Dec 18 15:09:13 turnip kernel:  8061d210 80617800
80617800 
Dec 18 15:09:13 turnip kernel: Call Trace:
Dec 18 15:09:13 turnip kernel:  []
__mutex_lock_interruptible_slowpath+0x8b/0xca
Dec 18 15:09:13 turnip kernel:  [] do_open+0x222/0x2a5
Dec 18 15:09:13 turnip kernel:  [] md_seq_show+0x127/0x6c1
Dec 18 15:09:13 turnip kernel:  [] vma_merge+0x141/0x1ee
Dec 18 15:09:13 turnip kernel:  [] seq_read+0x1bf/0x28b
Dec 18 15:09:13 turnip kernel:  [] vfs_read+0xcb/0x153
Dec 18 15:09:13 turnip kernel:  [] sys_read+0x45/0x6e
Dec 18 15:09:13 turnip kernel:  [] system_call+0x7e/0x83



What happened? Is there any debug info I can provide before I reboot?



-- 
Jon


Re: Reading takes 100% precedence over writes for mdadm+raid5?

2007-12-08 Thread Jon Nelson
This is what dstat shows me copying lots of large files about (ext3),
one file at a time.
I've benchmarked the raid itself around 65-70 MB/s maximum actual
write I/O so this 3-4MB/s stuff is pretty bad.

I should note that ALL other I/O suffers horribly, even on other filesystems.
What might the cause be?

I should note: going larger with stripe_cache_size (384 and 512),
performance stays the same; going smaller (128), performance
*increases* and holds more steadily at 10-13 MB/s.


total-cpu-usage --dsk/sda-- --dsk/sdb-- --dsk/sdc-- --dsk/sdd-- -dsk/total->
usr sys idl wai hiq siq| read  writ: read  writ: read  writ: read  writ: read  writ>
  1   1  95   3   0   0|  12k 4261B: 106k  125k:  83k  110k:  83k  110k: 283k  348k>
  0   5   0  91   1   2|   0 0 :2384k 4744k:2612k 4412k:2336k 4804k:7332k   14M>
  0   4   0  91   1   3|   0 0 :2352k 4964k:2392k 4812k:2620k 4764k:7364k   14M>
  0   4   0  92   1   3|   0 0 :1068k 3524k:1336k 3184k:1360k 2912k:3764k 9620k>
  0   4   0  92   1   2|   0 0 :2304k 2612k:2128k 2484k:2332k 3028k:6764k 8124k>
  0   4   0  92   1   2|   0 0 :1584k 3428k:1252k 3992k:1592k 3416k:4428k   11M>
  0   3   0  93   0   2|   0 0 :1400k 2364k:1424k 2700k:1584k 2592k:4408k 7656k>
  0   4   0  93   1   2|   0 0 :1764k 3084k:1820k 2972k:1796k 2396k:5380k 8452k>
  0   4   0  92   2   3|   0 0 :1984k 3736k:1772k 4024k:1792k 4524k:5548k   12M>
  0   4   0  93   1   2|   0 0 :1852k 3860k:1840k 3408k:1696k 3648k:5388k   11M>
  0   4   0  93   0   2|   0 0 :1328k 2500k:1640k 2348k:1672k 2128k:4640k 6976k>
  0   4   0  92   0   4|   0 0 :1624k 3944k:2080k 3432k:1760k 3704k:5464k   11M>
  0   1   0  97   1   2|   0 0 :1480k 1340k: 976k 1564k:1268k 1488k:3724k 4392k>
  0   4   0  92   1   2|   0 0 :1320k 2676k:1608k 2548k: 968k 2572k:3896k 7796k>
  0   2   0  96   1   1|   0 0 :1856k 1808k:1752k 1988k:1752k 1600k:5360k 5396k>
  0   4   0  92   2   1|   0 0 :1360k 2560k:1240k 2788k:1580k 2940k:4180k 8288k>
  0   2   0  97   1   2|   0 0 :1928k 1456k:1628k 2080k:1488k 2308k:5044k 5844k>
  1   3   0  94   2   2|   0 0 :1432k 2156k:1320k 1840k: 936k 1072k:3688k 5068k>
  0   3   0  93   2   2|   0 0 :1760k 2164k:1440k 2384k:1276k 2972k:4476k 7520k>
  0   3   0  95   1   2|   0 0 :1088k 1064k: 896k 1424k:1152k 992k:3136k 3480k>
  0   0   0  96   0   2|   0 0 : 976k  888k: 632k 1120k:1016k 968k:2624k 2976k>
  0   2   0  94   1   2|   0 0 :1120k 1864k: 964k 1776k:1060k 1856k:3144k 5496k>

-- 
Jon


Re: Reading takes 100% precedence over writes for mdadm+raid5?

2007-12-06 Thread Jon Nelson
On 12/6/07, David Rees <[EMAIL PROTECTED]> wrote:
> On Dec 6, 2007 1:06 AM, Justin Piszcz <[EMAIL PROTECTED]> wrote:
> > On Wed, 5 Dec 2007, Jon Nelson wrote:
> >
> > > I saw something really similar while moving some very large (300MB to
> > > 4GB) files.
> > > I was really surprised to see actual disk I/O (as measured by dstat)
> > > be really horrible.
> >
> > Any work-arounds, or just don't perform heavy reads the same time as
> > writes?
>
> What kernel are you using? (Did I miss it in your OP?)
>
> The per-device write throttling in 2.6.24 should help significantly,
> have you tried the latest -rc and compared to your current kernel?

I was using 2.6.22.12 I think (openSUSE kernel).
I can try using pretty much any kernel - I'm preparing to do an
unrelated test using 2.6.24rc4 this weekend. If I remember I'll try to
see what disk I/O looks like there.


-- 
Jon


Re: Reading takes 100% precedence over writes for mdadm+raid5?

2007-12-05 Thread Jon Nelson
I saw something really similar while moving some very large (300MB to
4GB) files.
I was really surprised to see actual disk I/O (as measured by dstat)
be really horrible.

-- 
Jon


Stack Trace. Bad?

2007-11-06 Thread Jon Nelson
I was testing some network throughput today and ran into this.
I'm going to bet it's a forcedeth driver problem, but since it also
involves software raid I thought I'd include it.
Whom should I contact regarding the forcedeth problem?

The following is only an harmless informational message.
Unless you get a _continuous_flood_ of these messages it means
everything is working fine. Allocations from irqs cannot be
perfectly reliable and the kernel is designed to handle that.
md0_raid5: page allocation failure. order:2, mode:0x20

Call Trace:
   [] __alloc_pages+0x324/0x33d
 [] kmem_getpages+0x66/0x116
 [] fallback_alloc+0x104/0x174
 [] kmem_cache_alloc_node+0x9c/0xa8
 [] __alloc_skb+0x65/0x138
 [] :forcedeth:nv_alloc_rx_optimized+0x4d/0x18f
 [] :forcedeth:nv_napi_poll+0x61f/0x71c
 [] net_rx_action+0xb2/0x1c5
 [] __do_softirq+0x65/0xce
 [] call_softirq+0x1c/0x28
 [] do_softirq+0x2c/0x7d
 [] do_IRQ+0xb6/0xd6
 [] ret_from_intr+0x0/0xa
   [] mempool_free_slab+0x0/0xe
 [] _spin_unlock_irqrestore+0x8/0x9
 [] bitmap_daemon_work+0xee/0x2f3
 [] md_check_recovery+0x22/0x4b9
 [] :raid456:raid5d+0x1b/0x3a2
 [] del_timer_sync+0xc/0x16
 [] schedule_timeout+0x92/0xad
 [] process_timeout+0x0/0x5
 [] schedule_timeout+0x85/0xad
 [] md_thread+0xf2/0x10e
 [] autoremove_wake_function+0x0/0x2e
 [] md_thread+0x0/0x10e
 [] kthread+0x47/0x73
 [] child_rip+0xa/0x12
 [] kthread+0x0/0x73
 [] child_rip+0x0/0x12

Mem-info:
Node 0 DMA per-cpu:
CPU0: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
CPU1: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU0: Hot: hi:  186, btch:  31 usd: 115   Cold: hi:   62, btch:  15 usd:  31
CPU1: Hot: hi:  186, btch:  31 usd: 128   Cold: hi:   62, btch:  15 usd:  56
Active:111696 inactive:116497 dirty:31 writeback:0 unstable:0
 free:1850 slab:19676 mapped:3608 pagetables:1217 bounce:0
Node 0 DMA free:3988kB min:40kB low:48kB high:60kB active:232kB
inactive:5496kB present:10692kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 994 994
Node 0 DMA32 free:3412kB min:4012kB low:5012kB high:6016kB
active:446552kB inactive:460492kB present:1018020kB pages_scanned:0
all_unreclaimable? no
lowmem_reserve[]: 0 0 0
Node 0 DMA: 29*4kB 2*8kB 1*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB
1*1024kB 1*2048kB 0*4096kB = 3988kB
Node 0 DMA32: 419*4kB 147*8kB 19*16kB 0*32kB 1*64kB 0*128kB 1*256kB
0*512kB 0*1024kB 0*2048kB 0*4096kB = 3476kB
Swap cache: add 57, delete 57, find 0/0, race 0+0
Free swap  = 979608kB
Total swap = 979832kB
 Free swap:   979608kB
262128 pages of RAM
4938 reserved pages
108367 pages shared
0 pages swap cached


-- 
Jon


Re: raid5: degraded after reboot

2007-10-12 Thread Jon Nelson
> You said you had to reboot your box using sysrq. There are chances you
> caused the reboot while all pending data was written to sdb4 and sdc4,
> but not to sda4. So sda4 appears to be non-fresh after the reboot and,
> since mdadm refuses to use non-fresh devices, it kicks sda4.

Can mdadm be told to use non-fresh devices?
What about sdb4: I can understand rewinding an event count (sorta),
but what does this mean:

mdadm: forcing event count in /dev/sdb4(1) from 327615 upto 327626

Since the array is degraded, there are 11 "events" missing from sdb4
(presumably sdc4 had them). Since sda4 is not part of the array, the
events can't be complete, can they?  Why jump *ahead* on events
instead of rewinding?
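
(For context, a message like that is what mdadm prints when the assembly
is forced with members whose event counts disagree; roughly the sequence
below, using the device names from this array. The kicked disk then has
to be re-added and fully resynced.)

# Force assembly from the freshest members, then re-add the kicked disk:
mdadm --assemble --force /dev/md0 /dev/sdb4 /dev/sdc4
mdadm /dev/md0 --add /dev/sda4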


> Sure. I should have said: It's normal if one disk in a raid5 array is
> missing (or non-fresh).

I do not have a spare for this raid - I am aware of the risks and
mitigate them in other ways.

> To be precise, it means that the event counter for sda4 is less than
> the event counter on the other devices in the array. So mdadm must
> assume the data on sda4 is out of sync and hence the device can't be
> used. If you are not using bitmaps, there is no other way out than
> syncing the whole device, i.e. writing good data (computed from sdb4
> and sdc4) to sda4.
>
> Hope that helps.

Yes, that helps.

-- 
Jon


Re: raid5: degraded after reboot

2007-10-12 Thread Jon Nelson
On 10/12/07, Andre Noll <[EMAIL PROTECTED]> wrote:
> On 10:38, Jon Nelson wrote:
> > <4>md: kicking non-fresh sda4 from array!
> >
> > what does that mean?
>
> sda4 was not included because the array has been assembled previously
> using only sdb4 and sdc4. So the data on sda4 is out of date.

I don't understand - over months and months it has always been the three
devices, /dev/sd{a,b,c}4.
I've added and removed bitmaps and done other things but at the time it
rebooted the array had been up, "clean" (non-degraded), and comprised of the
three devices for 4-6 weeks.

> > I also have this:
> >
> > raid5: raid level 5 set md0 active with 2 out of 3 devices, algorithm 2
> > RAID5 conf printout:
> >  --- rd:3 wd:2 fd:1
> >  disk 1, o:1, dev:sdb4
> >  disk 2, o:1, dev:sdc4
>
> This looks normal. The array is up with two working disks.

Two of three which, to me, is "abnormal" (ie, the "normal" state is three
and it's got two).

> > Why was /dev/sda4 kicked?
>
> Because it was non-fresh ;)

OK, but what does that MEAN?


> > md0 : active raid5 sda4[3] sdb4[1] sdc4[2]
> >   613409664 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
> >   [==>..]  recovery = 13.1% (40423368/306704832)
> > finish=68.8min speed=64463K/sec
>
> Seems like your init scripts re-added sda4.

No, I did this by hand. I forgot to say that.

--
Jon


raid5: degraded after reboot

2007-10-12 Thread Jon Nelson
I have a software raid5 using /dev/sd{a,b,c}4.
It's been up for months, through many reboots.

I had to do a reboot using sysrq.

When the box came back up, the raid did not re-assemble.
I am not using bitmaps.

I believe it comes down to this:

<4>md: kicking non-fresh sda4 from array!

what does that mean?

I also have this:

raid5: raid level 5 set md0 active with 2 out of 3 devices, algorithm 2
RAID5 conf printout:
 --- rd:3 wd:2 fd:1
 disk 1, o:1, dev:sdb4
 disk 2, o:1, dev:sdc4
mdadm: forcing event count in /dev/sdb4(1) from 327615 upto 327626

Why was /dev/sda4 kicked?

Contents of /etc/mdadm.conf:

DEVICE /dev/hd*[a-h][0-9] /dev/sd*[a-h][0-9]
ARRAY /dev/md0 level=raid5 num-devices=3
UUID=b4597c3f:ab953cb9:32634717:ca110bfc

Current /proc/mdstat:

md0 : active raid5 sda4[3] sdb4[1] sdc4[2]
  613409664 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
  [==>..]  recovery = 13.1% (40423368/306704832)
finish=68.8min speed=64463K/sec

65-70 MB/s is about what these drives can do, so the rebuild speed is just peachy.

-- 
Jon


Re: Backups w/ rsync

2007-09-28 Thread Jon Nelson
On 9/28/07, Bill Davidsen <[EMAIL PROTECTED]> wrote:
> What I don't understand is how you use hard links... because a hard link
> needs to be in the same filesystem, and because a hard link is just
> another pointer to the inode and doesn't make a physical copy of the
> data to another device or to anywhere, really.

Yes, I know how hard links work. There is (one) physical copy of the
data when it goes from the filesystem on the raid to the filesystem on
the drbd. Subsequent "copies" of the same file, assuming the file has
not changed, are all hard links on the drbd-backed filesystem. Thus, I
have one *physical* copy of the data and a whole bunch of hard links.
Now, since I'm using drbd I actually have *two* physical copies (for a
total of three if you include the original) because the *other*
machine has a block-for-block copy of the drbd device (or it did, as
of a few days ago).

link-dest basically works like this:

Assuming we are going to "copy" (using that word loosely here) file
"A" from "/source" to "/dest/backup.tmp/", and we've told rsync that
"/dest/backup.1/A" might exist:


If "/dest/backup.1/A" does not exist: make a physical copy from
"/source/A" to "/dest/backup.tmp/A".
If it does exist, and the two files are considered identical, simply
hardlink "/dest/backup.tmp/A" to "/dest/backup.1/A".
When all files are copied, move every "/dest/backup.N" (N is a number)
to "/dest/backup.N+1"
If /dest/backup.31 exists, delete it.
Move /dest/backup.tmp to /dest/backup.1 (which was just renamed /dest/backup.2)
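
In script form the whole rotation is roughly this (a sketch; SRC, DST and
the 31-generation depth are just the values from the description above):

#!/bin/sh
SRC=/source/
DST=/dest

# New snapshot, hard-linking unchanged files against the newest one.
rsync -a --link-dest="$DST/backup.1" "$SRC" "$DST/backup.tmp"

# Retire the oldest snapshot, then shift backup.N -> backup.N+1.
rm -rf "$DST/backup.31"
for i in $(seq 30 -1 1); do
    [ -d "$DST/backup.$i" ] && mv "$DST/backup.$i" "$DST/backup.$((i+1))"
done

# The fresh snapshot becomes backup.1.
mv "$DST/backup.tmp" "$DST/backup.1"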

I can do all of this, for 175K files (40G), in under 2 minutes on
modest hardware.
I end up with:
1+1 physical copies of the data (local drbd copy and remote drbd copy)

There is more but if I may suggest: if you want more details contact
me off-line, I'm pretty sure the linux-raid folks couldn't care less
about rsync and drbd.
-- 
Jon


Re: Backups w/ rsync

2007-09-28 Thread Jon Nelson
Please note: I'm having trouble w/gmail's formatting... so please
forgive this if it looks horrible. :-|

On 9/28/07, Bill Davidsen <[EMAIL PROTECTED]> wrote:
>
> Dean S. Messing wrote:
> > It has been some time since I read the rsync man page.  I see that
> > there is (among the bazillion and one switches) a "--link-dest=DIR"
> > switch which I suppose does what you describe.  I'll have to
> > experiment with this and think things through.  Thanks, Michal.
> >
>
> Be aware that rsync is useful for making a *copy* of your files, which
> isn't always the best backup. If the goal is to preserve data and be
> able to recover in time of disaster, it's probably not optimal, while if
> you need frequent access to old or deleted files it's fine.


You are absolutely right when you say it isn't always the best backup. There
IS no 'best' backup.

> For example, full and incremental backup methods such as dump and
> restore are usually faster to take and restore than a copy, and allow
> easy incremental backups.


If "copy" meant "full data copy" and not "hard link where possible", I'd
agree with you. However...

I use a nightly rsync (with --link-dest) to backup more than 40 GiB to a
drbd-backed drive. I'll explain why I use drbd in just a moment.

Technically, I have a 3 disk raid5 (Linux Software Raid) which is the
primary store for the data. Then I have a second drive (non-raid) that is
used as a drbd backing store, which I rsync *to* from filesystems built off
of the raid. I keep *30 days* of nightly backups on the drbd volume. The
average difference between nightly backups is about 45MB, or a bit less than
10%. The total disk usage is (on average) about 10% more than a single
backup. On an AMD x86-64 dual core (3600 de-clocked to run at 1GHz) the
entire process takes between 1 and 2 minutes, from start to finish.

Using hard links means I can snapshot ~175,000 files, about 40GiB, in under
2 minutes - something I'd have a hard time doing with dump+restore. I could
easily make incremental or differential copies, and maybe even in that time
frame, but I'm not sure I see much advantage in that. Furthermore, as you state,
dump+restore does *not* include the removal of files, which for some
scenarios is a huge deal.

The long and short of it is this: using hard links (via rsync or cp or
whatever) to do snapshot backups can be really, really fast and have
significant advantages but there are, as with all things, some downsides.
Those downsides are fairly easily mitigated, however. In my case, I can lose
1 drive of the raid and I'm OK. If I lose 2, then the other drive (not part
of the raid) has the data I care about. If I lose the entire machine, the
*other* machine (the other end of the drbd, only woken up every other day or
so) has the data. Going back 30 days. And a bare-metal "restore" is as fast
as your I/O is.  I back my /really/ important stuff up on DLT.

Thanks again to drbd, when the secondary comes up it communicates with the
primary and is able to figure out only which blocks have changed and only
copies those. On a nightly basis that is usually a couple of hundred
megabytes, and at 12MiB/s that doesn't take terribly long to take care of.

-- 
Jon


Re: Fastest Chunk Size w/XFS For MD Software RAID = 1024k

2007-06-28 Thread Jon Nelson
On Thu, 28 Jun 2007, Matti Aarnio wrote:

> I do have LVM in between the MD-RAID5 and XFS, so I did also align
> the LVM to that  3 * 256k.

How did you align the LVM?


--
Jon Nelson <[EMAIL PROTECTED]>


Re: stripe_cache_size and performance

2007-06-26 Thread Jon Nelson
On Tue, 26 Jun 2007, Justin Piszcz wrote:

> 
> 
> On Tue, 26 Jun 2007, Jon Nelson wrote:
> 
> > On Tue, 26 Jun 2007, Justin Piszcz wrote:
> >
> > >
> > >
> > > On Tue, 26 Jun 2007, Jon Nelson wrote:
> > >
> > > > On Mon, 25 Jun 2007, Justin Piszcz wrote:
> > > >
> > > > > Neil has a patch for the bad speed.
> > > >
> > > > What does the patch do?

I repeat: what does the patch do (or is this no longer applicable)?


> > > > sync_speed_max defaults to 20 on this box already.

Altering sync_speed_min...

> > > > I tried a binary search of values between the default (1000)
> > > > and 6 which resulted in some pretty weird behavior:
> > > >
> > > > at values below 26000 the rate (also confirmed via dstat output) stayed
> > > > low.  2-3MB/s.  At 26000 and up, the value jumped more or less instantly
> > > > to 70-74MB/s. What makes 26000 special? If I set the value to 2 why
> > > > do I still get 2-3MB/s actual?

> Sounds quite strange, what chunk size are you using for your RAID?

The default: 64

md0 : active raid5 sdc4[2] sda4[0] sdb4[1]
  613409664 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]


--
Jon Nelson <[EMAIL PROTECTED]>


Re: stripe_cache_size and performance

2007-06-26 Thread Jon Nelson
On Tue, 26 Jun 2007, Justin Piszcz wrote:

> 
> 
> On Tue, 26 Jun 2007, Jon Nelson wrote:
> 
> > On Mon, 25 Jun 2007, Justin Piszcz wrote:
> >
> > > Neil has a patch for the bad speed.
> >
> > What does the patch do?
> >
> > > In the mean time, do this (or better to set it to 30, for instance):
> > >
> > > # Set minimum and maximum raid rebuild speed to 60MB/s.
> > > echo "Setting minimum and maximum resync speed to 60 MiB/s..."
> > > echo 6 > /sys/block/md0/md/sync_speed_min
> > > echo 6 > /sys/block/md0/md/sync_speed_max
> >
> > sync_speed_max defaults to 20 on this box already.
> > I tried a binary search of values between the default (1000)
> > and 6 which resulted in some pretty weird behavior:
> >
> > at values below 26000 the rate (also confirmed via dstat output) stayed
> > low.  2-3MB/s.  At 26000 and up, the value jumped more or less instantly
> > to 70-74MB/s. What makes 26000 special? If I set the value to 2 why
> > do I still get 2-3MB/s actual?


> You want to use sync_speed_min.

I forgot to say I changed the values of sync_speed_min.


--
Jon Nelson <[EMAIL PROTECTED]>


Re: stripe_cache_size and performance

2007-06-26 Thread Jon Nelson
On Mon, 25 Jun 2007, Justin Piszcz wrote:

> Neil has a patch for the bad speed.

What does the patch do?

> In the mean time, do this (or better to set it to 30, for instance):
> 
> # Set minimum and maximum raid rebuild speed to 60MB/s.
> echo "Setting minimum and maximum resync speed to 60 MiB/s..."
> echo 6 > /sys/block/md0/md/sync_speed_min
> echo 6 > /sys/block/md0/md/sync_speed_max

sync_speed_max defaults to 20 on this box already.
I tried a binary search of values between the default (1000) 
and 6 which resulted in some pretty weird behavior:

at values below 26000 the rate (also confirmed via dstat output) stayed 
low.  2-3MB/s.  At 26000 and up, the value jumped more or less instantly 
to 70-74MB/s. What makes 26000 special? If I set the value to 2 why 
do I still get 2-3MB/s actual?

--
Jon Nelson <[EMAIL PROTECTED]>


Re: stripe_cache_size and performance

2007-06-25 Thread Jon Nelson
On Mon, 25 Jun 2007, Dan Williams wrote:

> > 7. And now, the question: the best absolute 'write' performance comes
> > with a stripe_cache_size value of 4096 (for my setup). However, any
> > value of stripe_cache_size above 384 really, really hurts 'check' (and
> > rebuild, one can assume) performance.  Why?
> >
> Question:
> After performance goes "bad" does it go back up if you reduce the size
> back down to 384?

Yes, and almost instantly.

--
Jon Nelson <[EMAIL PROTECTED]>


Re: stripe_cache_size and performance

2007-06-25 Thread Jon Nelson
On Thu, 21 Jun 2007, Jon Nelson wrote:

> On Thu, 21 Jun 2007, Raz wrote:
> 
> > What is your raid configuration ?
> > Please note that the stripe_cache_size is acting as a bottle neck in some
> > cases.

Well, that's kind of the point of my email. I'll try to restate things, 
as my question appears to have gotten lost.

1. I have a 3x component raid5, ~314G per component. Each component 
happens to be the 4th partition of a 320G SATA drive. Each drive 
can sustain approx. 70MB/s reads/writes. Except for the first 
drive, none of the other partitions are used for anything else at this 
time. The system is nominally quiescent during these tests.

2. The kernel is 2.6.18.8-0.3-default on x86_64 (openSUSE 10.2).

3. My best sustained write performance comes with a stripe_cache_size of 
4096. Larger than that seems to reduce performance, although only very 
slightly.

4. At values below 4096, the absolute write performance is less than the 
best, but only marginally. 

5. HOWEVER, at any value *above* 512 the 'check' performance is REALLY 
BAD. By 'check' performance I mean the value displayed by /proc/mdstat 
after I issue:

echo check > /sys/block/md0/md/sync_action

When I say "REALLY BAD" I mean < 3MB/s. 

6. Here is a short incomplete table of stripe_cache_size to 'check' 
performance:

384     72-73 MB/s
512     72-73 MB/s
640     73-74 MB/s
768     3-3.4 MB/s

And the performance stays "bad" as I increase the stripe_cache_size.

7. And now, the question: the best absolute 'write' performance comes 
with a stripe_cache_size value of 4096 (for my setup). However, any 
value of stripe_cache_size above 384 really, really hurts 'check' (and 
rebuild, one can assume) performance.  Why?
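
(For reference, the knobs referred to in points 5 and 6 are plain sysfs
files; 384 below is just one of the values from the table, and md0 is
assumed to be the array in question:)

# stripe_cache_size is a number of cache entries; memory used is roughly
# entries * page size * number of member devices.
echo 384 > /sys/block/md0/md/stripe_cache_size

# Start a check pass and watch the rate in /proc/mdstat:
echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat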

--
Jon Nelson <[EMAIL PROTECTED]>


Re: stripe_cache_size and performance

2007-06-21 Thread Jon Nelson
On Thu, 21 Jun 2007, Raz wrote:

> What is your raid configuration ?
> Please note that the stripe_cache_size is acting as a bottle neck in some
> cases.

Well, it's 3x SATA drives in raid5. 320G drives each, and I'm using a 
314G partition from each disk (the rest of the space is quiescent).

> On 6/21/07, Jon Nelson <[EMAIL PROTECTED]> wrote:
> >
> > I've been futzing with stripe_cache_size on a 3x component raid5,
> > using 2.6.18.8-0.3-default on x86_64 (openSUSE 10.2).
> >
> > With the value set at 4096 I get pretty great write numbers.
> > 2048 and on down the write numbers slowly drop.
> >
> > However, at values above 512 the 'check' performance is terrible. By
> > 'check' performance I mean the value displayed by /proc/mdstat after
> > I issue:
> >
> > echo check > /sys/block/md0/md/sync_action
> >
> > When I say "terrible" I mean < 3MB/s.
> > When I use 384, the performance goes to ~70MB/s
> > 512.. 72-73MB/s
> > 640.. 73-74MB/s
> >
> > 768.. 3300 K/s. Wow!
> >
> > Can somebody 'splain to me what is going on?
> >
> > --
> > Jon Nelson <[EMAIL PROTECTED]>
> >
> 
> 
> -- 
> Raz
> 
> 
> 

--
Jon Nelson <[EMAIL PROTECTED]>


stripe_cache_size and performance

2007-06-21 Thread Jon Nelson

I've been futzing with stripe_cache_size on a 3x component raid5, 
using 2.6.18.8-0.3-default on x86_64 (openSUSE 10.2).

With the value set at 4096 I get pretty great write numbers.
2048 and on down the write numbers slowly drop.

However, at values above 512 the 'check' performance is terrible. By 
'check' performance I mean the value displayed by /proc/mdstat after
I issue:

echo check > /sys/block/md0/md/sync_action

When I say "terrible" I mean < 3MB/s.
When I use 384, the performance goes to ~70MB/s
512.. 72-73MB/s
640.. 73-74MB/s

768.. 3300 K/s. Wow!

Can somebody 'splain to me what is going on?

--
Jon Nelson <[EMAIL PROTECTED]>


Re: Removing devices from RAID-5

2007-06-14 Thread Jon Nelson
On Thu, 14 Jun 2007, Rich Walker wrote:

> 
> Hi,
> 
> I've been having some problems with a machine, and now want to reduce
> the number of drives in the array.
> 
> It started out as an array of 160GB drives, but over time they have
> mostly been replaced by 250GB drives:

If you've been having trouble with a machine, doing risky stuff like 
restructuring raids and/or putting them in degraded mode (even 
temporarily) is a bad idea. That said, ...
 
If you had another 250G disk available you could do it, but it would 
involve the creation of a new array and use of pvmove. You said you 
could knock the space down to just 3x160, which is 480G - something that
2x250 should be able to cover.

Assuming you had another 250G to play with (you only need it 
temporarily):

1. remove one 250G from array
2. make new array from second drive and the one you just removed 
   (/dev/mdNEW) (degraded 2/3 raid5)
3. add new array to volume group 
4. use pvmove or whatever to move all physical extents from /dev/md1 to 
   /dev/mdNEW
5. When complete, disable /dev/md1, remove /dev/md1 from LVM
   (don't forget to zero the superblocks, that one always bites me)
6. take one of the now-available 250's and add it to /dev/mdNEW,
   allowing it to reconstruct
7. You now have a 3x250 RAID5 and 2 unused drives, with no
   downtime (except adding a new 250G)
8. Update mdadm.conf and other files as necessary.
9. Since you only needed a 250G drive temporarily, you could play games 
   with fault+remove of the "borrowed" drive and replace it or whatever 
   you want to do...

Otherwise, I don't think you can use mdadm to accomplish this.
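
In command form, the steps above look roughly like this (a sketch only;
/dev/mdNEW and the sdX1/sdY1/sdZ1 partitions are placeholders, as is the
volume group name vg0; double-check every step against your own setup):

# Steps 1-2: pull one 250G member out of the old array and build a
# deliberately degraded 3-device raid5 from it plus the borrowed disk.
mdadm /dev/md1 --fail /dev/sdX1 --remove /dev/sdX1
mdadm --create /dev/mdNEW --level=5 --raid-devices=3 \
      /dev/sdX1 /dev/sdY1 missing

# Steps 3-5: move the LVM extents over, then retire the old array.
pvcreate /dev/mdNEW
vgextend vg0 /dev/mdNEW
pvmove /dev/md1 /dev/mdNEW
vgreduce vg0 /dev/md1
mdadm --stop /dev/md1
mdadm --zero-superblock /dev/sdZ1    # repeat for each old md1 member

# Step 6: complete the new array with one of the freed 250G drives.
mdadm /dev/mdNEW --add /dev/sdZ1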

--
Jon Nelson <[EMAIL PROTECTED]>


Re: below 10MB/s write on raid5

2007-06-11 Thread Jon Nelson
On Mon, 11 Jun 2007, Nix wrote:

> On 11 Jun 2007, Justin Piszcz told this:
> > You can do a read test.
> >
> > 10gb read test:
> >
> > dd if=/dev/md0 bs=1M count=10240 of=/dev/null
> >
> > What is the result?
> >
> > I've read that LVM can incur a 30-50% slowdown.
> 
> FWIW I see a much smaller penalty than that.
> 
> loki:~# lvs -o +devices
>   LV   VGAttr   LSize   Origin Snap%  Move Log Copy%  Devices
> [...]
>   usr  raid  -wi-ao   6.00G   /dev/md1(50)
> 
> loki:~# time dd if=/dev/md1 bs=1000 count=502400 of=/dev/null
> 502400+0 records in
> 502400+0 records out
> 50240 bytes (502 MB) copied, 16.2995 s, 30.8 MB/s
> 
> loki:~# time dd if=/dev/raid/usr bs=1000 count=502400 of=/dev/null
> 502400+0 records in
> 502400+0 records out
> 50240 bytes (502 MB) copied, 18.6172 s, 27.0 MB/s

And what is it like with 'iflag=direct'? I really feel you have to use 
it; otherwise you're measuring the page cache as much as the disks.
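
Something along these lines, I mean (sizes only an example):

dd if=/dev/md1 of=/dev/null bs=1M count=500 iflag=direct
dd if=/dev/raid/usr of=/dev/null bs=1M count=500 iflag=direct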

--
Jon Nelson <[EMAIL PROTECTED]>


Re: below 10MB/s write on raid5

2007-06-11 Thread Jon Nelson
On Mon, 11 Jun 2007, Justin Piszcz wrote:

> 
> 
> On Mon, 11 Jun 2007, Dexter Filmore wrote:
> 
> > On Monday 11 June 2007 14:47:50 Justin Piszcz wrote:
> > > On Mon, 11 Jun 2007, Dexter Filmore wrote:
> > > > I recently upgraded my file server, yet I'm still unsatisfied with the
> > > > write speed.
> > > > Machine now is a Athlon64 3400+ (Socket 754) equipped with 1GB of RAM.
> > > > The four RAID disks are attached to the board's onbaord sATA controller
> > > > (Sil3114 attached via PCI)
> > > > Kernel is 2.6.21.1, custom on Slackware 11.0.
> > > > RAID is on four Samsung SpinPoint disks, has LVM, 3 volumes atop of each
> > > > XFS.
> > > >
> > > > The machine does some other work, too, but still I would have suspected
> > > > to get into the 20-30MB/s area. Too much asked for?
> > > >
> > > > Dex
> > >
> > > What do you get without LVM?
> >
> > Hard to tell: the PV hogs all of the disk space, can't really do non-LVM
> > tests.
> 
> You can do a read test.
> 
> 10gb read test:
> 
> dd if=/dev/md0 bs=1M count=10240 of=/dev/null

eek! Make sure to use iflag=direct with that; otherwise you'll get 
cached reads, which will throw the numbers off considerably.


--
Jon Nelson <[EMAIL PROTECTED]>


Re: Regarding odd RAID5 I/O patterns

2007-06-07 Thread Jon Nelson
On Thu, 7 Jun 2007, Neil Brown wrote:

> On Wednesday June 6, [EMAIL PROTECTED] wrote:
> >
> > 2. now, if I use oflag=direct, the I/O patterns are very strange:
> >0 (zero) reads from sda or sdb, and 2-3MB/s worth of reads from sdc.
> >11-12 MB/s writes to sda, and 8-9MB/s writes to sdb and sdc.
> >
> >--dsk/sda-- --dsk/sdb-- --dsk/sdc-- --dsk/hda--
> > read  writ: read  writ: read  writ: read  writ
> >   011M:4096B 8448k:2824k 8448k:   0   132k
> >   012M:   0  9024k:3008k 9024k:   0   152k
> >
> >Why is /dev/sdc getting so many reads? This only happens with
> >multiples of 192K for blocksizes. For every other blocksize I tried,
> >the reads are spread across all three disks.
>
> Where letters are 64K chunks, and digits are 64K parity chunks, and
> columns are individual drives, your data is laid out something like
> this:
>
> A   B   1
> C   2   D
> 3   E   F
>
> Your first 192K write contains data for A, B, and C.
> To generate 1 no read is needed.
> To generate '2', it needs to read either C or D.  It chooses D.
> So you get a read from the third drive, and writes to all.
>
> Your next 192K write contains data for D, E, and F.
> The update '2' it finds that C is already in cache and doesn't need to
> read anything.  To generate '3', E and F are both available, so no
> read is needed.
>
> This pattern repeats.

Aha!

> > 3. Why can't I find a blocksize that doesn't require reading from any
> >device? Theoretically, if the chunk size is 64KB, then writing 128KB
> >*should* result in 3 writes and 0 reads, right?
>
> With oflag=direct 128KB should work.  What do you get?
> Without oflag=direct, you have less control.  The VM will flush data
> whenever it wants to and it doesn't know about raid5 alignment
> requirements.

I tried 128KB. Actually, I tried dozens of values and found strange
patterns. Would using 'sync' as well help? [ me tries.. nope. ]

Note: the bitmap in this case remains external (on /dev/hda)
Note: /dev/raid/test is a logical volume carved from a volume group 
whose only physical volume is the raid.

Using:

dd if=/dev/zero of=/dev/raid/test bs=128K oflag=direct


!!
NOTE: after writing this, I decided to test against a raid device 
'in-the-raw' (without LVM). 128KB writes get the expected behavior (no 
reads). Unfortunately, this means LVM is doing something funky (maybe 
expected by others, though...) which means that the rest of this isn't 
specific to raid. Where do I go now to find out what's going on?
!!
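
One thing that might be worth checking is whether the LVM data area even
starts on a stripe boundary. If your LVM2 tools can report pe_start,
something like this should show it:

pvs -o +pe_start /dev/md0                  # offset of the first PE on the PV
mdadm --detail /dev/md0 | grep -i chunk    # chunk size of the array

If pe_start isn't a multiple of (chunk size x number of data disks), even
"aligned" writes from the LV will straddle stripes and force
read-modify-write.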

When I use 128KB I get reads across all three devices. The following is
from dstat, showing 12-13MB/s writes to each drive, and 3.2MB/s give or
take reads. The pattern remains consistent:

--dsk/sda-- --dsk/sdb-- --dsk/sdc-- --dsk/hda--
 read  writ: read  writ: read  writ: read  writ
2688k   11M:2700k   11M:2696k   11M:   0   136k
2688k   10M:2656k   10M:2688k   10M:   0   124k
2752k   11M:2752k   11M:2688k   11M:   0   128k

(/dev/hda is where the bitmap is stored, so the writes there make
perfect sense - however, why are there any reads on sda, sdb, or sdc?)

> > 4. When using the page cache (no oflag=direct), even with 192KB
> >blocksizes, there are (except for noise) *no* reads from the devices,
> >as expected.  Why does bypassing the page cache, plus the
> >combination of 192KB blocks cause such strange behavior?
>
> Hmm... this isn't what I get... maybe I misunderstood exactly what you
> were asking in '2' above?

I should have made clearer that items 1 through 4 have the bitmap on an
external device to avoid having to update it (when internal), if that
matters. Essentially, whenever I use dd *without* oflag=direct,
regardless of the blocksize, dstat shows 0 (zero) reads on the component
devices.

dd if=/dev/zero of=/dev/raid/test bs=WHATEVER
(no oflag=direct)

--dsk/sda-- --dsk/sdb-- --dsk/sdc-- --dsk/hda--
 read  writ: read  writ: read  writ: read  writ
   041M:   041M:   041M:   0   240k
   066M:   076M:   067M:   0   260k


> > 5. If I use an 'internal' bitmap, the write performance is *terrible*. I
> >can't seem to sqeeze more than 8-12MB/s out of it (no page cache) or
> >60MB/s (page cache allowed). When not using the page cache, the reads
> >are spread across all three disks to the tune of 2-4MB per second.
> >The bitmap "file" is only 150KB or so in size, why does storing it
> >internally cause such a huge performance problem?
>
> If the bitmap is internal, you have

Regarding odd RAID5 I/O patterns

2007-06-06 Thread Jon Nelson

What I've got:
openSUSE 10.2 running 2.6.18.8-0.3-default on x86_64 (3600+, dual core)
with a 3-component raid5:

md0 : active raid5 sdc4[2] sda4[0] sdb4[1]
  613409664 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

I was testing sequential write speeds with dd:

dd if=/dev/zero of=/dev/raid/test bs=192K 

with and without oflag=direct, with various bitmap choices, and with 
varying blocksizes.

What I'm observing I simply can't explain.

1. if I use an external bitmap, and using the page cache (without 
   oflag=direct), I seem to be able to get up to 116MB/s writes. The I/O 
   patterns as reported by iostat/dstat are reasonable (0 B/S reads, 
   high 60's (with some fluctuation, but not much) MB/s writes, very 
   consistent across all three drives):

   --dsk/sda-- --dsk/sdb-- --dsk/sdc-- --dsk/hda--
read  writ: read  writ: read  writ: read  writ
  067M:  16k   71M:  52k   69M:   0   272k
  067M:  20k   65M:   069M:   0   320k

2. now, if I use oflag=direct, the I/O patterns are very strange:
   0 (zero) reads from sda or sdb, and 2-3MB/s worth of reads from sdc.
   11-12 MB/s writes to sda, and 8-9MB/s writes to sdb and sdc.

   --dsk/sda-- --dsk/sdb-- --dsk/sdc-- --dsk/hda--
read  writ: read  writ: read  writ: read  writ
  011M:4096B 8448k:2824k 8448k:   0   132k
  012M:   0  9024k:3008k 9024k:   0   152k

   Why is /dev/sdc getting so many reads? This only happens with 
   multiples of 192K for blocksizes. For every other blocksize I tried,
   the reads are spread across all three disks.

3. Why can't I find a blocksize that doesn't require reading from any 
   device? Theoretically, if the chunk size is 64KB, then writing 128KB 
   *should* result in 3 writes and 0 reads, right?

4. When using the page cache (no oflag=direct), even with 192KB 
   blocksizes, there are (except for noise) *no* reads from the devices, 
   as expected.  Why does bypassing the page cache, plus the 
   combination of 192KB blocks cause such strange behavior?

5. If I use an 'internal' bitmap, the write performance is *terrible*. I 
   can't seem to squeeze more than 8-12MB/s out of it (no page cache) or 
   60MB/s (page cache allowed). When not using the page cache, the reads 
   are spread across all three disks to the tune of 2-4MB per second. 
   The bitmap "file" is only 150KB or so in size, why does storing it 
   internally cause such a huge performance problem?

--
Jon Nelson <[EMAIL PROTECTED]>


RE: RAID 6 grow problem

2007-06-06 Thread Jon Nelson
On Wed, 6 Jun 2007, Daniel Korstad wrote:

>  You say you have a RAID with three drives (I assume RAID5) with a read 
> performance of 133MB/s.  There are lots of variables, file system 
> type, cache tuning, but that sounds very reasonable to me.

I just did the math quickly - assuming each drive can sustain (give or 
take) 70MB/s, and I'm using a 3-drive raid5, then (block device) reads 
effectively use only 2 of the 3 drives at a time, resulting in a maximum of 140MB/s. 
I've measured (consistently) 133MB/s using dd (again, with iflag=direct) 
and a blocksize of 768K, so I'm OK with that.

A question posed by a friend is this: assuming 64K blocks, if I read a 
single stripe from a raid (128K, right?) then two drives will be used 
(say, drive A and drive B). If I want the "next" 128K then which drives 
are most likely to be used? Drive A will now have the parity block in 
it's "next" block for head position, but drive B has the next data 
block. Nobody knows where drive C is. Does raid5 use an algorithm 
similar to raid1 in that it chooses the 2 drives whose heads are closest 
or does it utilize some other algorithm? 

> Here is a site with some test for RAID5 and 8 drives in the set using 
> high end hardware raid.

> http://www.rhic.bnl.gov/hepix/talks/041019pm/schoen.pdf
> 8 drives RAID 5 7200 rpm SATA drives = ~180MB/s

I would have expected a peak transfer rate (sustained, block, linear 
access) of MIN(8*70, 533), which is 533MB/s, assuming a 64bit, 66MHz bus and 
70MB/s drives.

I recently completed a series of some 800+ tests on a 4 disk raid5 
varying the I/O scheduler, readahead of the components, readahead of the 
raid, bitmap present or not, and filesystem, and arrived at some fairly 
interesting results. I hope to throw them together in a more usable form 
in the near future.
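
For the curious, the knobs involved are roughly these (values only
examples):

echo deadline > /sys/block/sda/queue/scheduler   # per-component I/O scheduler
blockdev --setra 512  /dev/sda                   # component readahead, in 512B sectors
blockdev --setra 4096 /dev/md0                   # readahead on the array itself
mdadm --grow /dev/md0 --bitmap=none              # or --bitmap=internal

plus a mkfs/mount cycle per filesystem under test.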

--
Jon Nelson <[EMAIL PROTECTED]>


RE: RAID 6 grow problem

2007-06-05 Thread Jon Nelson
On Tue, 5 Jun 2007, Daniel Korstad wrote:

> Sounds like you are well on your way.
>  I am not too surprised on the time to completion.  I probably 
> underestimated/exaggerated a bit when I said after a few hours :)
>  It took me over a day to grow one disk as well.  But my experience 
> was on a system with an older AMD 754 x64 Mother Board with a couple 
> SATA on board and the rest on two PCI cards each with 4 SATA ports.  
> So I have 8 SATA drives on my PCI (33MHz x 4 bytes (32bits) = 133MB/s) 
> bus of which is saturated basically after three drives.

Related to this question, I have several of my own.

I have an EPoX 570SLI motherboard with 3 SATAII drives, all 320GB: one 
Hitachi, one Samsung, one Seagate. I built a RAID5 out of a partition 
carved from each. I can issue a 'check' command and the rebuild speed 
hovers around 70MB/s, sometimes up to 73MB/s, and dstat/iostat/whatever 
confirms that each drive is sustaining approximately 70MB/s reads. 
Therefore, 3x70MB/s = 210MB/s which is a bunch more than 133MB/s. lspci 
-v reveals, for one of the interfaces (the others are pretty much the 
same):

00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2) 
(prog-if 85 [Master SecO PriO])
Subsystem: EPoX Computer Co., Ltd. Unknown device 1026
Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 11
I/O ports at 09f0 [size=8]
I/O ports at 0bf0 [size=4]
I/O ports at 0970 [size=8]
I/O ports at 0b70 [size=4]
I/O ports at e000 [size=16]
Memory at fe02d000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [44] Power Management version 2
Capabilities: [b0] Message Signalled Interrupts: Mask- 64bit+ 
Queue=0/2 Enable-
Capabilities: [cc] HyperTransport: MSI Mapping

which seems to clearly indicate that it is running at 66MHz (meaning 
266MB/s maximum). As I say below, the best I seem to be able to get out 
of it is 133MB/s, give or take. Can somebody explain what some of those 
other items mean, such as the "64bit+" flag and the different-sized I/O 
ports?

Each drive identifies with different UDMA levels:

The hitachi:
ata1.00: ATA-7, max UDMA/133, 625142448 sectors: LBA48 NCQ (depth 0/32) 

The samsung:
ata2.00: ATA-8, max UDMA7, 625142448 sectors: LBA48 NCQ (depth 0/32)

The seagate:
ata3.00: ATA-7, max UDMA/133, 625142448 sectors: LBA48 NCQ (depth 0/32)
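
Those are the advertised maximums; what the driver actually negotiated
shows up a little further along in the same dmesg output, e.g.:

dmesg | grep -i 'configured for udma'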


I'm trying to determine what the limiting factor of my raid is: Is it 
the drives, my CPU (AMD x86_64, dual core, 3600+), my motherboard, 
software, or something else. The best I've been able to get in userland 
is about 133MB/s (no filesystem, raw device reads using dd with 
iflag=direct).  What *should* I be able to get?
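
One way I can think of to separate the drives from the bus/controller is
to read all three raw disks at once and compare with a single disk; a
rough sketch (device names assumed to be sda/sdb/sdc):

for d in sda sdb sdc; do
    dd if=/dev/$d of=/dev/null bs=1M count=1024 iflag=direct &
done
wait

If each dd still reports ~70MB/s the controller path isn't the limit; if
they all drop to ~45MB/s (about 133MB/s total) it almost certainly is.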

--
Jon Nelson <[EMAIL PROTECTED]>


Re: very strange (maybe) raid1 testing results

2007-05-30 Thread Jon Nelson
On Wed, 30 May 2007, Jon Nelson wrote:

> On Thu, 31 May 2007, Richard Scobie wrote:
> 
> > Jon Nelson wrote:
> > 
> > > I am getting 70-80MB/s read rates as reported via dstat, and 60-80MB/s as
> > > reported by dd. What I don't understand is why just one disk is being used
> > > here, instead of two or more. I tried different versions of metadata, and
> > > using a bitmap makes no difference. I created the array with (allowing for
> > > variations of bitmap and metadata version):
> > 
> > This is normal for md RAID1. What you should find is that for 
> > concurrent reads, each read will be serviced by a different disk, 
> > until no. of reads = no. of drives.
> 
> Alright. To clarify, let's assume some process (like a single-threaded 
> webserver) using a raid1 to store content (who knows why, let's just say 
> it is), and also assume that the I/O load is 100% reads. Given that the 
> server does not fork (or create a thread) for each request, does that 
> mean that every single web request is essentially serviced from one 
> disk, always? What mechanism determines which disk actually services the 
> request?

It's probably bad form to reply to one's own posts, but I just found

static int read_balance(conf_t *conf, r1bio_t *r1_bio)

in raid1.c which, if I'm reading the rest of the source correctly, 
basically says "pick the disk whose current head position is closest". 
This *could* explain the behavior I was seeing. Is that not correct?

--
Jon Nelson <[EMAIL PROTECTED]>


Re: very strange (maybe) raid1 testing results

2007-05-30 Thread Jon Nelson
On Thu, 31 May 2007, Richard Scobie wrote:

> Jon Nelson wrote:
> 
> > I am getting 70-80MB/s read rates as reported via dstat, and 60-80MB/s as
> > reported by dd. What I don't understand is why just one disk is being used
> > here, instead of two or more. I tried different versions of metadata, and
> > using a bitmap makes no difference. I created the array with (allowing for
> > variations of bitmap and metadata version):
> 
> This is normal for md RAID1. What you should find is that for 
> concurrent reads, each read will be serviced by a different disk, 
> until no. of reads = no. of drives.

Alright. To clarify, let's assume some process (like a single-threaded 
webserver) using a raid1 to store content (who knows why, let's just say 
it is), and also assume that the I/O load is 100% reads. Given that the 
server does not fork (or create a thread) for each request, does that 
mean that every single web request is essentially serviced from one 
disk, always? What mechanism determines which disk actually services the 
request?

--
Jon Nelson <[EMAIL PROTECTED]>


very strange (maybe) raid1 testing results

2007-05-30 Thread Jon Nelson

I assembled a 3-component raid1 out of 3 4GB partitions.
After syncing, I ran the following script:

for bs in 32 64 128 192 256 384 512 768 1024 ; do \
 let COUNT="2048 * 1024 / ${bs}"; \
 echo -n "${bs}K bs - "; \
 dd if=/dev/md1 of=/dev/null bs=${bs}k count=$COUNT iflag=direct 2>&1 | 
 grep 'copied' ; \
done

I also ran 'dstat' (like iostat) in another terminal. What I noticed was 
very unexpected to me, so I re-ran it several times.  I confirmed my 
initial observation - every time a new dd process ran, *all* of the read 
I/O for that process came from a single disk. It does not (appear to) 
have to do with block size -  if I stop and re-run the script the next 
drive in line will take all of the I/O - it goes sda, sdc, sdb and back 
to sda and so on.

I am getting 70-80MB/s read rates as reported via dstat, and 60-80MB/s 
as reported by dd. What I don't understand is why just one disk is being 
used here, instead of two or more. I tried different versions of 
metadata, and using a bitmap makes no difference. I created the array 
with (allowing for variations of bitmap and metadata version):

mdadm --create --level=1 --raid-devices=3 /dev/md1 /dev/sda3 /dev/sdb3 /dev/sdc3

I am running 2.6.18.8-0.3-default on x86_64, openSUSE 10.2.

Am I doing something wrong or is something weird going on?

--
Jon Nelson <[EMAIL PROTECTED]>