Re: LVM performance (was: Re: RAID5 to RAID6 reshape?)

2008-02-19 Thread Iustin Pop
On Tue, Feb 19, 2008 at 01:52:21PM -0600, Jon Nelson wrote:
 On Feb 19, 2008 1:41 PM, Oliver Martin
 [EMAIL PROTECTED] wrote:
  Janek Kozicki wrote:
 
  $ hdparm -t /dev/md0
 
  /dev/md0:
Timing buffered disk reads:  148 MB in  3.01 seconds =  49.13 MB/sec
 
  $ hdparm -t /dev/dm-0
 
  /dev/dm-0:
Timing buffered disk reads:  116 MB in  3.04 seconds =  38.20 MB/sec
 
 I'm getting better performance on a LV than on the underlying MD:
 
 # hdparm -t /dev/md0
 
 /dev/md0:
  Timing buffered disk reads:  408 MB in  3.01 seconds = 135.63 MB/sec
 # hdparm -t /dev/raid/multimedia
 
 /dev/raid/multimedia:
  Timing buffered disk reads:  434 MB in  3.01 seconds = 144.04 MB/sec
 #

As people are trying to point out in many lists and docs: hdparm is
*not* a benchmark tool. So its numbers, while interesting, should not be
regarded as a valid comparison.
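
For what it's worth, a rough sketch of a slightly more meaningful
sequential-read check (device name and sizes are only placeholders),
bypassing the page cache so repeated runs don't just measure RAM:

  sync
  echo 3 > /proc/sys/vm/drop_caches    # drop cached data first
  dd if=/dev/md0 of=/dev/null bs=1M count=4096 iflag=direct

A real benchmark tool (bonnie++, fio, tiobench, ...) run against a
filesystem on top of the device is still a better comparison.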

Just my opinion.

regards,
iustin


Re: Any inexpensive hardware recommendations for PCI interface cards?

2008-02-08 Thread Iustin Pop
On Fri, Feb 08, 2008 at 08:54:55AM -0500, Justin Piszcz wrote:
 The promise tx4 pci works great and supports sata/300+ncq/etc $60-$70.

Wait, I used the TX4 PCI up until ~2.6.22 and, AFAIK, it didn't support
NCQ. Are you sure the current driver supports NCQ? I might then revive
that card :)

thanks,
iustin


Re: Any inexpensive hardware recommendations for PCI interface cards?

2008-02-08 Thread Iustin Pop
On Fri, Feb 08, 2008 at 02:24:15PM -0500, Justin Piszcz wrote:


 On Fri, 8 Feb 2008, Iustin Pop wrote:

 On Fri, Feb 08, 2008 at 08:54:55AM -0500, Justin Piszcz wrote:
 The promise tx4 pci works great and supports sata/300+ncq/etc $60-$70.

 Wait, I have used tx4 pci up until ~2.6.22 and it didn't support AFAIK
 ncq. Are you sure that current driver supports NCQ? I might then revive
 that card :)

 thanks,
 iustin


 Whoa nice catch, I meant the Promise 300 TX4 which now retails for $59.99 
 w/free ship.

 http://www.newegg.com/Product/Product.aspx?Item=N82E16816102062

:)

Actually, I meant exactly the Promise 300 TX4 (the board is in my hand: chip
says PDC40718). The HW supports NCQ, but the Linux sata_promise driver
didn't support NCQ when I tested it. Can someone confirm that it supports
NCQ today (2.6.24)?

iustin


Re: recommendations for stripe/chunk size

2008-02-06 Thread Iustin Pop
On Thu, Feb 07, 2008 at 01:31:16AM +0100, Keld Jørn Simonsen wrote:
 Anyway, why does a SATA-II drive not deliver something like 300 MB/s?

Wait, are you talking about a *single* drive?

In that case, it seems you are confusing the interface speed (300MB/s)
with the mechanical read speed (~80MB/s). If you are asking why a
single drive is limited to ~80 MB/s, I guess it's a matter of mechanics.
Even with NCQ or big readahead settings, ~80-100 MB/s is the highest
I've seen on 7200 RPM drives. And no, the drive does not wait for the
CPU to process the current data before reading the next chunk; drives
have a built-in read-ahead mechanism.
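
A quick way to see both numbers side by side is hdparm's
cached-versus-buffered test (assuming /dev/sda is the drive in question):

  # -T reads from the Linux buffer cache (no media access at all),
  # -t reads from the platters; the large gap shows the bottleneck
  # is mechanical, not the SATA link
  hdparm -tT /dev/sda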

Honestly, I have 10x as many problems with the low random I/O throughput
than with the (high, IMHO) sequential I/O speed.

regards,
iustin


Re: One Large md or Many Smaller md for Better Peformance?

2008-01-22 Thread Iustin Pop
On Tue, Jan 22, 2008 at 05:34:14AM -0600, Moshe Yudkowsky wrote:
 Carlos Carvalho wrote:

 I use reiser3 and xfs. reiser3 is very good with many small files. A
 simple test shows interactively perceptible results: removing large
 files is faster with xfs, removing large directories (ex. the kernel
 tree) is faster with reiser3.

 My current main concern about XFS and reiser3 is writebacks. The default  
 mode for ext3 is journal, which in case of power failure is more  
 robust than the writeback modes of XFS, reiser3, or JFS -- or so I'm  
 given to understand.

 On the other hand, I have a UPS and it should shut down gracefully  
 regardless if there's a power failure. I wonder if I'm being too 
 cautious?

I'm not sure what your actual worry is. It's not like XFS loses
*committed* data on power failure. It may lose data that was never
required to go to disk via fsync()/fdatasync()/sync. If anything loses
data on power failure, it is the unprotected write cache of the
hard drive.

If you have properly-behaved applications, they know when to do an
fsync; and if XFS returns success on fsync and your Linux is properly
configured (no write-back caches on drives that are not backed by NVRAM,
etc.), then you won't lose data.
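
As an illustration of what "properly-behaved" means here, a minimal C
sketch (the file name is made up, error handling and short writes reduced
to the essentials):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("important.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
      if (fd < 0) { perror("open"); return 1; }
      const char buf[] = "data that must survive a power failure\n";
      if (write(fd, buf, sizeof(buf) - 1) < 0) { perror("write"); return 1; }
      /* only after fsync() returns 0 may the application assume the data
       * is on stable storage - assuming no unprotected write cache below */
      if (fsync(fd) != 0) { perror("fsync"); return 1; }
      return close(fd);
  }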

regards,
iustin


Re: One Large md or Many Smaller md for Better Peformance?

2008-01-20 Thread Iustin Pop
On Sun, Jan 20, 2008 at 02:24:46PM -0600, Moshe Yudkowsky wrote:
 Question: with the same number of physical drives,  do I get better  
 performance with one large md-based drive, or do I get better  
 performance if I have several smaller md-based drives?

No expert here, but my opinion:
  - md code works better if there's only one array per physical drive,
    because it keeps statistics per array (like last accessed sector,
    etc.), and if you combine two arrays on the same drive these
    statistics are not exactly true anymore
  - simply separating 'application work areas' into different
    filesystems is IMHO enough, no need to separate the raid arrays too
  - if you download torrents, fragmentation is a real problem, so use a
    filesystem that knows how to preallocate space (XFS and maybe ext4;
    for XFS use xfs_io to set a bigger extent size for where you
    download - see the sketch below)
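
A rough sketch of the XFS part (the directory name and sizes are only
examples, not recommendations):

  # set a larger extent size hint on the download directory; files
  # created in it afterwards inherit the hint
  xfs_io -c "extsize 64m" /data/torrents

  # optionally preallocate space for a file of known size (here ~700 MB)
  xfs_io -f -c "resvsp 0 700m" /data/torrents/file.iso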

regards,
iustin


Re: help diagnosing bad disk

2007-12-19 Thread Iustin Pop
On Wed, Dec 19, 2007 at 01:18:21PM -0500, Jon Sabo wrote:
 So I was trying to copy over some Indiana Jones wav files and it
 wasn't going my way.  I noticed that my software raid device showed:
 
 /dev/md1 on / type ext3 (rw,errors=remount-ro)
 
 Is this saying that it was remounted, read only because it found a
 problem with the md1 meta device?  That's what it looks like it's
 saying but I can still write to /.

FYI, it means that it is currently mounted rw, and if errors occur, it
will remount the filesystem read-only (as opposed to panicking).
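
For reference, that behaviour comes from the errors= mount option; a
typical /etc/fstab line (device and the remaining fields are just an
example) looks like:

  /dev/md1   /   ext3   defaults,errors=remount-ro   0   1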

regards,
iustin


Re: Time to deprecate old RAID formats?

2007-10-20 Thread Iustin Pop
On Sat, Oct 20, 2007 at 10:52:39AM -0400, John Stoffel wrote:
 Michael Well, I strongly, completely disagree.  You described a
 Michael real-world situation, and that's unfortunate, BUT: for at
 Michael least raid1, there ARE cases, pretty valid ones, when one
 Michael NEEDS to mount the filesystem without bringing up raid.
 Michael Raid1 allows that.
 
 Please describe one such case please.

Boot from a raid1 array, such that everything - including the partition
table itself - is mirrored.

iustin


Re: Time to deprecate old RAID formats?

2007-10-19 Thread Iustin Pop
On Fri, Oct 19, 2007 at 02:39:47PM -0400, John Stoffel wrote:
 And if putting the superblock at the end is problematic, why is it the
 default?  Shouldn't version 1.1 be the default?  

In my opinion, having the superblock *only* at the end (e.g. the 0.90
format) is the best option.

It allows one to mount the disk separately (in case of RAID 1), if the
MD superblock is corrupt or you just want to get easily at the raw data.

As for the people who complained precisely because of this feature, LVM has
two mechanisms to protect against accessing PVs on the raw disks (the
option to ignore raid components, and the device filter - I always set
filters when using LVM on top of MD; see the sketch below).
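
A sketch of those settings, in the devices section of lvm.conf (the
regular expressions are only an example and depend on your device naming):

  devices {
      # accept the MD arrays as PVs, reject the raw component disks
      filter = [ "a|^/dev/md|", "r|.*|" ]
      # additionally let LVM skip anything that looks like an md component
      md_component_detection = 1
  }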

regards,
iustin


Re: [PATCH] Expose the degraded status of an assembled array through sysfs

2007-10-10 Thread Iustin Pop
On Mon, Sep 10, 2007 at 06:51:14PM +0200, Iustin Pop wrote:
 The 'degraded' attribute is useful to quickly determine if the array is
 degraded, instead of parsing 'mdadm -D' output or relying on the other
 techniques (number of working devices against number of defined devices, 
 etc.).
 The md code already keeps track of this attribute, so it's useful to export 
 it.
 
 Signed-off-by: Iustin Pop [EMAIL PROTECTED]
 ---
 Note: I sent this back in January and people agreed it was a good
 idea.  However, it has not been picked up. So here I resend it again.

Ping? Neil, could you spare a few moments to look at this? (and sorry for
bothering you)

 
 Patch is against 2.6.23-rc5
 
 Thanks,
 Iustin Pop
 
  drivers/md/md.c |7 +++
  1 files changed, 7 insertions(+), 0 deletions(-)
 
 diff --git a/drivers/md/md.c b/drivers/md/md.c
 index f883b7e..3e3ad71 100644
 --- a/drivers/md/md.c
 +++ b/drivers/md/md.c
 @@ -2842,6 +2842,12 @@ sync_max_store(mddev_t *mddev, const char *buf, size_t len)
  static struct md_sysfs_entry md_sync_max =
  __ATTR(sync_speed_max, S_IRUGO|S_IWUSR, sync_max_show, sync_max_store);
  
 +static ssize_t
 +degraded_show(mddev_t *mddev, char *page)
 +{
 + return sprintf(page, "%i\n", mddev->degraded);
 +}
 +static struct md_sysfs_entry md_degraded = __ATTR_RO(degraded);
  
  static ssize_t
  sync_speed_show(mddev_t *mddev, char *page)
 @@ -2985,6 +2991,7 @@ static struct attribute *md_redundancy_attrs[] = {
  &md_suspend_lo.attr,
  &md_suspend_hi.attr,
  &md_bitmap.attr,
 + &md_degraded.attr,
   NULL,
  };
  static struct attribute_group md_redundancy_group = {
 -- 
 1.5.3.1
 


Re: Speaking of network disks (was: Re: syncing remote homes.)

2007-09-22 Thread Iustin Pop
On Sat, Sep 22, 2007 at 10:28:44AM -0700, Mr. James W. Laferriere wrote:
   Hello Bill  all ,

 Bill Davidsen [EMAIL PROTECTED] Sat, 22 Sep 2007 09:41:40 -0400 ,  wrote:
 My only advice is to try and quantify the data volume and look at nbd 
 vs. iSCSI to provide the mirror if you go that way.

   You mentioned nbd as a transport for disk to remote disk .

   My Question is have you OR anyone else tried using drbd(*) as a
   method to replicate disk data across networks ?

I have used it only in local networks, but it works very well. It's
much, much better than md + nbd, for example, because it was designed
with the network in mind - so it deals gracefully with transient network
errors and such.

And the current version (8.x) is also more flexible than the previous
versions. I'd recommend you give it a try.

regards,
iustin


Re: MD RAID1 performance very different from non-RAID partition

2007-09-15 Thread Iustin Pop
On Sat, Sep 15, 2007 at 12:28:07AM -0500, Jordan Russell wrote:
 (Kernel: 2.6.18, x86_64)
 
 Is it normal for an MD RAID1 partition with 1 active disk to perform
 differently from a non-RAID partition?
 
 md0 : active raid1 sda2[0]
   8193024 blocks [2/1] [U_]
 
 I'm building a search engine database onto this partition. All of the
 source data is cached into memory already (i.e., only writes should be
 hitting the disk).
 If I mount the partition as /dev/md0, building the database consistently
 takes 18 minutes.
 If I stop /dev/md0 and mount the partition as /dev/sda2, building the
 database consistently takes 31 minutes.
 
 Why the difference?

Maybe it's because md doesn't support barriers whereas the disks
support them? In this case some filesystems, for example XFS, will work
faster on raid1 because they can't force flushes to disk using
barriers.
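
If you want to check that guess, XFS normally complains at mount time
when it has to disable barriers; something along these lines should show
it (the exact message text varies by kernel version):

  # look for barrier-related messages from the mount
  dmesg | grep -i barrier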

Just a guess...

regards,
iustin


Re: MD RAID1 performance very different from non-RAID partition

2007-09-15 Thread Iustin Pop
On Sat, Sep 15, 2007 at 02:18:19PM +0200, Goswin von Brederlow wrote:
 Shouldn't it be the other way around? With a barrier the filesystem
 can enforce an order on the data written and can then continue writing
 data to the cache. More data is queued up for write. Without barriers
 the filesystem should do a sync at that point and have to wait for the
 write to fully finish. So less is put into cache.

I don't know in general, but XFS will simply not issue any sync at all
if the block device doesn't support barriers. It's the sysadmin's job to
either ensure you have barriers or turn off the write cache on the disks
(see the XFS FAQ, for example).
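
For the second option, turning off the on-disk write cache is a
one-liner per drive (sda/sdb here are placeholders; you'd also want to
make it persistent across reboots, e.g. from an init script):

  # disable the volatile write cache on each member disk
  hdparm -W 0 /dev/sda
  hdparm -W 0 /dev/sdb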

However, I never saw such behaviour from MD (i.e. claiming the write has
completed while the disk underneath is still receiving data to write
from Linux) so I'm not sure this is what happens here. In my experience,
MD acknowledges a write only when it has been pushed to the drive (write
cache enabled or not) and there is no buffer between MD and the drive.

regards,
iustin


Re: reducing the number of disks a RAID1 expects

2007-09-10 Thread Iustin Pop
On Sun, Sep 09, 2007 at 09:31:54PM -1000, J. David Beutel wrote:
 [EMAIL PROTECTED] ~]# mdadm --grow /dev/md5 -n2
 mdadm: Cannot set device size/shape for /dev/md5: Device or resource busy

 mdadm - v1.6.0 - 4 June 2004
 Linux 2.6.12-1.1381_FC3 #1 Fri Oct 21 03:46:55 EDT 2005 i686 athlon i386 
 GNU/Linux

I'm not sure that such an old kernel supports reshaping an array. The
mdadm version should not be a problem, as that message is probably
generated by the kernel.

I'd recommend trying to boot with a newer kernel, even if only for the
duration of the reshape.

regards,
iustin


[PATCH] Expose the degraded status of an assembled array through sysfs

2007-09-10 Thread Iustin Pop
The 'degraded' attribute is useful to quickly determine if the array is
degraded, instead of parsing 'mdadm -D' output or relying on the other
techniques (number of working devices against number of defined devices, etc.).
The md code already keeps track of this attribute, so it's useful to export it.

Signed-off-by: Iustin Pop [EMAIL PROTECTED]
---
Note: I sent this back in January and people agreed it was a good
idea.  However, it has not been picked up. So here I resend it again.

Patch is against 2.6.23-rc5

Thanks,
Iustin Pop

 drivers/md/md.c |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index f883b7e..3e3ad71 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2842,6 +2842,12 @@ sync_max_store(mddev_t *mddev, const char *buf, size_t len)
 static struct md_sysfs_entry md_sync_max =
 __ATTR(sync_speed_max, S_IRUGO|S_IWUSR, sync_max_show, sync_max_store);
 
+static ssize_t
+degraded_show(mddev_t *mddev, char *page)
+{
+   return sprintf(page, "%i\n", mddev->degraded);
+}
+static struct md_sysfs_entry md_degraded = __ATTR_RO(degraded);
 
 static ssize_t
 sync_speed_show(mddev_t *mddev, char *page)
@@ -2985,6 +2991,7 @@ static struct attribute *md_redundancy_attrs[] = {
	&md_suspend_lo.attr,
	&md_suspend_hi.attr,
	&md_bitmap.attr,
+   &md_degraded.attr,
NULL,
 };
 static struct attribute_group md_redundancy_group = {
-- 
1.5.3.1
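
For completeness, once merged the new attribute would be read like any
other md sysfs file (md0 is just an example array; the value is the count
of missing/failed members, so 0 means fully redundant):

  cat /sys/block/md0/md/degraded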



[PATCH] Explain the read-balancing algorithm for RAID1 better in md.4

2007-09-10 Thread Iustin Pop
There are many questions on the mailing list about the RAID1 read
performance profile. This patch adds a new paragraph to the RAID1
section in md.4 that details what kind of speed-up one should expect
from RAID1.

Signed-off-by: Iustin Pop [EMAIL PROTECTED]
---
this patch is against the git tree of mdadm.

 md.4 |7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/md.4 b/md.4
index cf423cb..db39aba 100644
--- a/md.4
+++ b/md.4
@@ -168,6 +168,13 @@ All devices in a RAID1 array should be the same size.  If 
they are
 not, then only the amount of space available on the smallest device is
 used (any extra space on other devices is wasted).
 
+Note that the read balancing done by the driver does not make the RAID1
+performance profile be the same as for RAID0; a single stream of
+sequential input will not be accelerated (e.g. a single dd), but
+multiple sequential streams or a random workload will use more than one
+spindle. In theory, having an N-disk RAID1 will allow N sequential
+threads to read from all disks.
+
 .SS RAID4
 
 A RAID4 array is like a RAID0 array with an extra device for storing
-- 
1.5.3.1
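
A trivial way to see the documented behaviour (file names are
placeholders; use files larger than RAM so the page cache doesn't get in
the way):

  # a single sequential reader: roughly single-disk speed
  dd if=/data/big1 of=/dev/null bs=1M

  # two sequential readers in parallel: both spindles should be used
  dd if=/data/big1 of=/dev/null bs=1M &
  dd if=/data/big2 of=/dev/null bs=1M &
  wait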



Re: Using my Mirror disk to boot up.

2007-08-29 Thread Iustin Pop
On Wed, Aug 29, 2007 at 08:25:59PM -0700, chee wrote:
 
 i,
 
 This is my Filesystem:
 
 Filesystem Size Used Avail Use% Mounted on
 /dev/md0 9.7G 6.6G 2.7G 72% /
 none 189M 0 189M 0% /dev/shm
 /dev/md2 103G 98G 289M 100% /home
 
 and this is mirror settings:
 
 Personalities : [raid1]
 md1 : active raid1 hda2[0] hdd2[1]
 512000 blocks [2/2] [UU]
 
 md2 : active raid1 hdd3[1] hda3[0]
 109298112 blocks [2/2] [UU]
 
 md0 : active raid1 hdd1[1] hda1[0]
 10240128 blocks [2/2] [UU]
 
 The problem i am facing is my mirror disk does not seem to boot up when i
 swap hard disk to test where my mirroring disk is working. The only thing i
 see was this 'LI' in the monitor and hangs there.
 
The problem is (most likely) that your mirrors only cover the
hd[ad][123] partitions, and not the whole disk. Thus, the MBR of hda is
not synchronized to hdd.

You can do two things here:
  - fix your lilo.conf to correctly write the boot record to both hda and
    hdd (IIRC, you need the directive raid-extra-boot=mbr or
    raid-extra-boot=mbr-only, depending on how exactly you install lilo);
    see the sketch below
  - change to a partitionable raid array instead of three arrays (one
    for each partition); that will also cover the MBR of the drive
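
A minimal lilo.conf sketch for the first option (device names, label and
the kernel image path are only placeholders for this particular setup -
adjust before re-running lilo):

  boot=/dev/md0
  # also write a boot record to the MBR of every disk in the array
  raid-extra-boot=mbr-only
  root=/dev/md0
  image=/boot/vmlinuz
      label=linux
      read-only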


regards,
iustin


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Iustin Pop
On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:
 
 On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:
 
  now, I am not an expert on either option, but three are a couple things 
  that I
  would question about the DRDB+MD option
 
  1. when the remote machine is down, how does MD deal with it for reads and
  writes?
 
 I suppose it kicks the drive and you'd have to re-add it by hand unless done 
 by
 a cronjob.

From my tests, since NBD doesn't have a timeout option, MD hangs in the
write to that mirror indefinitely, somewhat like when dealing with a
broken IDE driver/chipset/disk.

  2. MD over local drive will alternate reads between mirrors (or so I've been
  told), doing so over the network is wrong.
 
 Certainly. In which case you set write_mostly (or even write_only, not sure
 of its name) on the raid component that is nbd.
 
  3. when writing, will MD wait for the network I/O to get the data saved on 
  the
  backup before returning from the syscall? or can it sync the data out lazily
 
 Can't answer this one - ask Neil :)

MD has the write-mostly/write-behind options - which help in this case
but only up to a certain amount.


In my experience DRBD wins hands-down over MD+NBD because MD doesn't
know about (or handle) a component that never returns from a write, which is
quite different from returning with an error. Furthermore, DRBD was
designed to handle transient errors in the connection to the peer due to
its network-oriented design, whereas MD is mostly designed for local or
at least high-reliability disks (where a disk can be SAN, SCSI, etc.), and
such a failure is not normal for MD. Hence the need for manual reconnects
in the MD case and the automated handling of reconnects in the case of DRBD.

I'm just a happy user of both MD over local disks and DRBD for networked
raid.

regards,
iustin


Re: Customize the error emails of `mdadm --monitor`

2007-06-06 Thread Iustin Pop
On Wed, Jun 06, 2007 at 01:31:44PM +0200, Peter Rabbitson wrote:
 Peter Rabbitson wrote:
 Hi,
 
 Is there a way to list the _number_ in addition to the name of a 
 problematic component? The kernel trend to move all block devices into 
 the sdX namespace combined with the dynamic name allocation renders 
 messages like /dev/sdc1 has problems meaningless. It would make remote 
 server support so much easier, by allowing the administrator to label 
 drive trays Component0 Component1 Component2... etc, and be sure that 
 the local tech support person will not pull out the wrong drive from the 
 system.
 
 
 Any takers? Or is it a RTFM question (in which case I certainly 
 overlooked the relevant doc)?

If you use udev, have you looked in /dev/disk? I think it solves the
problem you have by allowing one to see the disks either by id or by
path. Making the reverse map is then trivial (for a reasonable number of
disks).
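
For example, the stable names live right next to the kernel names and can
be listed with plain ls (no special tooling needed):

  # map sdX back to a stable identity (model/serial) or physical path
  ls -l /dev/disk/by-id/
  ls -l /dev/disk/by-path/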

regards,
iustin


Re: Customize the error emails of `mdadm --monitor`

2007-06-06 Thread Iustin Pop
On Wed, Jun 06, 2007 at 02:23:31PM +0200, Peter Rabbitson wrote:
 Iustin Pop wrote:
 On Wed, Jun 06, 2007 at 01:31:44PM +0200, Peter Rabbitson wrote:
 Peter Rabbitson wrote:
 Hi,
 
 Is there a way to list the _number_ in addition to the name of a 
 problematic component? The kernel trend to move all block devices into 
 the sdX namespace combined with the dynamic name allocation renders 
 messages like /dev/sdc1 has problems meaningless. It would make remote 
 server support so much easier, by allowing the administrator to label 
 drive trays Component0 Component1 Component2... etc, and be sure that 
 the local tech support person will not pull out the wrong drive from the 
 system.
 
 Any takers? Or is it a RTFM question (in which case I certainly 
 overlooked the relevant doc)?
 
 If you use udev, have you looked in /dev/disk? I think it solves the
 problem you need by allowing one to see either the disks by id or by
 path. Making the reverse map is then trivial (for a reasonable number of
 disks).
 
 
 This would not work as arrays are assembled by the kernel at boot time, 
 at which point there is no udev or anything else for that matter other 
 than /dev/sdX. And I am pretty sure my OS (debian) does not support udev 
 in initrd as of yet.

Ah, I see. But then, sysfs should help (I presume sysfs, being a standard
kernel filesystem, can be mounted in the initrd). I think that most of
the information the kernel has for the device is present in sysfs. At
least a crude form of mapping to the real controller path is available
via the symlink /sys/block/sdN/device. I don't know if it really helps
your case.
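
Something along these lines works even in an initrd with nothing but
sysfs mounted (sda is a placeholder):

  # the 'device' symlink encodes the controller/port the disk hangs off
  readlink /sys/block/sda/device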

iustin


Re: Same UUID for every member of all array ?

2007-04-13 Thread Iustin Pop
On Thu, Apr 12, 2007 at 02:57:57PM +0200, Brice Figureau wrote:
 Now, I don't know why all the UUID are equals (my other machines are not
 affected).
I think at some point, either in sarge or in testing between sarge and
etch, a version of mdadm was included which had this bug (all
arrays got the same uuid). Yeah, it bit me too a little :)

 Is there a possibility to hot change the UUID of each array (and
 change the corresponding superblocks of each member) so that my next
 boot will work ?
Did you read the manpage for mdadm (the version in etch)? It has a -U
argument to assemble which does what you want.
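
From memory, it would be something along these lines, once per array
(device names are placeholders; double-check the mdadm(8) man page first):

  # reassemble once with --update so the superblocks get a fresh uuid
  mdadm --assemble /dev/md0 --update=uuid /dev/sda1 /dev/sdb1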

regards,
iustin


Re: raid1 does not seem faster

2007-04-09 Thread Iustin Pop
On Mon, Apr 09, 2007 at 06:53:26AM -0400, Justin Piszcz wrote:
 Using 2 threads made no difference either.
 
 It was not until I did 3 simultaneous copies that I saw 110-130MB/s 
 through vmstat 1, until then, it only used one drive, even with two cp's, 
 how come it needs to be three or more?

Because, as I understand it, it's an optimisation, not a rule. Quoting
from the manpage (md):
Once initialised, each device in a RAID1 array contains exactly the
same data.  Changes are written to all devices in parallel.  Data is
read from any one device.  The driver attempts to distribute read
requests across all devices to maximise performance.

The key word here is "attempts". I looked over the source code a while
ago and IIRC it tries to direct a read request to the drive whose head is
closest to the requested sector, or, if that's not possible, to a random
drive. To me, this seems a good strategy, which optimises
server-type workloads.

regards,
iustin


Re: raid1 does not seem faster

2007-04-05 Thread Iustin Pop
On Wed, Apr 04, 2007 at 07:11:50PM -0400, Bill Davidsen wrote:
 You are correct, but I think if an optimization were to be done, some 
 balance between the read time, seek time, and read size could be done. 
 Using more than one drive only makes sense when the read transfer time is 
 significantly longer than the seek time. With an aggressive readahead set 
 for the array that would happen regularly.
 
 It's possible, it just takes the time to do it, like many other nice 
 things.

Maybe yes, but why optimise the single-reader case? raid1 can already
read in parallel from the drives when multiple processes read from the
raid1. Optimising the single reader can help in hdparm or other
benchmark cases, but in real life I very often see the total throughput
of a (two drive) raid1 being around twice the throughput of a single
drive.

regards,
iustin


Re: raid1 does not seem faster

2007-04-05 Thread Iustin Pop
On Thu, Apr 05, 2007 at 04:11:35AM -0400, Justin Piszcz wrote:
 
 
 On Thu, 5 Apr 2007, Iustin Pop wrote:
 
 On Wed, Apr 04, 2007 at 07:11:50PM -0400, Bill Davidsen wrote:
 You are correct, but I think if an optimization were to be done, some
 balance between the read time, seek time, and read size could be done.
 Using more than one drive only makes sense when the read transfer time is
 significantly longer than the seek time. With an aggressive readahead set
 for the array that would happen regularly.
 
 It's possible, it just takes the time to do it, like many other nice
 things.
 
 Maybe yes, but why optimise the single-reader case? raid1 already can
 read in parallel from the drives when multiple processes read from the
 raid1. Optimising the single reader can help in hdparm or other
 benchmark cases, but in real life I see very often the total throughput
 of a (two drive) raid1 being around two times the throughput of a single
 drive.
 
 regards,
 iustin
 
 
 Really? I have copied a file from a SW RAID1 (5GB) and I only saw 
 60MB/s not the 120MB/s the (RAID1) is capable of to the 
 destination (which can easily do  160MB/s sustained read/write).

Did you copy it multi-threaded? I said *multiple readers* show improved
speed and you said "I copied *one* file". Try copying two files in
parallel.

I'm running, in two xterms, 'cat file1 > /dev/null' and 'cat file2 > /dev/null',
and my raid1 shows ~110 MB/s, each drive doing about half.  One file only
does about 60 MB/s (this is over a PCI raid controller so the max 110
MB/s is a PCI bus limitation).

Iustin


Re: Unexpectedly slow raid1 benchmark results.

2007-03-04 Thread Iustin Pop
On Sun, Mar 04, 2007 at 04:47:19AM -0800, Dan wrote:
 Just about the only stat in these tests that show a marked improvement between
 one and two drives is Random Seeks (which makes sense).  What doesn't make 
 sense
 is that none of the Sequential Input numbers increase.  Shouldn't I be seeing
 close to a 100% improvement?

This is not raid0! Since the drives are identical copies, you can't
really optimize sequential input, since the input data is not
striped... OTOH, if you have two processes doing sequential input, then
yes, each drive should work close to or at its normal sequential input
speed.

Regards,
Iustin


Re: [PATCH 2.6.20-rc6] md: expose uuid and degraded attributes in sysfs

2007-02-10 Thread Iustin Pop
On Sat, Jan 27, 2007 at 02:59:48AM +0100, Iustin Pop wrote:
 From: Iustin Pop [EMAIL PROTECTED]
 
 This patch exposes the uuid and the degraded status of an assembled
 array through sysfs.
[...]

Sorry to ask, this was my first patch and I'm not sure what is the
procedure to get it considered for merging... I was under the impression
that just sending it to this list is enough. What do I have to do?

Thanks,
Iustin


Re: [PATCH 2.6.20-rc6] md: expose uuid and degraded attributes in sysfs

2007-02-10 Thread Iustin Pop
On Sun, Feb 11, 2007 at 08:15:31AM +1100, Neil Brown wrote:
 Resending after a suitable pause (1-2 weeks) is never a bad idea.
Ok, noted, thanks.

 Exposing the UUID isn't - and if it were it should be in
 md_default_attrs rather than md_redundancy_attrs.
 
 The UUID isn't an intrinsic aspect of the array.  It is simply part of
 the metadata that is used to match up different devices from the same
 array.
I see. Unfortunately, for now it's the only method of (more or less)
persistently identifying the array.

 I plan to add support for the 'DDF' metadata format (an 'industry
 standard') and that will be managed entirely in user-space.  The
 kernel won't know the uuid at all.
I've briefly looked over the spec, but this seems a non-trivial change,
away from current md superblocks to ddf... But the virtual disk GUIDs
seem nice. In the meantime, probably the solution you gave below is
best.

 So any solution for easy access to uuids should be done in user-space.
 Maybe mdadm could create a link
/dev/md/by-uuid/ - /dev/whatever.
 ??
That sounds like a good idea. mdadm (or udev or another userspace
solution) should work, given some safety measures against stale symlinks
and such. It seems to me that, since it's now possible to assemble
arrays without mdadm (by using sysfs), mdadm is not the best place to do
it. Probably relying on udev is a better option; however, right now it
seems that it gets only the block add events, and not the block remove
ones.

Thanks,
Iustin


[PATCH 2.6.20-rc6] md: expose uuid and degraded attributes in sysfs

2007-01-26 Thread Iustin Pop
From: Iustin Pop [EMAIL PROTECTED]

This patch exposes the uuid and the degraded status of an assembled
array through sysfs.

The uuid is useful in the case when multiple arrays exist on a system
and userspace needs to identify them; currently, the only portable way
that I know of is using 'mdadm -D' on each device until the desired uuid
is found. Having the uuid visible in sysfs is much cleaner, IMHO.
A note on the method used to format the uuid: I'm not sure this is the
best way; I've copied and adapted the one in print_sb.

The 'degraded' attribute is also useful to quickly determine if the
array is degraded, instead of, again, parsing 'mdadm -D' output or
relying on the other techniques (number of working devices against
number of defined devices, etc.). The md code already keeps track of
this attribute, so it's useful to export it.

Signed-off-by: Iustin Pop [EMAIL PROTECTED]
---

--- linux-2.6.20-rc6/drivers/md/md.c.orig	2007-01-27 02:31:11.496575360 +0100
+++ linux-2.6.20-rc6/drivers/md/md.c	2007-01-27 02:32:51.746741201 +0100
@@ -2856,6 +2856,22 @@
 static struct md_sysfs_entry md_suspend_hi =
 __ATTR(suspend_hi, S_IRUGO|S_IWUSR, suspend_hi_show, suspend_hi_store);
 
+static ssize_t
+uuid_show(mddev_t *mddev, char *page)
+{
+   __u32 *p = (__u32*)mddev->uuid;
+   return sprintf(page, "%08x:%08x:%08x:%08x\n", p[0], p[1], p[2], p[3]);
+}
+static struct md_sysfs_entry md_uuid =
+__ATTR_RO(uuid);
+
+static ssize_t
+degraded_show(mddev_t *mddev, char *page)
+{
+   return sprintf(page, "%i\n", mddev->degraded);
+}
+static struct md_sysfs_entry md_degraded =
+__ATTR_RO(degraded);
 
 static struct attribute *md_default_attrs[] = {
	&md_level.attr,
@@ -2881,6 +2897,8 @@
	&md_suspend_lo.attr,
	&md_suspend_hi.attr,
	&md_bitmap.attr,
+   &md_uuid.attr,
+   &md_degraded.attr,
NULL,
 };
 static struct attribute_group md_redundancy_group = {
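
If this patch were applied, the attributes would show up next to the
existing md sysfs entries and could be read directly (md0 is just an
example array; the uuid is printed in the %08x:%08x:%08x:%08x form used
above):

  cat /sys/block/md0/md/uuid
  cat /sys/block/md0/md/degraded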