Re: + md-raid10-fix-use-after-free-of-bio.patch added to -mm tree

2007-07-30 Thread Neil Brown
On Saturday July 28, [EMAIL PROTECTED] wrote:
 
 The patch titled
  md: raid10: fix use-after-free of bio
 has been added to the -mm tree.  Its filename is
  md-raid10-fix-use-after-free-of-bio.patch
 
 *** Remember to use Documentation/SubmitChecklist when testing your code ***
 
 See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
 out what to do about this
 
 --
 Subject: md: raid10: fix use-after-free of bio
 From: Maik Hampel [EMAIL PROTECTED]
 
 In case of read errors raid10d tries to print a nice error message,
 unfortunately using data from an already put bio.
 

Thanks for catching that Maik!

 
 diff -puN drivers/md/raid10.c~md-raid10-fix-use-after-free-of-bio drivers/md/raid10.c
 --- a/drivers/md/raid10.c~md-raid10-fix-use-after-free-of-bio
 +++ a/drivers/md/raid10.c
 @@ -1534,7 +1534,6 @@ static void raid10d(mddev_t *mddev)
  	bio = r10_bio->devs[r10_bio->read_slot].bio;
  	r10_bio->devs[r10_bio->read_slot].bio =
  		mddev->ro ? IO_BLOCKED : NULL;
 -	bio_put(bio);
  	mirror = read_balance(conf, r10_bio);
  	if (mirror == -1) {
  		printk(KERN_ALERT "raid10: %s: unrecoverable I/O"
 @@ -1542,8 +1541,10 @@ static void raid10d(mddev_t *mddev)
  			bdevname(bio->bi_bdev,b),
  			(unsigned long long)r10_bio->sector);
  		raid_end_bio_io(r10_bio);
 +		bio_put(bio);

and for catching that Andrew!

Acked-By: NeilBrown [EMAIL PROTECTED]


  	} else {
  		const int do_sync = bio_sync(r10_bio->master_bio);
 +		bio_put(bio);
  		rdev = conf->mirrors[mirror].rdev;
  		if (printk_ratelimit())
  			printk(KERN_ERR "raid10: %s: redirecting sector %llu to"
 _
 
 Patches currently in -mm which might be from [EMAIL PROTECTED] are
 
 md-raid10-fix-use-after-free-of-bio.patch


Re: raid1 resync data direction defined?

2007-07-30 Thread Luca Berra

On Fri, Jul 27, 2007 at 03:07:13PM +0200, Frank van Maarseveen wrote:

I'm experimenting with a live migration of /dev/sda1 using mdadm -B
and network block device as in:

mdadm -B -ayes -n2 -l1 /dev/md1 /dev/sda1 \
--write-mostly -b /tmp/bitm$$ --write-behind /dev/nbd1

not a good idea


/dev/sda1 is to be migrated. During the migration the local system
mounts from /dev/md1 instead. Stracing shows that data flows to the
remote side. But when I do
echo repair > /sys/block/md1/md/sync_action

then the data flows in the other direction: the local disk is written
using data read from the remote side.

I believe stracing nbd will give you only a partial view of what happens.
Anyway, in the first case, since the second device is write-mostly, all
data is read from the local device and changes are written to the remote one.
In the second case the data is read from both sides to be compared; that
is what you are seeing in strace. I am unsure as to which copy is
considered correct, since md does not have info about that.


If that would happen in the first command then it would destroy all

yes

data instead of migrating it so I wonder if this behavior is defined:

no

Do mdadm --build and mdadm --create always use the first component device
on the command-line as the source for raid1 resync?

no

If you are doing a migration, build the initial array with the second
device as missing, then hot-add it and it will resync correctly, i.e.:

mdadm -B -ayes -n2 -l1 /dev/md1 /dev/sda1 \
   --write-mostly -b /tmp/bitm$$ --write-behind missing
mdadm -a /dev/md1 /dev/sda1


--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media & Services S.r.l.
 /\
 \ /   ASCII RIBBON CAMPAIGN
  X    AGAINST HTML MAIL
 / \


Re: md: raid10: fix use-after-free of bio

2007-07-30 Thread Maik Hampel
On Saturday, 2007-07-28 at 23:55 -0700, Andrew Morton wrote:
 On Fri, 27 Jul 2007 16:46:23 +0200 Maik Hampel [EMAIL PROTECTED] wrote:
 
  In case of read errors raid10d tries to print a nice error message,
  unfortunately using data from an already put bio.
  
  Signed-off-by: Maik Hampel [EMAIL PROTECTED]
  
  diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
  index f730a14..ea1b3e3 100644
  --- a/drivers/md/raid10.c
  +++ b/drivers/md/raid10.c
  @@ -1557,7 +1557,6 @@ static void raid10d(mddev_t *mddev)
   	bio = r10_bio->devs[r10_bio->read_slot].bio;
   	r10_bio->devs[r10_bio->read_slot].bio =
   		mddev->ro ? IO_BLOCKED : NULL;
   -	bio_put(bio);
   	mirror = read_balance(conf, r10_bio);
   	if (mirror == -1) {
   		printk(KERN_ALERT "raid10: %s: unrecoverable I/O"
   @@ -1567,6 +1566,7 @@ static void raid10d(mddev_t *mddev)
   		raid_end_bio_io(r10_bio);
   	} else {
   		const int do_sync = bio_sync(r10_bio->master_bio);
   +		bio_put(bio);
   		rdev = conf->mirrors[mirror].rdev;
   		if (printk_ratelimit())
   			printk(KERN_ERR "raid10: %s: redirecting sector %llu to"
  
  
 
 Surely we just leaked that bio if (mirror == -1)?
 
 better:
 
 --- a/drivers/md/raid10.c~md-raid10-fix-use-after-free-of-bio
 +++ a/drivers/md/raid10.c
  @@ -1534,7 +1534,6 @@ static void raid10d(mddev_t *mddev)
   	bio = r10_bio->devs[r10_bio->read_slot].bio;
   	r10_bio->devs[r10_bio->read_slot].bio =
   		mddev->ro ? IO_BLOCKED : NULL;
  -	bio_put(bio);
   	mirror = read_balance(conf, r10_bio);
   	if (mirror == -1) {
   		printk(KERN_ALERT "raid10: %s: unrecoverable I/O"
  @@ -1542,8 +1541,10 @@ static void raid10d(mddev_t *mddev)
   		bdevname(bio->bi_bdev,b),
   		(unsigned long long)r10_bio->sector);
   	raid_end_bio_io(r10_bio);
  +	bio_put(bio);
raid_end_bio_io() calls put_all_bios(), which does a bio_put() on the
corresponding r10_bio->devs[i]. So this looks like redundant code to
me.
  	} else {
  		const int do_sync = bio_sync(r10_bio->master_bio);
  +		bio_put(bio);
  		rdev = conf->mirrors[mirror].rdev;
  		if (printk_ratelimit())
  			printk(KERN_ERR "raid10: %s: redirecting sector %llu to"

Regards, 
Maik Hampel




Re: Is it possible to grow a RAID-10 array with mdadm?

2007-07-30 Thread Neil Brown
On Sunday July 29, [EMAIL PROTECTED] wrote:
 Hi everyone,
 
 Is it possible to add drives to an active RAID-10 array, using the grow 
 switch with mdadm, just like it is possible with a RAID-5 array? Or perhaps 
 there is another way?
 
 I have been looking for this information for a long time but have been 
 unable to find it anywhere. The man page for mdadm does not mention RAID-10 
 at all so that didn't help either. Has anyone tried it?

The man page for mdadm does not mention it because it is not
supported.

There are several reshape options that I would like to implement
including
  - raid5 -> raid6
  - shrinking raid4/5/6
  - raid0 -> raid5
  - changing chunksize/layout of raid4/5/6
  - raid10 growing and layout change

unfortunately I haven't yet found/made the time.

Patches are always welcome :-)

NeilBrown


Re: Is it possible to grow a RAID-10 array with mdadm?

2007-07-30 Thread Tomas France

Thanks for the answer Neil!

The man page for mdadm does not mention it because it is not supported. 



It doesn't actually even mention the possibility to create a RAID-10 array 
(without creating RAID-0 on top of RAID-1 pairs), yet from the info I found, 
a lot of people have been using it for quite a while. Almost as if it was a 
complete secret ;) As for the RAID-10 growing / layout change - I'd 
absolutely love to see that implemented in the (hopefully near) future. 
IMHO, RAID-10 is becoming very popular because of the falling hard drive 
prices.


Tomas




Re: Is it possible to grow a RAID-10 array with mdadm?

2007-07-30 Thread Neil Brown
On Monday July 30, [EMAIL PROTECTED] wrote:
 Thanks for the answer Neil!
 
  The man page for mdadm does not mention it because it is not supported. 
  
 
 It doesn't actually even mention the possibility to create a RAID-10 array 
 (without creating RAID-0 on top of RAID-1 pairs), yet from the info I found, 
 a lot of people have been using it for quite a while. Almost as if it was a 
 complete secret ;) As for the RAID-10 growing / layout change - I'd 
 absolutely love to see that implemented in the (hopefully near) future. 
 IMHO, RAID-10 is becoming very popular because of the falling hard drive 
 prices.

What version of mdadm do you have installed? (The bottom of the man
page will tell you.)
My v2.6.2 man page mentions raid10 5 times, and "man md" mentions it 9
times.
If you have a recent mdadm and there was some particular place in the
man page where you were looking and didn't find raid10, please let me
know and I will try to improve that part of the documentation.

Thanks,
NeilBrown


bonnie++ benchmarks for ext2,ext3,ext4,jfs,reiserfs,xfs,zfs on software raid 5

2007-07-30 Thread Justin Piszcz

CONFIG:

Software RAID 5 (400GB x 6): Default mkfs parameters for all filesystems.
Kernel was 2.6.21 or 2.6.22, did these awhile ago.
Hardware was SATA with PCI-e only, nothing on the PCI bus.

ZFS was userspace+fuse of course.
Reiser was V3.
EXT4 was created using the recommended options on its project page.

RAW:

ext2,7760M,56728,96.,180505,51,85484,17.,50946.7,80.,235541,21.,373.667,0,16:10:16/64,2354,27,0,0,8455.67,14.6667,2211.67,26.,0,0,9724,22.
ext3,7760M,52702.7,94.,165005,60,82294.7,20.6667,52664,83.6667,258788,33.,335.8,0,16:10:16/64,858.333,10.6667,10250.3,28.6667,4084,15,897,12.6667,4024.33,12.,2754,11.
ext4,7760M,53129.7,95,164515,59.,101678,31.6667,62194.3,98.6667,266716,22.,405.767,0,16:10:16/64,1963.67,23.6667,0,0,20859,73.6667,1731,21.,9022,23.6667,16410,65.6667
jfs,7760M,54606,92,191997,52,112764,33.6667,63585.3,99,274921,22.,383.8,0,16:10:16/64,344,1,0,0,539.667,0,297.667,1,0,0,340,0
reiserfs,7760M,51056.7,96,180607,67,106907,38.,61231.3,97.6667,275339,29.,441.167,0,16:10:16/64,2516,60.6667,19174.3,60.6667,8194.33,54.,2011,42.6667,6963.67,19.6667,9168.33,68.6667
xfs,7760M,52985.7,93,158342,45,79682,14,60547.3,98,239101,20.,359.667,0,16:10:16/64,415,4,0,0,1774.67,10.6667,454,4.7,14526.3,40,1572,12.6667
zfs,7760M,25601,43.,32198.7,4,13266.3,2,44145.3,68.6667,129278,9,245.167,0,16:10:16/64,218.333,2,2698.33,11.6667,7434.67,14.,244,2,2191.33,11.6667,5613.33,13.

HTML

http://home.comcast.net/~jpiszcz/benchmark/allfs.html

THOUGHTS

Overall JFS seems the fastest, but reviewing the mailing list for JFS it 
seems like there are a lot of problems; especially for people who have used 
JFS > 1 year, their speed drops to 5 MiB/s over time, the defragfs tool has 
been removed(?) from the source/Makefile, and on Google it says not to use 
it due to corruption.


Justin.


Re: bonnie++ benchmarks for ext2,ext3,ext4,jfs,reiserfs,xfs,zfs on software raid 5

2007-07-30 Thread Dan Williams
[trimmed all but linux-raid from the cc]

On 7/30/07, Justin Piszcz [EMAIL PROTECTED] wrote:
 CONFIG:

 Software RAID 5 (400GB x 6): Default mkfs parameters for all filesystems.
 Kernel was 2.6.21 or 2.6.22, did these awhile ago.
Can you give 2.6.22.1-iop1 a try to see what effect it has on
sequential write performance?

Download:
http://downloads.sourceforge.net/xscaleiop/patches-2.6.22.1-iop1-x86fix.tar.gz

Unpack into your 2.6.22.1 source tree.  Install the x86 series file:
"cp patches/series.x86 patches/series".  Apply the series with quilt:
"quilt push -a".

I recommend trying the default chunk size and default
stripe_cache_size as my tests have shown improvement without needing
to perform any tuning.

Thanks,
Dan


Re: Homehost suddenly changed on some components

2007-07-30 Thread Max Amanshauser

For the record:

After reading in the archives about similar problems, which were  
probably caused by something else but still close enough, I recreated  
the array with the exact same parameters from the superblock and one  
missing disk.


 mdadm -C /dev/md0 -l 5 -n 10 -c 64 -p ls /dev/sdb1 /dev/sdd1 /dev/sde1 \
	/dev/hde1 /dev/hdb1 /dev/hdf1 /dev/hdh1 /dev/hdg1 /dev/sdc1 missing


Seems to have done the trick, fsck is working right now.


Funny things seem to happen to the superblocks more often than I  
thought. Recreating with one missing disk appears more like a hack  
than a solution to me. Maybe mdadm should have some kind of explicit  
superblock manipulation, like copying from other components or  
importing/exporting from/to a file, so such problems can be solved in  
a safe way?


Just a quick thought. :)


--
Regards,
Max Amanshauser



Re: bonnie++ benchmarks for ext2,ext3,ext4,jfs,reiserfs,xfs,zfs on software raid 5

2007-07-30 Thread Al Boldi
Justin Piszcz wrote:
 CONFIG:

 Software RAID 5 (400GB x 6): Default mkfs parameters for all filesystems.
 Kernel was 2.6.21 or 2.6.22, did these awhile ago.
 Hardware was SATA with PCI-e only, nothing on the PCI bus.

 ZFS was userspace+fuse of course.

Wow! Userspace and still that efficient.

 Reiser was V3.
 EXT4 was created using the recommended options on its project page.

 RAW:

 ext2,7760M,56728,96.,180505,51,85484,17.,50946.7,80.,235541,21.,373.667,0,16:10:16/64,2354,27,0,0,8455.67,14.6667,2211.67,26.,0,0,9724,22.
 ext3,7760M,52702.7,94.,165005,60,82294.7,20.6667,52664,83.6667,258788,33.,335.8,0,16:10:16/64,858.333,10.6667,10250.3,28.6667,4084,15,897,12.6667,4024.33,12.,2754,11.
 ext4,7760M,53129.7,95,164515,59.,101678,31.6667,62194.3,98.6667,266716,22.,405.767,0,16:10:16/64,1963.67,23.6667,0,0,20859,73.6667,1731,21.,9022,23.6667,16410,65.6667
 jfs,7760M,54606,92,191997,52,112764,33.6667,63585.3,99,274921,22.,383.8,0,16:10:16/64,344,1,0,0,539.667,0,297.667,1,0,0,340,0
 reiserfs,7760M,51056.7,96,180607,67,106907,38.,61231.3,97.6667,275339,29.,441.167,0,16:10:16/64,2516,60.6667,19174.3,60.6667,8194.33,54.3333,2011,42.6667,6963.67,19.6667,9168.33,68.6667
 xfs,7760M,52985.7,93,158342,45,79682,14,60547.3,98,239101,20.,359.667,0,16:10:16/64,415,4,0,0,1774.67,10.6667,454,4.7,14526.3,40,1572,12.6667

 zfs,7760M,

Dissecting some of these numbers:

  speed %cpu  
 25601,43.,
 32198.7,4,
 13266.3, 2,
 44145.3,68.6667,
 129278,9,
 245.167,0,

 16:10:16/64,

  speed %cpu  
 218.333,2,
 2698.33,11.6667,
 7434.67,14.,
 244,2,
 2191.33,11.6667,
 5613.33,13.

Extrapolating these %cpu numbers makes ZFS the fastest.
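
For example, taking just the sequential block-write figures from the RAW
lines above, a back-of-the-envelope comparison of throughput per %cpu looks
like this (illustration only; the two KB/s and %CPU pairs are copied
directly from the quoted data):

/* Back-of-the-envelope only: block-write KB/s and %CPU taken from the
 * RAW bonnie++ lines above; nothing here accounts for the CPU burned
 * by the zfs-fuse process itself. */
#include <stdio.h>

int main(void)
{
	double ext3_kbs = 165005.0, ext3_cpu = 60.0;	/* ext3 block write */
	double zfs_kbs  = 32198.7,  zfs_cpu  = 4.0;	/* zfs block write  */

	printf("ext3: %.0f KB/s per %%CPU\n", ext3_kbs / ext3_cpu);	/* ~2750 */
	printf("zfs:  %.0f KB/s per %%CPU\n", zfs_kbs / zfs_cpu);	/* ~8050 */
	return 0;
}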

Are you sure these numbers are correct?


Thanks!

--
Al



Re: bonnie++ benchmarks for ext2,ext3,ext4,jfs,reiserfs,xfs,zfs on software raid 5

2007-07-30 Thread Miklos Szeredi
 Extrapolating these %cpu number makes ZFS the fastest.
 
 Are you sure these numbers are correct?

Note that %cpu numbers for fuse filesystems are inherently skewed,
because the CPU usage of the filesystem process itself is not taken
into account.

So the numbers are not all that good, but according to the zfs-fuse
author it hasn't been optimized yet, so they may improve.

Miklos


Re: bonnie++ benchmarks for ext2,ext3,ext4,jfs,reiserfs,xfs,zfs on software raid 5

2007-07-30 Thread Dave Kleikamp
On Mon, 2007-07-30 at 10:29 -0400, Justin Piszcz wrote:

 Overall JFS seems the fastest, but reviewing the mailing list for JFS it
 seems like there are a lot of problems; especially for people who have used
 JFS > 1 year, their speed drops to 5 MiB/s over time, the defragfs tool has
 been removed(?) from the source/Makefile, and on Google it says not to use
 it due to corruption.

The defragfs tool was an unported holdover from OS/2, which is why it
was removed.  There never was a working Linux version.  I have some
ideas to improve jfs allocation to avoid fragmentation problems, but jfs
isn't my full-time job anymore, so I can't promise anything.  I'm not
sure about the corruption claims.  I'd like to hear some specifics on
that.

Anyway, for enterprise use, I couldn't recommend jfs, since there is no
full-time maintainer.

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center



Re: bonnie++ benchmarks for ext2,ext3,ext4,jfs,reiserfs,xfs,zfs on software raid 5

2007-07-30 Thread Justin Piszcz



On Mon, 30 Jul 2007, Miklos Szeredi wrote:


Extrapolating these %cpu number makes ZFS the fastest.

Are you sure these numbers are correct?


Note, that %cpu numbers for fuse filesystems are inherently skewed,
because the CPU usage of the filesystem process itself is not taken
into account.

So the numbers are not all that good, but according to the zfs-fuse
author it hasn't been optimized yet, so they may improve.

Miklos



This was performed on an E6300; one core was running ZFS/FUSE (or quite a
bit of it, anyway).




Re: bonnie++ benchmarks for ext2,ext3,ext4,jfs,reiserfs,xfs,zfs on software raid 5

2007-07-30 Thread Justin Piszcz



On Mon, 30 Jul 2007, Dan Williams wrote:


[trimmed all but linux-raid from the cc]

On 7/30/07, Justin Piszcz [EMAIL PROTECTED] wrote:

CONFIG:

Software RAID 5 (400GB x 6): Default mkfs parameters for all filesystems.
Kernel was 2.6.21 or 2.6.22, did these awhile ago.

Can you give 2.6.22.1-iop1 a try to see what effect it has on
sequential write performance?

Download:
http://downloads.sourceforge.net/xscaleiop/patches-2.6.22.1-iop1-x86fix.tar.gz

Unpack into your 2.6.22.1 source tree.  Install the x86 series file:
"cp patches/series.x86 patches/series".  Apply the series with quilt:
"quilt push -a".

I recommend trying the default chunk size and default
stripe_cache_size as my tests have shown improvement without needing
to perform any tuning.

Thanks,
Dan



Will keep it in mind for the next test, but like I said, these were from a
while ago.


Justin.


[PATCHSET/RFC] Refactor block layer to improve support for stacked devices.

2007-07-30 Thread Neil Brown

Hi,
 I have just sent a patch-set to linux-kernel that touches quite a
 number of block device drivers, with particular relevance to md and
 dm.

 Rather than fill lots of peoples mailboxes multiple times (35 patches
 in the set), I only sent the full set to linux-kernel, and am just
 sending this single notification to other relevant lists.

 If you want to look at the patch set (and please do) and are not
 subscribed to linux-kernel, you can view it here:

   http://lkml.org/lkml/2007/7/30/468
 
 or ask and I'll send you all 35 patches.

Below is the introductory email

Thanks,
NeilBrown



From: NeilBrown [EMAIL PROTECTED]
Sender: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Date:   Tue, 31 Jul 2007 12:15:45 +1000


The following 35(!) patches achieve a refactoring of some parts of the
block layer to provide better support for stacked devices.

The core issue is that of letting bio_add_page know the limitation
that the device imposes so that it doesn't create a bio that is too large.

For an unstacked disk device (e.g. scsi), bio_add_page can access
max_nr_sectors and max_nr_segments and some other details to know how
segments should be counted, and does the appropriate checks (this is a
simplification, but it is close enough for this discussion).

For stacked devices (dm, md etc) bio_add_page can also call into the
driver via merge_bvec_fn to find out if a page can be added to a bio.

This mostly works for a simple stack (e.g. md on scsi) but breaks down
with more complicated stacks (dm on md on scsi) as the recursive calls
to merge_bvec_fn that are required are difficult to get right, and
don't provide any guarantees in the face of array reconfiguration anyway.
dm and md both take the approach of: if the next level down defines
merge_bvec_fn, then set max_sectors to PAGE_SIZE/512 and live with small
requests.

So this patchset introduces a new approach.  bio_add_page is allowed
to create bios as big as it likes, and each layer is responsible for
splitting that bio up as required.

For intermediate levels like raid0, a number of new bios might be
created which refer to parts of the original, including parts of the
bi_io_vec.

For the bottom level driver (__make_request), each struct request
can refer to just part of a bio, so a bio can be effectively split
among several requests (a request can still reference multiple small
bios, and can conceivably list parts of large bios and some small bios
as well, though the merging required to achieve this isn't implemented
yet - that patch set is big enough as it is).

This requires that the bi_io_vec become immutable, and that certain
parts of the bio become immutable.

To achieve this, we introduce fields into the bio so that it can point
to just part of the bi_io_vec (an offset and a size) and introduce
similar fields into 'struct request' to refer to only part of a bio list.

I am keen to receive both review and testing.  I have tested it on
SATA drives with a range of md configurations, but haven't tested dm,
or ide-floppy, or various other bits that needed to be changed.

Probably the changes that are most likely to raise eyebrows involve
the code to iterate over the segments in a bio or in a 'struct
request', so I'll give a bit more detail about them here.

Previously these (bio_for_each_segment, rq_for_each_bio) were simple
macros that provided pointers into bi_io_vec.  

As the actual segments that a request might need to handle may no
longer be explicitly in bi_io_vec (e.g. an offset might need to be
added, or a size restriction might need to be imposed) this is no
longer possible.  Instead, these functions (now rq_for_each_segment
and bio_for_each_segment) fill in a 'struct bio_vec' with appropriate
values.  e.g.
  struct bio_vec bvec;
  struct bio_iterator i;
  bio_for_each_segment(bvec, bio, i)
use bvec.bv_page, bvec.bv_offset, bvec.bv_len

This might seem like data is being copied around a bit more, but it
should all be in L1 cache and could conceivable be optimised into
registers by the compiler, so I don't believe this is a big problem
(no, I haven't figured a good way to test it).

To achieve this, the for_each macros are now somewhat more complex.
For example, rq_for_each_segment is:

#define bio_for_each_segment_offset(bv, bio, _i, offs, _size)		\
	for (_i.i = 0, _i.offset = (bio)->bi_offset + offs,		\
	     _i.size = min_t(int, _size, (bio)->bi_size - offs);	\
	     _i.i < (bio)->bi_vcnt && _i.size > 0;			\
	     _i.i++)							\
		if (bv = *bio_iovec_idx((bio), _i.i),			\
		    bv.bv_offset += _i.offset,				\
		    bv.bv_len <= _i.offset				\
		    ? (_i.offset -= bv.bv_len, 0)			\
		    : (bv.bv_len -= _i.offset,				\
		       _i.offset = 0,
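
To make the clipping idea concrete, here is a minimal userspace sketch,
using simplified stand-in types rather than the kernel's struct bio and
struct bio_vec, of how an iterator restricted to an (offset, size) window
trims the segments it hands out:

#include <stdio.h>

struct seg { unsigned int offset, len; };	/* stand-in for a bio_vec entry */

/* Walk 'vcnt' segments, skipping 'skip' bytes at the front and visiting
 * at most 'size' bytes: segments wholly inside the skipped prefix are
 * dropped, the first visible segment is trimmed at its head, and the
 * last one is trimmed at its tail. */
static void for_each_clipped(const struct seg *v, int vcnt,
			     unsigned int skip, unsigned int size)
{
	for (int i = 0; i < vcnt && size > 0; i++) {
		struct seg s = v[i];

		if (s.len <= skip) {		/* entirely before the window */
			skip -= s.len;
			continue;
		}
		s.offset += skip;		/* clip the head */
		s.len -= skip;
		skip = 0;
		if (s.len > size)		/* clip the tail */
			s.len = size;
		size -= s.len;
		printf("segment: offset=%u len=%u\n", s.offset, s.len);
	}
}

int main(void)
{
	struct seg vec[] = { {0, 4096}, {0, 4096}, {0, 4096} };

	/* a window starting 1024 bytes into the vector, 6144 bytes long */
	for_each_clipped(vec, 3, 1024, 6144);
	return 0;
}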

[patch 07/26] md: Fix two raid10 bugs.

2007-07-30 Thread Greg KH

-stable review patch.  If anyone has any objections, please let us know.

--

1/ When resyncing a degraded raid10 which has more than 2 copies of each block,
  garbage can get synced on top of good data.

2/ We round the wrong way in part of the device size calculation, which
  can cause confusion.

Signed-off-by: Neil Brown [EMAIL PROTECTED]
Signed-off-by: Chris Wright [EMAIL PROTECTED]
Signed-off-by: Greg Kroah-Hartman [EMAIL PROTECTED]
---

 drivers/md/raid10.c |6 ++
 1 file changed, 6 insertions(+)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- linux-2.6.21.6.orig/drivers/md/raid10.c
+++ linux-2.6.21.6/drivers/md/raid10.c
@@ -1867,6 +1867,7 @@ static sector_t sync_request(mddev_t *md
 	int d = r10_bio->devs[i].devnum;
 	bio = r10_bio->devs[i].bio;
 	bio->bi_end_io = NULL;
+	clear_bit(BIO_UPTODATE, &bio->bi_flags);
 	if (conf->mirrors[d].rdev == NULL ||
 	    test_bit(Faulty, &conf->mirrors[d].rdev->flags))
 		continue;
@@ -2037,6 +2038,11 @@ static int run(mddev_t *mddev)
 	/* 'size' is now the number of chunks in the array */
 	/* calculate used chunks per device in 'stride' */
 	stride = size * conf->copies;
+
+	/* We need to round up when dividing by raid_disks to
+	 * get the stride size.
+	 */
+	stride += conf->raid_disks - 1;
 	sector_div(stride, conf->raid_disks);
 	mddev->size = stride << (conf->chunk_shift-1);
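
The second hunk is the usual round-up-before-integer-division trick; a tiny
standalone sketch (with made-up example numbers, not values from any real
array) of why the added 'stride += raid_disks - 1' matters:

#include <stdio.h>

int main(void)
{
	/* hypothetical example: 103 chunks, 2 copies, 5 raid disks */
	unsigned long long size = 103, copies = 2, raid_disks = 5;
	unsigned long long chunks = size * copies;

	unsigned long long rounded_down = chunks / raid_disks;
	unsigned long long rounded_up = (chunks + raid_disks - 1) / raid_disks;

	/* 206 chunks spread over 5 disks: 41 per disk is one too few */
	printf("round down: %llu, round up: %llu\n", rounded_down, rounded_up);
	return 0;
}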
 

-- 


[patch 08/26] md: Fix bug in error handling during raid1 repair.

2007-07-30 Thread Greg KH
-stable review patch.  If anyone has any objections, please let us know.

--

From: Mike Accetta [EMAIL PROTECTED]

If raid1/repair (which reads all blocks and fixes any differences
it finds) hits a read error, it doesn't reset the bio for writing
before writing correct data back, so the read error isn't fixed,
and the device probably gets a zero-length write which it might
complain about.

Signed-off-by: Neil Brown [EMAIL PROTECTED]
Signed-off-by: Chris Wright [EMAIL PROTECTED]
Signed-off-by: Greg Kroah-Hartman [EMAIL PROTECTED]
---

 drivers/md/raid1.c |   21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- linux-2.6.21.6.orig/drivers/md/raid1.c
+++ linux-2.6.21.6/drivers/md/raid1.c
@@ -1240,17 +1240,24 @@ static void sync_request_write(mddev_t *
 	}
 	r1_bio->read_disk = primary;
 	for (i=0; i<mddev->raid_disks; i++)
-		if (r1_bio->bios[i]->bi_end_io == end_sync_read &&
-		    test_bit(BIO_UPTODATE, &r1_bio->bios[i]->bi_flags)) {
+		if (r1_bio->bios[i]->bi_end_io == end_sync_read) {
 			int j;
 			int vcnt = r1_bio->sectors >> (PAGE_SHIFT- 9);
 			struct bio *pbio = r1_bio->bios[primary];
 			struct bio *sbio = r1_bio->bios[i];
-			for (j = vcnt; j-- ; )
-				if (memcmp(page_address(pbio->bi_io_vec[j].bv_page),
-					   page_address(sbio->bi_io_vec[j].bv_page),
-					   PAGE_SIZE))
-					break;
+
+			if (test_bit(BIO_UPTODATE, &sbio->bi_flags)) {
+				for (j = vcnt; j-- ; ) {
+					struct page *p, *s;
+					p = pbio->bi_io_vec[j].bv_page;
+					s = sbio->bi_io_vec[j].bv_page;
+					if (memcmp(page_address(p),
+						   page_address(s),
+						   PAGE_SIZE))
+						break;
+				}
+			} else
+				j = 0;
 			if (j >= 0)
 				mddev->resync_mismatches += r1_bio->sectors;
 			if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {

-- 