Re: Spare disk could not sleep / standby

2005-03-07 Thread Neil Brown
On Tuesday March 8, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  It is writes, but don't be scared.  It is just super-block updates.
  
  In 2.6, the superblock is marked 'clean' whenever there is a period of
  about 20ms of no write activity.  This increases the chance that a
  resync won't be needed after a crash.
  (unfortunately) the superblocks on the spares need to be updated too.
 
 Ack, one of the cool things that a linux md array can do that others
 can't is imho that the disks can spin down when inactive.  Granted,
 it's mostly for home users who want their desktop RAID to be quiet
 when it's not in use, and their basement multi-terabyte facility to
 use a minimum of power when idling, but anyway.
 
 Is there any particular reason to update the superblocks every 20
 msecs when they're already marked clean?


It doesn't (well, shouldn't and I don't think it does).
Before the first write, they are all marked 'active'.
Then after 20ms with no write, they are all marked 'clean'.
Then before the next write they are all marked 'active'.

As the event count needs to be updated every time the superblock is
modified, the event count will be updated for every active->clean or
clean->active transition.  All the drives in an array must have the
same value for the event count, so the spares need to be updated even
though they, themselves, aren't exactly 'active' or 'clean'.
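
For illustration, the per-device event count can be read back with mdadm
(the device name here is just an example):

   mdadm --examine /dev/sda1 | grep -i events

Every member of the array, spares included, should show the same value,
and it climbs each time the superblock flips between clean and active.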

NeilBrown


Re: Spare disk could not sleep / standby

2005-03-07 Thread Neil Brown
On Tuesday March 8, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  Then after 20ms with no write, they are all marked 'clean'.
  Then before the next write they are all marked 'active'.
  
  As the event count needs to be updated every time the superblock is
   modified, the event count will be updated for every active->clean or
   clean->active transition.
 
 So..  Sorry if I'm a bit slow here.. But what you're saying is:
 
 The kernel marks the partition clean when all writes have expired to disk.
 This change is propagated through MD, and when it is, it causes the
 event counter to rise, thus causing a write, thus marking the
 superblock active.  20 msecs later, the same scenario repeats itself.
 
 Is my perception of the situation correct?

No.  Writing the superblock does not cause the array to be marked
active.
If the array is idle, the individual drives will be idle.


 
 Seems like a design flaw to me, but then again, I'm biased towards
 hating this behaviour since I really like being able to put inactive
 RAIDs to sleep..

Hmmm... maybe I misunderstood your problem.  I thought you were just
talking about a spare not being idle when you thought it should be.
Are you saying that your whole array is idle, but still seeing writes?
That would have to be something non-md-specific I think.

NeilBrown



Re: Creating RAID1 with missing - mdadm 1.90

2005-03-05 Thread Neil Brown
On Saturday March 5, [EMAIL PROTECTED] wrote:
 What might the proper [or functional] syntax be to do this?
 
 I'm running 2.6.10-1.766-FC3, and mdadm 1.90.

It would help if you told us what you tried, as then we could possibly
give a more focussed answer.  However:


   mdadm --create /dev/md1 --level=raid1 --raid-devices=2 /dev/sda3 missing

might be the sort of thing you want.
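
For completeness (the second device name below is only an example), the
missing half can be attached later with:

   mdadm /dev/md1 --add /dev/sdb3

and md will then resync the mirror onto the new device.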

NeilBrown

 
 Thanks for the time.
 b-


Re: Raid-6 hang on write.

2005-03-01 Thread Neil Brown
On Tuesday March 1, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  
  Could you please confirm if there is a problem with
  2.6.11-rc4-bk4-bk10
  
  as reported, and whether it seems to be the same problem.
 
 Ok.. are we all ready? I had applied your development patches to all my
 vanilla 2.6.11-rc4-* kernels. Thus they all exhibited the same problem in
 the same way as -mm1. <Smacks forehead against wall repeatedly>

Thanks for following through with this so we know exactly where the
problem is ... and isn't.  And admitting your careless mistake in
public is a great example to all the rest of us who are too shy to do
so - thanks :-)

 
 Oh well, at least we now know about a bug in the -mm patches.
 

Yes, and very helpful to know it is.  Thanks again.

NeilBrown


Re: Joys of spare disks!

2005-03-01 Thread Neil Brown
On Wednesday March 2, [EMAIL PROTECTED] wrote:
 
 Is there any sound reason why this is not feasible? Is it just that 
 someone needs to write the code to implement it?

Exactly (just needs to be implemented).

NeilBrown



Re: Raid-6 hang on write.

2005-02-27 Thread Neil Brown
On Friday February 25, [EMAIL PROTECTED] wrote:
 
 Turning on debugging in raid6main.c and md.c makes it much harder to hit,
 so I'm assuming it is something timing related.
 
 raid6d -> md_check_recovery -> generic_make_request -> make_request ->
 get_active_stripe

Yes, there is a real problem here.  I'll see if I can figure out the best
way to remedy it...
However I think you reported this problem against a non -mm kernel,
and the path from md_check_recovery to generic_make_request only
exists in -mm.

Could you please confirm if there is a problem with
2.6.11-rc4-bk4-bk10

as reported, and whether it seems to be the same problem.

Thanks,
NeilBrown


Re: [PATCH md 0 of 9] Introduction

2005-02-20 Thread Neil Brown
On Friday February 18, [EMAIL PROTECTED] wrote:
 Would you recommend to apply this package 
 http://neilb.web.cse.unsw.edu.au/~neilb/patches/linux-devel/2.6/2005-02-18-00/patch-all-2005-02-18-00
 To a 2.6.10 kernel?

No.  I don't think it would apply.
That patch is mostly experimental stuff.  Only apply it if you want to
experiment with the bitmap resync code.

NeilBrown


Re: [PATCH md 9 of 9] Optimise reconstruction when re-adding a recently failed drive.

2005-02-17 Thread Neil Brown
On Thursday February 17, [EMAIL PROTECTED] wrote:
 
 NeilBrown wrote:
  When an array is degraded, bits in the intent-bitmap are
  never cleared. So if a recently failed drive is re-added, we only need
  to reconstruct the blocks that are still reflected in the
  bitmap.
  This patch adds support for this re-adding.
 
 Hi there -
 
 If I understand this correctly, this means that:
 
 1) if I had a raid1 mirror (for example) that has no writes to it since 
 a resync
 2) a drive fails out, and some writes occur
 3) when I re-add the drive, only the areas where the writes occurred 
 would be re-synced?
 
 I can think of a bunch of peripheral questions around this scenario, and 
 bad sectors / bad sector clearing, but I may not be understanding the 
 basic idea, so I wanted to ask first.

You seem to understand the basic idea.
I believe one of the motivators for this code (I didn't originate it)
is when a raid1 has one device locally and one device over a network
connection.

If the network connection breaks, that device has to be thrown
out. But when it comes back, we don't want to resync the whole array
over the network.  This functionality helps there (though there are a
few other things needed before that scenario can work smoothly).

You would only re-add a device if you thought it was OK.  i.e. if it
was a connection problem rather than a media problem, or if you had
resolved any media issues.
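
A rough sketch of the sequence, assuming a kernel and mdadm new enough to
support write-intent bitmaps (device names are made up):

   mdadm --grow /dev/md0 --bitmap=internal          # add a write-intent bitmap
   mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
   # ... array keeps running degraded, some writes happen ...
   mdadm /dev/md0 --re-add /dev/sdb1                # only bitmapped blocks resync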


NeilBrown


Re: 2.6.11-rc4 md loops on missing drives

2005-02-15 Thread Neil Brown
On Tuesday February 15, [EMAIL PROTECTED] wrote:
 G'day all,
 
 I'm not really sure how it's supposed to cope with losing more disks than 
 planned, but filling the 
 syslog with nastiness is not very polite.

Thanks for the bug report.  There are actually a few problems relating
to resync/recovery when an array (raid 5 or 6) has lost too many
devices.
This patch should fix them.

NeilBrown


Make raid5 and raid6 robust against failure during recovery.

Two problems are fixed here.
1/ if the array is known to require a resync (parity update),
  but there are too many failed devices,  the resync cannot complete
  but will be retried indefinitely.
2/ if the array has too many failed drives to be usable and a spare is
  available, reconstruction will be attempted, but cannot work.  This
  also is retried indefinitely.


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c|   12 ++--
 ./drivers/md/raid5.c |   13 +
 ./drivers/md/raid6main.c |   12 
 3 files changed, 31 insertions(+), 6 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~  2005-02-16 11:25:25.0 +1100
+++ ./drivers/md/md.c   2005-02-16 11:25:31.0 +1100
@@ -3655,18 +3655,18 @@ void md_check_recovery(mddev_t *mddev)
 
/* no recovery is running.
 * remove any failed drives, then
-* add spares if possible
+* add spares if possible.
+* Spare are also removed and re-added, to allow
+* the personality to fail the re-add.
 */
-		ITERATE_RDEV(mddev,rdev,rtmp) {
+		ITERATE_RDEV(mddev,rdev,rtmp)
			if (rdev->raid_disk >= 0 &&
-			    rdev->faulty &&
+			    (rdev->faulty || ! rdev->in_sync) &&
			    atomic_read(&rdev->nr_pending)==0) {
				if (mddev->pers->hot_remove_disk(mddev, rdev->raid_disk)==0)
					rdev->raid_disk = -1;
			}
-			if (!rdev->faulty && rdev->raid_disk >= 0 && !rdev->in_sync)
-				spares++;
-		}
+
		if (mddev->degraded) {
			ITERATE_RDEV(mddev,rdev,rtmp)
				if (rdev->raid_disk < 0

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~   2005-02-16 11:25:25.0 +1100
+++ ./drivers/md/raid5.c2005-02-16 11:25:31.0 +1100
@@ -1491,6 +1491,15 @@ static int sync_request (mddev_t *mddev,
unplug_slaves(mddev);
return 0;
}
+   /* if there is 1 or more failed drives and we are trying
+* to resync, then assert that we are finished, because there is
+* nothing we can do.
+*/
+	if (mddev->degraded >= 1 && test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+		int rv = (mddev->size << 1) - sector_nr;
+		md_done_sync(mddev, rv, 1);
+		return rv;
+	}
 
x = sector_nr;
chunk_offset = sector_div(x, sectors_per_chunk);
@@ -1882,6 +1891,10 @@ static int raid5_add_disk(mddev_t *mddev
int disk;
struct disk_info *p;
 
+	if (mddev->degraded > 1)
+		/* no point adding a device */
+		return 0;
+
/*
 * find the disk ...
 */

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~   2005-02-16 11:25:25.0 +1100
+++ ./drivers/md/raid6main.c2005-02-16 11:25:31.0 +1100
@@ -1650,6 +1650,15 @@ static int sync_request (mddev_t *mddev,
unplug_slaves(mddev);
return 0;
}
+   /* if there are 2 or more failed drives and we are trying
+* to resync, then assert that we are finished, because there is
+* nothing we can do.
+*/
+	if (mddev->degraded >= 2 && test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+		int rv = (mddev->size << 1) - sector_nr;
+		md_done_sync(mddev, rv, 1);
+		return rv;
+	}
 
x = sector_nr;
chunk_offset = sector_div(x, sectors_per_chunk);
@@ -2048,6 +2057,9 @@ static int raid6_add_disk(mddev_t *mddev
int disk;
struct disk_info *p;
 
+	if (mddev->degraded > 2)
+		/* no point adding a device */
+		return 0;
/*
 * find the disk ...
 */


Re: Problem with Openmosix

2005-02-14 Thread Neil Brown
On Monday February 14, [EMAIL PROTECTED] wrote:
 Hi, Neil...

Hi.

 
 I use MD driver two year ago with Debian, and run perfectly.

Great!

 
 The machine boots the new kernel and runs OK... but... if I (or another
 process) make a change/write to the md raid system, the computer crashes
 with the message:
 
   hdh: Drive not ready for command.
 
 (hdh is the mirror raid1 for hdf disk).

I cannot help thinking that maybe the Drive is not ready for the
command.  i.e. it isn't an md problem.  It isn't an openmosix
problem.  It is a drive hardware problem, or maybe an IDE controller
problem.   Can you try a different drive? Can you try just putting a
filesystem on that drive alone and see if it works?

NeilBrown


RE: [Bugme-new] [Bug 4211] New: md configuration destroys disk GPT label

2005-02-14 Thread Neil Brown
On Monday February 14, [EMAIL PROTECTED] wrote:
 Maybe I am confused, but if you use the whole disk, I would expect the whole
 disk could be over-written!  What am I missing?

I second that.

Once you do anything to a whole disk, whether make an md array out of
it, or mkfs it or anything else, you can kiss any partitioning
goodbye.

Maybe what you want to do is make an md array and then partition
that.
In 2.6 you can do that directly.  In 2.4 you would need to use LVM to
partition the array.
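
A rough 2.6 sketch with a recent mdadm (device names are only examples):

   mdadm --create /dev/md_d0 --auto=part --level=1 --raid-devices=2 /dev/sda /dev/sdb
   fdisk /dev/md_d0     # partitions then appear as /dev/md_d0p1, /dev/md_d0p2, ...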

NeilBrown


ANNOUNCE: mdadm 1.9.0 - A tool for managing Soft RAID under Linux

2005-02-03 Thread Neil Brown


I am pleased to announce the availability of 
   mdadm version 1.9.0
It is available at
   http://www.cse.unsw.edu.au/~neilb/source/mdadm/
and
   http://www.{countrycode}.kernel.org/pub/linux/utils/raid/mdadm/

as a source tar-ball and (at the first site) as an SRPM, and as an RPM for i386.

mdadm is a tool for creating, managing and monitoring
device arrays using the md driver in Linux, also
known as Software RAID arrays.

Release 1.9.0 adds:
-   Fix rpm build problem (stray %)
-   Minor manpage updates
-   Change "dirty" status to "active" as it was confusing people.
-   --assemble --auto recognises 'standard' name and insists on using
the appropriate major/minor number for them.
-   Remove underscore from partition names, so partitions of 
foo are foo1, foo2 etc (unchanged) and partitions of
f00 are f00p1, f00p2 etc rather than f00_p1...
-   Use major, minor, makedev macros instead of 
MAJOR, MINOR, MKDEV so that large device numbers work
on 2.6 (providing you have glibc 2.3.3 or later).
-   Add some missing closes of open file descriptors.
-   Reread /proc/partition for every array assembled when using
it to find devices, rather than only once.
-   Make mdadm -Ss stop stacked devices properly, by reversing the
order in which arrays are stopped.
-   Improve some error messages.
-   Allow device name to appear before first option, so e.g.
mdadm /dev/md0 -A /dev/sd[ab]
works.
-   Assume '-Q' if just a device is given, rather than being silent.

This is based on 1.8.0 and *not* on 1.8.1 which was meant to be a pre-release 
for the upcoming 2.0.0.  The next prerelease will have a more obvious name.

Development of mdadm is sponsored by [EMAIL PROTECTED]: 
  The School of Computer Science and Engineering
at
  The University of New South Wales

NeilBrown  04 February 2005



Re: ANNOUNCE: mdadm 1.9.0 - A tool for managing Soft RAID under Linux

2005-02-03 Thread Neil Brown
On Friday February 4, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
 
  Release 1.9.0 adds:
 ...
  -   --assemble --auto recognises 'standard' name and insists on using
  the appropriate major/minor number for them.
 
 Is this the problem I encountered when I added auto=md to my mdadm.conf 
 file?

Probably.

 
 It caused all sorts of problems - which were recoverable, fortunately.
 
 I ended up putting a '/sbin/MAKEDEV md' into /etc/rc.sysinit just before 
 the call to mdadm, but that creates all the md devices, not just those 
 that are needed.
 
 Will this new version allow me to remove this line in rc.sysinit again 
 and put the 'auto=md' back into mdadm.conf?

I think so, yes.  It is certainly worth a try and I would appreciate
success or failure reports.

NeilBrown


Re: Change preferred minor number of an md device?

2005-01-31 Thread Neil Brown
On Monday January 31, [EMAIL PROTECTED] wrote:
 Hi to all, md gurus!
 
 Is there a way to edit the preferred minor of a stopped device?

mdadm --assemble /dev/md0 --update=super-minor /dev/

will assemble the array and update the preferred minor to 0 (from
/dev/md0).

However this won't work for you as you already have a /dev/md0
running...

 
 Alternatively, is there a way to create a raid1 device specifying the 
 preferred minor number md0, but activating it provisionally as a different 
 minor, say md5? An md0 is already running, so mdadm --create /dev/md0 
 fails...
 
 I have to dump my /dev/md0 to a different disk (/dev/md5), but when I boot 
 from the new disk, I want the kernel to automatically detect the device 
 as /dev/md0.

If you are running 2.6, then you just need to assemble it as /dev/md0
once and that will automatically update the superblock.  You could do
this with kernel parameters of 
   raid=noautodetect md=0,/dev/firstdrive,/dev/seconddrive

NeilBrown


Re: /dev/md* Device Files

2005-01-26 Thread Neil Brown
On Wednesday January 26, [EMAIL PROTECTED] wrote:
  A useful trick I discovered yesterday: Add --auto to your mdadm commandline
  and it will create the device for you if it is missing :)
 
 
 Well, it seems that this machine is using the udev scheme for managing 
 device files. I didn't realize this as udev is new to me, but I probably 
 should have mentioned the kernel version (2.6.8) I was using. So I need to 
 research udev and how one causes devices to be created, etc.

Beware udev has an understanding of how device files are meant to
work which is quite different from how md actually works.

udev thinks that devices should appear in /dev after the device is
actually known to exist in the kernel.  md needs a device to exist in
/dev before the kernel can be told that it exists.

This is one of the reasons that --auto was added to mdadm - to bypass
udev.
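
For example (array and component names hypothetical):

   mdadm --assemble --auto=md /dev/md0 /dev/sda1 /dev/sdb1

creates the /dev/md0 node itself before telling the kernel to start the
array, so it doesn't matter whether udev has made the node yet.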

NeilBrown


RE: Software RAID 0+1 with mdadm.

2005-01-26 Thread Neil Brown
On Wednesday January 26, [EMAIL PROTECTED] wrote:
 This bug that's fixed in 1.9.0, is in a bug when you create the array?  ie
 do we need to use 1.9.0 to create the array.  I'm looking to do the same but
 my bootdisk currently only has 1.7.soemthing on it.  Do I need to make a
 custom bootcd with 1.9.0 on it?

This issue that will be fixed in 1.9.0 has nothing to do with creating
the array.

It is only relevant for stacked arrays (e.g. a raid0 made out of 2 or
more raid1 arrays), and only if you are using
   mdadm --assemble --scan
(or similar) to assemble your arrays, and you specify the devices to
scan in mdadm.conf as
   DEVICE partitions
(i.e. don't list actual devices, just say to get them from the list of
known partitions).

So, no: no need for a custom bootcd.

NeilBrown


Re: Software RAID 0+1 with mdadm.

2005-01-26 Thread Neil Brown
On Tuesday January 25, [EMAIL PROTECTED] wrote:
 Been trying for days to get a software RAID 0+1 setup. This is on SuSe
 9.2 with kernel 2.6.8-24.11-smp x86_64.
 
 I am trying to setup a RAID 0+1 with 4 250gb SATA drives. I do the
 following:
 
 mdadm --create /dev/md1 --level=0 --chunk=4 --raid-devices=2 /dev/sdb1
 /dev/sdc1
 mdadm --create /dev/md2 --level=0 --chunk=4 --raid-devices=2 /dev/sdd1
 /dev/sde1
 mdadm --create /dev/md0 --level=1 --chunk=4 --raid-devices=2 /dev/md1
 /dev/md2
 
 This all works fine and I can mkreiserfs /dev/md0 and mount it. If I am
 then to reboot /dev/md1 and /dev/md2 will show up in the /proc/mdstat
 but not /dev/md0. So I create a /etc/mdadm.conf like so to see if this
 will work:
 
 DEVICE partitions
 DEVICE /dev/md*
 ARRAY /dev/md2 level=raid0 num-devices=2
 UUID=5e6efe7d:6f5de80b:82ef7843:148cd518
devices=/dev/sdd1,/dev/sde1
 ARRAY /dev/md1 level=raid0 num-devices=2
 UUID=e81e74f9:1cf84f87:7747c1c9:b3f08a81
devices=/dev/sdb1,/dev/sdc1
 ARRAY /dev/md0 level=raid1 num-devices=2  devices=/dev/md2,/dev/md1
 
 
 Everything seems ok after boot. But again no /dev/md0 in /proc/mdstat.
 But then if I do a mdadm --assemble --scan it will then load
 /dev/md0. 

My guess is that you are (or SuSE is) relying on autodetect to
assemble the arrays.  Autodetect cannot assemble an array made of
other arrays.  Just an array made of partitions.

If you disable the autodetect stuff and make sure 
  mdadm --assemble --scan
is in a boot-script somewhere, it should just work.

Also, you don't really want the device=/dev/sdd1... entries in
mdadm.conf.
They tell mdadm to require the devices to have those names.  If you
add or remove scsi drives at all, the names can change.  Just rely on
the UUID.
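
A cleaned-up mdadm.conf along those lines might look like the following,
using the UUIDs quoted above; the UUID for /dev/md0 is not shown in the
original message and would come from mdadm --detail /dev/md0:

   DEVICE partitions
   DEVICE /dev/md*
   ARRAY /dev/md1 level=raid0 num-devices=2 UUID=e81e74f9:1cf84f87:7747c1c9:b3f08a81
   ARRAY /dev/md2 level=raid0 num-devices=2 UUID=5e6efe7d:6f5de80b:82ef7843:148cd518
   ARRAY /dev/md0 level=raid1 num-devices=2 UUID=...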

 
 Also do I need to create partitions? Or can I setup the whole drives as
 the array?

You don't need partitions.

 
 I have since upgraded to mdadm 1.8 and setup a RAID10. However I need
 something that is production worthy. Is a RAID10 something I could rely
 on as well? Also under a RAID10 how do you tell it which drives you want
 mirrored?

raid10 is 2.6 only, but should be quite stable.
You cannot tell it which drives to mirror because you shouldn't care.
You just give it a bunch of identical drives and let it put the data
where it wants.

If you really want to care (and I cannot imagine why you would - all
drives in a raid10 are likely to get similar load) then you have to
build it by hand - a raid0 of multiple raid1s.
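
For comparison, the two approaches would be created roughly like this
(device names are examples only):

   # native raid10 (2.6 only)
   mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[bcde]1

   # raid0 of raid1 pairs, built by hand
   mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
   mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd1 /dev/sde1
   mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/md1 /dev/md2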

NeilBrown


Re: migrating raid-1 to different drive geometry ?

2005-01-24 Thread Neil Brown

On Monday January 24, [EMAIL PROTECTED] wrote:
 how can the existing raid setup be moved to the new disks 
 without data loss ?
 
 I guess it must be something like this:
 
 1) physically remove first old drive
 2) physically add first new drive
 3) re-create partitions on new drive
 4) run raidhotadd for each partition
 5) wait until all partitions synced
 6) repeat with second drive

Sounds good.
 
 the big question is: since the drive geometry will definitely be different
 between old 60GB and new 80GB drive(s), how do the new partitions 
 have to be created on the new drive ?
 - do they have to have exactly the same amount of blocks ?
No.
 - may they be bigger ?
Yes (they cannot be smaller).

However making the partitions bigger will not make the arrays bigger.

If you are using a recent 2.6 kernel and mdadm 1.8.0, you can grow the
array with
   mdadm --grow /dev/mdX --size=max

You will then need to convince the filesystem in the array to make use
of the extra space.  Many filesystems do support such growth.  Some
even support on-line growth.
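
For example, with ext2/ext3 the sequence would look roughly like this
(unmount first unless your kernel and tools support online resizing):

   mdadm --grow /dev/md0 --size=max
   resize2fs /dev/md0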

NeilBrown


Re: migrating raid-1 to different drive geometry ?

2005-01-24 Thread Neil Brown
On Tuesday January 25, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  If you are using a recent 2.6 kernel and mdadm 1.8.0, you can grow the
  array with
 mdadm --grow /dev/mdX --size=max
 
 Neil,
 
 Is this just for RAID1? OR will it work for RAID5 too?

 --grow --size=max

should work for raid 1,5,6.

NeilBrown


Re: raid 01 vs 10

2001-07-09 Thread Neil Brown

On Monday July 9, [EMAIL PROTECTED] wrote:
 
 I was wondering what people thought of using raid 0+1 (a mirrored array
 of raid0 stripes) vs. raid 1+0 (a raid0 array of mirrored disks). It
 seems that either can sustain at least one drive failure and the
 performance should be similar. Are there strong reasons for using one
 over the other?

All other things being equal, raid 1+0 is usually better.
It can withstand a greater variety of 2 disc failures and the separate
arrays can rebuild in parallel after an unclean shutdown, thus
returning you to full redundancy more quickly.

But some times other things are not equal.
If you don't have uniform drive sizes, you might want to raid0
assorted drives together to create two similar sized sets to raid1.

I recall once someone suggesting that with certain cabling geometries
it was better to use 0+1 in cases of cable failure, but I cannot
remember, or work out, how that might have been.

NeilBrown



PATCH/RFC - partitioning of md devices

2001-07-01 Thread Neil Brown


Linus,
 I wonder if you would consider applying, or commenting on this patch.

 It adds support for partitioning md devices.  In particular, a new
 major device is created (name==mdp, number assigned dynamically)
 which provides for 15 partitions on each of the first 16 md devices.

 I understand that a more uniform approach to partitioning might get
 introduced in 2.5, but this seems the best approach for 2.4.

 This is particularly useful if you want to have a mirrored boot
 drive, rather than two drives with lots of mirrored partitions.

 It is also useful for supporting what I call winware raid, which is
 the raid-controller equivalent of winmodems - minimal hardware and
 most of the support done in software.

 Among the things that this patch does are:
 
  1/ tidy up some terminology.  Currently there is a one-to-one
  mapping between minor numbers and raid arrays or units, so the
  term minor is used when referring to either the real minor number or
  to a unit.
  This patch introduces the term unit to be used to identify which
  particular array is being referred to, and keeps minor just for
  when a minor device number is really implied.

  2/ When reporting the geometry of a partitioned raid1 array, the
  geometry of the underlying device is reported.  For all other arrays
  the 2x4xLARGE geometry is maintained.

  3/ The hardsectsize of partitions in a RAID5 array is set to the
PAGESIZE because raid5 doesn't cope well with receiving requests
with different blocksizes.

  4/ The new device reports a name of "md" (via hd_struct->major_name)
so partitions look like "mda3" or "md/disc0/part3", but registers the
name "mdp" so that /proc/devices shows the major number next to
"mdp".

  5/ device ioctls for re-reading the partition table and setting
 partition table information.



--- ./include/linux/raid/md.h   2001/07/01 22:59:38 1.1
+++ ./include/linux/raid/md.h   2001/07/01 22:59:47 1.2
@@ -61,8 +61,11 @@
 extern int md_size[MAX_MD_DEVS];
 extern struct hd_struct md_hd_struct[MAX_MD_DEVS];
 
-extern void add_mddev_mapping (mddev_t *mddev, kdev_t dev, void *data);
-extern void del_mddev_mapping (mddev_t *mddev, kdev_t dev);
+extern int mdp_size[MAX_MDP_DEVS<<MDP_MINOR_SHIFT];
+extern struct hd_struct mdp_hd_struct[MAX_MDP_DEVS<<MDP_MINOR_SHIFT];
+
+extern void add_mddev_mapping (mddev_t *mddev, int unit, void *data);
+extern void del_mddev_mapping (mddev_t *mddev, int unit);
 extern char * partition_name (kdev_t dev);
 extern int register_md_personality (int p_num, mdk_personality_t *p);
 extern int unregister_md_personality (int p_num);
--- ./include/linux/raid/md_k.h 2001/07/01 22:59:38 1.1
+++ ./include/linux/raid/md_k.h 2001/07/01 22:59:47 1.2
@@ -15,6 +15,7 @@
 #ifndef _MD_K_H
 #define _MD_K_H
 
+
 #define MD_RESERVED   0UL
 #define LINEAR1UL
 #define STRIPED   2UL
@@ -60,7 +61,10 @@
 #error MD doesnt handle bigger kdev yet
 #endif
 
+#define	MDP_MINOR_SHIFT	4
+
 #define MAX_MD_DEVS  (1<<MINORBITS)	/* Max number of md dev */
+#define MAX_MDP_DEVS  (1<<(MINORBITS-MDP_MINOR_SHIFT)) /* Max number of md dev */
 
 /*
  * Maps a kdev to an mddev/subdev. How 'data' is handled is up to
@@ -73,11 +77,17 @@
 
 extern dev_mapping_t mddev_map [MAX_MD_DEVS];
 
+extern int mdp_major;
 static inline mddev_t * kdev_to_mddev (kdev_t dev)
 {
-   if (MAJOR(dev) != MD_MAJOR)
+   int unit=0;
+   if (MAJOR(dev) == MD_MAJOR)
+   unit = MINOR(dev);
+   else if (MAJOR(dev) == mdp_major)
+		unit = MINOR(dev) >> MDP_MINOR_SHIFT;
+   else
BUG();
-return mddev_map[MINOR(dev)].mddev;
+   return mddev_map[unit].mddev;
 }
 
 /*
@@ -191,7 +201,7 @@
 {
void*private;
mdk_personality_t   *pers;
-   int __minor;
+   int __unit;
mdp_super_t *sb;
int nb_dev;
struct md_list_head disks;
@@ -248,13 +258,34 @@
  */
 static inline int mdidx (mddev_t * mddev)
 {
-	return mddev->__minor;
+	return mddev->__unit;
+}
+
+static inline int mdminor (mddev_t *mddev)
+{
+   return mdidx(mddev);
+}
+
+static inline int mdpminor (mddev_t *mddev)
+{
+	return mdidx(mddev) << MDP_MINOR_SHIFT;
+}
+
+static inline kdev_t md_kdev (mddev_t *mddev)
+{
+   return MKDEV(MD_MAJOR, mdminor(mddev));
 }
 
-static inline kdev_t mddev_to_kdev(mddev_t * mddev)
+static inline kdev_t mdp_kdev (mddev_t *mddev, int part)
 {
-   return MKDEV(MD_MAJOR, mdidx(mddev));
+   return MKDEV(mdp_major, mdpminor(mddev)+part);
 }
+
+#define foreach_part(tmp,mddev)\
+	if (mdidx(mddev) < MAX_MDP_DEVS)	\
+		for(tmp=mdpminor(mddev);	\
+		    tmp < mdpminor(mddev)+(1<<MDP_MINOR_SHIFT);	\
+   tmp++)
 
 extern 

Re: Failed disk triggers raid5.c bug?

2001-06-26 Thread Neil Brown

On Monday June 25, [EMAIL PROTECTED] wrote:
 Is there any way for the RAID code to be smarter when deciding 
 about those event counters? Does it have any chance (theoretically)
 to _know_ that it shouldn't use the drive with event count 28?

My current thinking is that once a raid array becomes unusable - in the
case of raid5, this means two failures - the array should immediately
be marked read-only, including the superblocks.   Then if you ever
manage to get enough drives together to form a working array, it will
start working again, and if not, it won't really matter whether the
superblock was updated or not.



 And even if that can't be done automatically, what about a small
 utility for the admin where he can give some advise to support 
 the RAID code on those decisions?
 Will mdctl have this functionality? That would be great!

mdctl --assemble will have a --force option to tell it to ignore
event numbers and assemble the array anyway.  This could result in
data corruption if you include an old disc, but would be able to get
you out of a tight spot.  Of course, once the above change goes into
the kernel it shouldn't be necessary.
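
(mdctl later shipped as mdadm; there the equivalent is along these lines,
with device names purely as an example:

   mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
)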

  
 Hm, does the RAID code disable a drive on _every_ error condition?
 Isn't there a distinction between, let's say, soft errors and hard
 errors?
 (I have to admit I don't know the internals of Linux device drivers
 enough to answer that question)
 Shouldn't the RAID code leave a drive which reports soft errors
 in the array and disable drives with hard errors only?

A Linux block device doesn't report soft errors. There is either
success or failure.  The driver for the disc drive should retry any
soft errors and only report an error up through the block-device layer
when it is definitely hard.

Arguably the RAID layer should catch read errors and try to get the
data from elsewhere and then re-write over the failed read, just
in case it was a single block error.
But a write error should always be fatal and fail the drive. I cannot
think of any other reasonable approach.

 
 In that case, the filesystem might have been corrupt, but the array
 would have been re-synced automatically, wouldn't it?

yes, and it would have if it hadn't collapsed in a heap while trying :-(

 
   But why did the filesystem ask for a block that was out of range?
   This is the part that I cannot fathom.  It would seem as though the
   filesystem got corrupt somehow.  Maybe an indirect block got replaced
   with garbage, and ext2fs believed the indirect block and went seeking
   way off the end of the array.  But I don't know how the corruption
   happened.
  
 Perhaps the read errors from the drive triggered that problem?

They shouldn't do, but seeing as I don't know where the corruption came
from, and I'm not even 100% confident that there was corruption, maybe
they could.

The closest I can come to a workable scenario is that maybe some
parity block had the wrong data.  Normally this wouldn't be noticed,
but when you have a failed drive you have to use the parity to
calculate the value of a missing block, and bad parity would make this
block bad.  But I cannot imaging who you would have a bad parity
block.  After any unclean shutdown the parity should be recalculated.

NeilBrown



Re: Mounting very old style raid on a recent machine?

2001-06-26 Thread Neil Brown

On Tuesday June 26, [EMAIL PROTECTED] wrote:
 Hi,
 
 I currently have to salvage data from an ancient box that looks like
 to have run kernel 2.0.35. However, the system on that disk is
 corrupted and won't boot any more (at least not on today's hardware).
 It looks like main data is on a RAID.
 
 /etc/mdtab:
 |/dev/md0 linear  /dev/hda3   /dev/hdb2
 
 Can I access that RAID from a current system running kernel 2.2 or
 2.4? Do I have to build a new 2.0 kernel? What type of raidtools do I
 need to activate that RAID?

You should be able to access this just fine from a 2.2 kernel using
raidtools-0.41 from
http://www.kernel.org/pub/linux/daemons/raid/

If you need to use 2.4, then you should still be able to access it
using raidtools 0.90 from
   http://www.kernel.org/pub/linux/daemons/raid/alpha/

in this case you would need an /etc/raidtab like
-
raiddev /dev/md0
raid-level linear
nr-raid-disks 2
persistent-superblock 0

device /dev/hda3
raid-disk 0
device /dev/hdb2
raid-disk 1
---

Note that this is pure theory.  I have never actually done it myself.

It should be quite safe to experiment.  You are unlikely to corrupt
anything if you don't do anything outrageously silly like telling it
that it is a raid1 or raid5 array.

Note: the persistent-superblock 0 is fairly important. These older
arrays did not have any raid-superblock on the device.  You want to
make sure you don't accidentally write one and so corrupt data.

I would go for a 2.2 kernel, raidtools  0.41 and the command:

 mdadd -r -pl /dev/md0 /dev/hda3 /dev/hdb2

NeilBrown


 
 Any hints will be appreciated.
 
 Greetings
 Marc
 
 -- 
 -
 Marc Haber | I don't trust Computers. They | Mailadresse im Header
 Karlsruhe, Germany |  lose things.Winona Ryder | Fon: *49 721 966 32 15
 Nordisch by Nature |  How to make an American Quilt | Fax: *49 721 966 31 29
 -



Re: PATCH - raid5 performance improvement - 3 of 3

2001-06-24 Thread Neil Brown

On Sunday June 24, [EMAIL PROTECTED] wrote:
 Hi,
 
 We used to (long ago, 2.2.x), whenever we got a write request for some
 buffer,
 search the buffer cache to see if additional buffers which belong to that
 particular stripe are dirty, and then schedule them for writing as well, in
 an
 attempt to write full stripes. That resulted in a huge sequential write
 performance
 improvement.
 
 If such an approach is still possible today, it is preferrable to delaying
 the writes

 It is not still possible, at least not usefully so.  In fact, it is
 also true that it is probably not preferable. 

 Since about 2.3.7, filesystem data has not, by-and-large, been stored
 in the buffer cache.  It is only stored in the page cache.  So were
 raid5 to go looking in the buffer cache it would be unlikely to find
 anything. 

 But there are other problems.  The cache snooping only works if the
 direct client of raid5 is a filesystem that stores data in the buffer
 cache.  If the filesystem is an indirect client, via LVM for example,
 or even via a RAID0 array, then raid5 would not be able to look in
 the right buffer cache, and so would find nothing.  This was the
 case in 2.2.  If you tried an LVM over RAID5 in 2.2, you wouldn't get
 good write speed.  You also would probably get data corruption while
 the array was re-syncing, but that is a separate issue.

 The current solution is much more robust.  It cares nothing about the
 way the raid5 array is used. 

 Also, while the handling of stripes is delayed, I don't believe that
 this would actually show as measurable increase in latency.  The
 effect is really to have requests spend more time on a higher level
 queue, and less time on a lower level queue.  The total time on
 queues should normally be the same or less (due to improved
 throughput) or only very slightly more in pathological cases.

NeilBrown



 for the partial buffer while hoping that the rest of the bufferes in the
 stripe would
 come as well, since it both eliminates the additional delay, and doesn't
 depend on the order in which the bufferes are flushed from the much bigger
 memory buffers to the smaller stripe cache.
 

I think the ideal solution would be to have the filesystem write data
in two stages, much like Unix apps can.
As soon as a buffer is dirtied (or more accurately, as soon as the
filesystem is happy for the data to be written), it is passed on with a
WRITE_AHEAD request.  The driver is free to do what it likes,
including ignore this.
Later, at a time corresponding to fsync or close maybe, or when
memory is tight, the filesystem can send the buffer down with a
WRITE request which says please write this *now*.

RAID5 could then gather all the write_ahead requests into a hash table
(not unlike the old buffer cache), and easily find full stripes for
writing.

But that is not going to happen in 2.4.

NeilBrown


 Cheers,
 
 Gadi



Re: Failed disk triggers raid5.c bug?

2001-06-24 Thread Neil Brown

On Sunday June 24, [EMAIL PROTECTED] wrote:
 Hi!
 
 Neil Brown wrote:
  On Thursday June 14, [EMAIL PROTECTED] wrote:
   
   Dear All
   
   I've just had a disk (sdc) fail in my raid5 array (sdb sdc sdd),
 
  Great!  A real live hardware failure!  It is always more satisfying to
  watch one of those than to have to simulate them all the time!!
  Unless of course they are fatal... not the case here it seems.
 
 Well, here comes a _real_ fatal one...

And a very detailed report it was, thanks.

I'm not sure that  you want to know this, but it looks like you might
have been able to recover your data though it is only a might.

 Jun 19 09:18:23 wien kernel: (read) scsi/host0/bus0/target0/lun0/part4's sb offset: 
16860096 [events: 0024]
 Jun 19 09:18:23 wien kernel: (read) scsi/host0/bus0/target1/lun0/part4's sb offset: 
16860096 [events: 0024]
 Jun 19 09:18:23 wien kernel: (read) scsi/host0/bus0/target2/lun0/part4's sb offset: 
16860096 [events: 0023]
 Jun 19 09:18:23 wien kernel: (read) scsi/host0/bus0/target3/lun0/part4's sb offset: 
16860096 [events: 0028]

The reason that this array couldn't restart was that the 4th drive had
the highest event count and it was alone in this.  It didn't even have
any valid data!!
Had you unplugged this drive and booted, it would have tried to
assemble an array out of the first two (event count 24).  This might
have worked (though it might not, read on).

Alternately, you could have created a raidtab which said that the
third drive was failed, and then run mkraid...
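
A rough sketch of such a raidtab, with made-up device names and parameters
rather than the real ones from this report:

   raiddev /dev/md0
           raid-level              5
           nr-raid-disks           3
           persistent-superblock   1
           parity-algorithm        left-symmetric
           chunk-size              32
           device                  /dev/sda4
           raid-disk               0
           device                  /dev/sdb4
           raid-disk               1
           device                  /dev/sdc4
           failed-disk             2

mkraid would then write fresh superblocks on the good drives and leave the
slot marked failed-disk untouched, giving a degraded but readable array.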

mdctl, when it is finished, should be able to make this all much
easier.

But what went wrong?  I don't know the whole story but:

- On the first error, the drive was disabled and reconstruction was
  started. 
- On the second error, the reconstruction was inappropriately
  interrupted.  This is an error that I will have to fix in 2.4.
  However it isn't really a fatal error.
- Things were then going fairly OK, though noisy, until:

 Jun 19 09:10:07 wien kernel: attempt to access beyond end of device
 Jun 19 09:10:07 wien kernel: 08:04: rw=0, want=1788379804, limit=16860217
 Jun 19 09:10:07 wien kernel: dev 09:00 blksize=1024 blocknr=447094950 
sector=-718207696 size=4096 count=1

 For some reason, it tried to access well beyond the end of one of the
 underlying drives.  This caused that drive to fail.  This relates to
 the subsequent message:

 Jun 19 09:10:07 wien kernel: raid5: restarting stripe 3576759600

 which strongly suggests that the filesystem actually asked the raid5
 array for a block that was well out of range.
 In 2.4, this will be caught before the request gets to raid5.  In 2.2
 it isn't.  The request goes on to raid5, raid5 blindly passes a bad
 request down to the disc.  The disc reports an error, and raid5
 thinks the disc has failed, rather than realise that it never should
 have made such a silly request.

 But why did the filesystem ask for a block that was out of range?
 This is the part that I cannot fathom.  It would seem as though the
 filesystem got corrupt somehow.  Maybe an indirect block got replaced
 with garbage, and ext2fs believed the indirect block and went seeking
 way off the end of the array.  But I don't know how the corruption
 happened.

 Had you known enough to restart the array from the two apparently
 working drives, and then run fsck, it might have fixed things enough
 to keep going.  Or it might not, depending on how much corruption
 there was.

 So, Summary of problems:
  1/ md responds to a failure on a known-failed drive inappropriately. 
This shouldn't be fatal but needs fixing.
  2/ md isn't thoughtful enough about updating the event counter on
 superblocks and can easily leave an array in an unbuildable
 state.  This needs to be fixed.  It's on my list...
  3/ raid5 responds to a request for an out-of-bounds device address
by passing on out-of-bounds device addresses to the drives, and then
thinking that those drives are failed.
This is fixed in 2.4
  4/ Something caused some sort of filesystem corruption.  I don't
 know what.


NeilBrown



PATCH - md initialisation to accept devfs names

2001-06-20 Thread Neil Brown


Linus, 
  it is possible to start an md array from the boot command line with,
  e.g.
 md=0,/dev/something,/dev/somethingelse

  However only names recognised by name_to_kdev_t work here.  devfs
  based names do not work.
  To fix this, the follow patch moves the name lookup from __setup
  time to __init time so that the devfs routines can be called.

  This patch is largely due to Dave Cinege, though I have made a few
  improvements (particularly removing the devices array from
  md_setup_args). 

  The #ifdef MODULE that this patch removes is wholly within another
  #ifdef MODULE and so is totally pointless.

NeilBrown



--- ./drivers/md/md.c   2001/06/21 00:51:42 1.2
+++ ./drivers/md/md.c   2001/06/21 00:53:09 1.3
@@ -3638,7 +3638,7 @@
char device_set [MAX_MD_DEVS];
int pers[MAX_MD_DEVS];
int chunk[MAX_MD_DEVS];
-   kdev_t devices[MAX_MD_DEVS][MD_SB_DISKS];
+   char *device_names[MAX_MD_DEVS];
 } md_setup_args md__initdata;
 
 /*
@@ -3657,14 +3657,15 @@
  * md=n,device-list  reads a RAID superblock from the devices
  * elements in device-list are read by name_to_kdev_t so can be
  * a hex number or something like /dev/hda1 /dev/sdb
+ * 2001-06-03: Dave Cinege [EMAIL PROTECTED]
+ * Shifted name_to_kdev_t() and related operations to md_set_drive()
+ * for later execution. Rewrote section to make devfs compatible.
  */
-#ifndef MODULE
-extern kdev_t name_to_kdev_t(char *line) md__init;
 static int md__init md_setup(char *str)
 {
-   int minor, level, factor, fault, i=0;
-   kdev_t device;
-	char *devnames, *pername = "";
+	int minor, level, factor, fault;
+	char *pername = "";
+   char *str1 = str;
 
	if (get_option(&str, &minor) != 2) {	/* MD Number */
		printk("md: Too few arguments supplied to md=.\n");
@@ -3673,9 +3674,8 @@
	if (minor >= MAX_MD_DEVS) {
		printk ("md: Minor device number too high.\n");
		return 0;
-	} else if (md_setup_args.device_set[minor]) {
-		printk ("md: Warning - md=%d,... has been specified twice;\n"
-			"will discard the first definition.\n", minor);
+	} else if (md_setup_args.device_names[minor]) {
+		printk ("md: md=%d, Specified more then once. Replacing previous definition.\n", minor);
	}
	switch (get_option(&str, &level)) {	/* RAID Personality */
case 2: /* could be 0 or -1.. */
@@ -3706,53 +3706,72 @@
}
/* FALL THROUGH */
case 1: /* the first device is numeric */
-   md_setup_args.devices[minor][i++] = level;
+   str = str1;
/* FALL THROUGH */
case 0:
md_setup_args.pers[minor] = 0;
		pername="super-block";
}
-   devnames = str;
-	for (; i<MD_SB_DISKS && str; i++) {
-		if ((device = name_to_kdev_t(str))) {
-			md_setup_args.devices[minor][i] = device;
-		} else {
-			printk ("md: Unknown device name, %s.\n", str);
-   return 0;
-   }
-   if ((str = strchr(str, ',')) != NULL)
-   str++;
-   }
-   if (!i) {
-		printk ("md: No devices specified for md%d?\n", minor);
-   return 0;
-   }
-
+   
	printk ("md: Will configure md%d (%s) from %s, below.\n",
-		minor, pername, devnames);
-   md_setup_args.devices[minor][i] = (kdev_t) 0;
-   md_setup_args.device_set[minor] = 1;
+   minor, pername, str);
+   md_setup_args.device_names[minor] = str;
+   
return 1;
 }
-#endif /* !MODULE */
 
+extern kdev_t name_to_kdev_t(char *line) md__init;
 void md__init md_setup_drive(void)
 {
int minor, i;
kdev_t dev;
mddev_t*mddev;
+   kdev_t devices[MD_SB_DISKS+1];
 
	for (minor = 0; minor < MAX_MD_DEVS; minor++) {
+   int err = 0;
+   char *devname;
mdu_disk_info_t dinfo;
 
-   int err = 0;
-   if (!md_setup_args.device_set[minor])
+   if ((devname = md_setup_args.device_names[minor]) == 0) continue;
+   
+		for (i = 0; i < MD_SB_DISKS && devname != 0; i++) {
+
+   char *p;
+   void *handle;
+   
+   if ((p = strchr(devname, ',')) != NULL)
+   *p++ = 0;
+
+   dev = name_to_kdev_t(devname);
+			handle = devfs_find_handle(NULL, devname, MAJOR (dev), MINOR (dev),
+						DEVFS_SPECIAL_BLK, 1);
+   if (handle != 0) {
+   unsigned major, minor;
+				devfs_get_maj_min(handle, &major, &minor);
+   

PATCH - tag all printk's in md.c

2001-06-20 Thread Neil Brown


Linus, 
 This patch makes sure that all the printks in md.c print a message
 starting with "md:" or "md%d:".
 The next step (not today) will be to reduce a lot of them to
 KERN_INFO or similar as md is really quite noisy.

 Also, two printks in raid1.c get prefixed with "raid1: ".  
 This patch is partly due to  Dave Cinege.

 While preparing this I noticed that write_disk_sb sometimes returns
 1 for error, sometimes -1, and the return val is added into a
 cumulative error variable.  So now it always returns 1.

 Also md_update_sb reports on each superblock (one per disk) on
 separate lines, but worries about inserting commas and ending with a
 full stop.   I have removed the commas and full stop - vestiges of
 shorter device names I suspect.

NeilBrown


--- ./drivers/md/md.c   2001/06/21 00:53:09 1.3
+++ ./drivers/md/md.c   2001/06/21 00:53:39 1.4
@@ -634,7 +634,7 @@
md_list_add(rdev-same_set, mddev-disks);
rdev-mddev = mddev;
mddev-nb_dev++;
-	printk("bind<%s>,%d\n", partition_name(rdev->dev), mddev->nb_dev);
+	printk("md: bind<%s>,%d\n", partition_name(rdev->dev), mddev->nb_dev);
 }
 
 static void unbind_rdev_from_array (mdk_rdev_t * rdev)
@@ -646,7 +646,7 @@
md_list_del(rdev-same_set);
MD_INIT_LIST_HEAD(rdev-same_set);
rdev-mddev-nb_dev--;
-	printk("unbind<%s>,%d\n", partition_name(rdev->dev),
+	printk("md: unbind<%s>,%d\n", partition_name(rdev->dev),
 		rdev->mddev->nb_dev);
rdev-mddev = NULL;
 }
@@ -686,7 +686,7 @@
 
 static void export_rdev (mdk_rdev_t * rdev)
 {
-	printk("export_rdev(%s)\n",partition_name(rdev->dev));
+	printk("md: export_rdev(%s)\n",partition_name(rdev->dev));
if (rdev-mddev)
MD_BUG();
unlock_rdev(rdev);
@@ -694,7 +694,7 @@
md_list_del(rdev-all);
MD_INIT_LIST_HEAD(rdev-all);
	if (rdev->pending.next != &rdev->pending) {
-		printk("(%s was pending)\n",partition_name(rdev->dev));
+		printk("md: (%s was pending)\n",partition_name(rdev->dev));
		md_list_del(&rdev->pending);
		MD_INIT_LIST_HEAD(&rdev->pending);
}
@@ -777,14 +777,14 @@
 {
int i;
 
-	printk("  SB: (V:%d.%d.%d) ID:<%08x.%08x.%08x.%08x> CT:%08x\n",
+	printk("md:  SB: (V:%d.%d.%d) ID:<%08x.%08x.%08x.%08x> CT:%08x\n",
		sb->major_version, sb->minor_version, sb->patch_version,
		sb->set_uuid0, sb->set_uuid1, sb->set_uuid2, sb->set_uuid3,
		sb->ctime);
-	printk(" L%d S%08d ND:%d RD:%d md%d LO:%d CS:%d\n", sb->level,
+	printk("md: L%d S%08d ND:%d RD:%d md%d LO:%d CS:%d\n", sb->level,
		sb->size, sb->nr_disks, sb->raid_disks, sb->md_minor,
		sb->layout, sb->chunk_size);
-	printk(" UT:%08x ST:%d AD:%d WD:%d FD:%d SD:%d CSUM:%08x E:%08lx\n",
+	printk("md: UT:%08x ST:%d AD:%d WD:%d FD:%d SD:%d CSUM:%08x E:%08lx\n",
		sb->utime, sb->state, sb->active_disks, sb->working_disks,
		sb->failed_disks, sb->spare_disks,
		sb->sb_csum, (unsigned long)sb->events_lo);
@@ -793,24 +793,24 @@
mdp_disk_t *desc;
 
		desc = sb->disks + i;
-		printk(" D %2d: ", i);
+		printk("md: D %2d: ", i);
		print_desc(desc);
	}
-	printk(" THIS: ");
+	printk("md: THIS: ");
	print_desc(&sb->this_disk);
 
 }
 
 static void print_rdev(mdk_rdev_t *rdev)
 {
-	printk(" rdev %s: O:%s, SZ:%08ld F:%d DN:%d ",
+	printk("md: rdev %s: O:%s, SZ:%08ld F:%d DN:%d ",
		partition_name(rdev->dev), partition_name(rdev->old_dev),
		rdev->size, rdev->faulty, rdev->desc_nr);
	if (rdev->sb) {
-		printk("rdev superblock:\n");
+		printk("md: rdev superblock:\n");
		print_sb(rdev->sb);
	} else
-		printk("no rdev superblock!\n");
+		printk("md: no rdev superblock!\n");
 }
 
 void md_print_devices (void)
@@ -820,9 +820,9 @@
mddev_t *mddev;
 
	printk("\n");
-	printk("**\n");
-	printk("* <COMPLETE RAID STATE PRINTOUT> *\n");
-	printk("**\n");
+	printk("md: **\n");
+	printk("md: * <COMPLETE RAID STATE PRINTOUT> *\n");
+	printk("md: **\n");
	ITERATE_MDDEV(mddev,tmp) {
		printk("md%d: ", mdidx(mddev));
 
@@ -838,7 +838,7 @@
	ITERATE_RDEV(mddev,rdev,tmp2)
		print_rdev(rdev);
-	printk("**\n");
+	printk("md: **\n");
	printk("\n");
printk(\n);
 }
 
@@ -917,15 +917,15 @@
 
	if (!rdev->sb) {
		MD_BUG();
-		return -1;
+		return 1;
	}
	if (rdev->faulty) {
  

PATCH - raid5 performance improvement - 3 of 3

2001-06-20 Thread Neil Brown



Linus, and fellow RAIDers,

 This is the third in my three patch series for improving RAID5
 throughput.
 This one substantially lifts write throughput by leveraging the
 opportunities for write gathering provided by the first patch.

 With RAID5, it is much more efficient to write a whole stripe full of
 data at a time as this avoids the need to pre-read any old data or
 parity from the discs.

 Without this patch, when a write request arrives, raid5 will
 immediately start a couple of pre-reads so that it will be able to
 write that block and update the parity.
 By the time that the old data and parity arrive it is quite possible
 that write requests for all the other blocks in the stripe will have
 been submitted, and the old data and parity will not be needed.

 This patch uses concepts similar to queue plugging to delay write
 requests slightly to improve the chance that many or even all of the
 data blocks in a stripe will have outstanding write requests before
 processing is started.

 To do this it maintains a queue of stripes that seem to require
 pre-reading.
 Stripes are only released  from this queue when there are no other
 pre-read requests active, and then only if the raid5 device is not
 currently plugged.

 As I mentioned earlier, my testing shows substantial improvements
 from these three patches for both sequential (bonnie) and random
 (dbench) access patterns.  I would be particularly interested if
 anyone else does any different testing, preferably comparing
 2.2.19+patches with 2.4.5 and then with 2.4.5 plus these patches. 

 I know of one problem area being sequential writes to a 3 disc
 array.  If anyone can find any other access patterns that still
 perform below 2.2.19 levels, I would really like to know about them.

NeilBrown



--- ./include/linux/raid/raid5.h2001/06/21 01:01:46 1.3
+++ ./include/linux/raid/raid5.h2001/06/21 01:04:05 1.4
@@ -158,6 +158,32 @@
 #define STRIPE_HANDLE  2
 #defineSTRIPE_SYNCING  3
 #defineSTRIPE_INSYNC   4
+#defineSTRIPE_PREREAD_ACTIVE   5
+#defineSTRIPE_DELAYED  6
+
+/*
+ * Plugging:
+ *
+ * To improve write throughput, we need to delay the handling of some
+ * stripes until there has been a chance that several write requests
+ * for the one stripe have all been collected.
+ * In particular, any write request that would require pre-reading
+ * is put on a delayed queue until there are no stripes currently
+ * in a pre-read phase.  Further, if the delayed queue is empty when
+ * a stripe is put on it then we plug the queue and do not process it
+ * until an unplug call is made. (the tq_disk list is run).
+ *
+ * When preread is initiated on a stripe, we set PREREAD_ACTIVE and add
+ * it to the count of prereading stripes.
+ * When write is initiated, or the stripe refcnt == 0 (just in case) we
+ * clear the PREREAD_ACTIVE flag and decrement the count
+ * Whenever the delayed queue is empty and the device is not plugged, we
+ * move any strips from delayed to handle and clear the DELAYED flag and set PREREAD_ACTIVE.
+ * In stripe_handle, if we find pre-reading is necessary, we do it if
+ * PREREAD_ACTIVE is set, else we set DELAYED which will send it to the delayed queue.
+ * HANDLE gets cleared if stripe_handle leave nothing locked.
+ */
+ 
 
 struct disk_info {
kdev_t  dev;
@@ -182,6 +208,8 @@
int max_nr_stripes;
 
	struct list_head	handle_list; /* stripes needing handling */
+	struct list_head	delayed_list; /* stripes that have plugged requests */
+	atomic_t		preread_active_stripes; /* stripes with scheduled io */
/*
 * Free stripes pool
 */
@@ -192,6 +220,9 @@
 * waiting for 25% to be free
 */
md_spinlock_t   device_lock;
+
+	int			plugged;
+	struct tq_struct	plug_tq;
 };
 
 typedef struct raid5_private_data raid5_conf_t;
--- ./drivers/md/raid5.c2001/06/21 01:01:46 1.3
+++ ./drivers/md/raid5.c2001/06/21 01:04:05 1.4
@@ -31,6 +31,7 @@
  */
 
 #define NR_STRIPES 256
+#defineIO_THRESHOLD1
 #define HASH_PAGES 1
 #define HASH_PAGES_ORDER   0
#define NR_HASH			(HASH_PAGES * PAGE_SIZE / sizeof(struct stripe_head *))
@@ -65,11 +66,17 @@
		BUG();
	if (atomic_read(&conf->active_stripes)==0)
		BUG();
-	if (test_bit(STRIPE_HANDLE, &sh->state)) {
+	if (test_bit(STRIPE_DELAYED, &sh->state))
+		list_add_tail(&sh->lru, &conf->delayed_list);
+	else if (test_bit(STRIPE_HANDLE, &sh->state)) {
		list_add_tail(&sh->lru, &conf->handle_list);

PATCH

2001-06-20 Thread Neil Brown


Linus,
  There is a buggy BUG in the raid5 code.
 If a request on an underlying device reports an error, raid5 finds out
 which device that was and marks it as failed.  This is fine.
 If another request on the same device reports an error, raid5 fails
 to find that device in its table (because though  it is there, it is
 not operational), and so it thinks something is wrong and calls
 MD_BUG() - which is very noisy, though not actually harmful (except
 to the confidence of the sysadmin)
 

 This patch changes the test so that a failure on a drive that is
 known but not-operational will be expected and not a BUG.

NeilBrown

--- ./drivers/md/raid5.c2001/06/21 01:04:05 1.4
+++ ./drivers/md/raid5.c2001/06/21 01:04:41 1.5
@@ -486,22 +486,24 @@
PRINTK(raid5_error called\n);
conf-resync_parity = 0;
for (i = 0, disk = conf-disks; i  conf-raid_disks; i++, disk++) {
-   if (disk-dev == dev  disk-operational) {
-   disk-operational = 0;
-   mark_disk_faulty(sb-disks+disk-number);
-   mark_disk_nonsync(sb-disks+disk-number);
-   mark_disk_inactive(sb-disks+disk-number);
-   sb-active_disks--;
-   sb-working_disks--;
-   sb-failed_disks++;
-   mddev-sb_dirty = 1;
-   conf-working_disks--;
-   conf-failed_disks++;
-   md_wakeup_thread(conf-thread);
-   printk (KERN_ALERT
-   raid5: Disk failure on %s, disabling device.
-Operation continuing on %d devices\n,
-   partition_name (dev), conf-working_disks);
+   if (disk-dev == dev) {
+   if (disk-operational) {
+   disk-operational = 0;
+   mark_disk_faulty(sb-disks+disk-number);
+   mark_disk_nonsync(sb-disks+disk-number);
+   mark_disk_inactive(sb-disks+disk-number);
+   sb-active_disks--;
+   sb-working_disks--;
+   sb-failed_disks++;
+   mddev-sb_dirty = 1;
+   conf-working_disks--;
+   conf-failed_disks++;
+   md_wakeup_thread(conf-thread);
+   printk (KERN_ALERT
+   raid5: Disk failure on %s, disabling device.
+Operation continuing on %d devices\n,
+   partition_name (dev), conf-working_disks);
+   }
return 0;
}
}
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: du discrepancies?

2001-06-14 Thread Neil Brown

On Friday June 15, [EMAIL PROTECTED] wrote:
 There appears to be a discrepancy between the true state of affairs on my 
 RAID partitions and what df reports;
 
 [root /]# sfdisk -l /dev/hda
  
 Disk /dev/hda: 38792 cylinders, 16 heads, 63 sectors/track
 Units = cylinders of 516096 bytes, blocks of 1024 bytes, counting from 0
  
Device Boot Start End   #cyls   #blocks   Id  System
 /dev/hda1  0+   15231524-   768095+  fd  Linux raid autodetect
 /dev/hda2   15241845 3221622885  Extended
 /dev/hda3   18462252 407205128   fd  Linux raid autodetect
 /dev/hda4   2253   38791   36539  18415656   fd  Linux raid autodetect
 /dev/hda5   1524+   1584  61-30743+  83  Linux
 /dev/hda6   1585+   1845 261-   131543+  82  Linux swap
 
 [root /]# df
 Filesystem   1k-blocks  Used Available Use% Mounted on
 /dev/md1755920666748 50772  93% /
 WRONG
 /dev/md3198313 13405174656   7% /var 
 WRONG
 /dev/md4  18126088118288  17087024   1% /home WRONG
 
 These figures are clearly wrong. Can anyone suggest where I should start 
 looking for an explanation?

How can figures be wrong?  They are just figures.
What do you think is wrong about them??

Anyway, for a more useful response...
I assume that md[134] are RAID1 arrays, with one mirror on hda.
Let's take md1, made in part from hda1.

hda1 has 768095 1K blocks.
md/raid rounds down to a multiple of 64K, and then removes the last
64k for the raid super block, leaving
  768000 1K blocks.
ext2fs uses some of this for metadata, and reports the rest as the
available space.
The overhead space comprises the superblocks, the block group
descriptors, the inode bitmaps, the block bitmaps, and the inode
tables.
This seems to add up to 12080K on this filesystem, about 1.6%.
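
Spelling that arithmetic out with your numbers (a worked example only -
the exact ext2 overhead depends on the filesystem parameters):

   768095 K   size of hda1 (from sfdisk)
   768064 K   rounded down to a multiple of 64K
   768000 K   after removing the last 64K for the RAID superblock
   755920 K   what df reports for /dev/md1
   -------
    12080 K   ext2 metadata overhead, about 1.6%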


NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: mdctl - names and code

2001-06-13 Thread Neil Brown


Thank you for all the suggestions for names for mdctl.
We have

 raidctl raidctrl
 swraidctl
 mdtools mdutils 
 mdmanage mdmgr mdmd:-) mdcfg mdconfig mdadmin

Mike Black suggested that it is valuable for tools that are related to
start with a common prefix so that command completion can be used
to find them.
I think that is very true but, in this case, irrelevant.
mdctl (or whatever it gets called) will be one tool that does
everything.

I might arrange - for back compatibility - that if you call it with a
name like 
   raidhotadd
it will default to the hot-add functionality, but I don't expect that
to be normal usage.

I have previously said that I am not very fond of raidctl as raid is
a bit too generic.  swraidctl is better but harder to pronounce.

I actually rather like md.  It has a pleasing similarity to mt.
Also
   man 1 md   -- would document the user interface
   man 5 md   -- would document the device driver.
This is elegant.  But maybe a bit too terse.

I'm currently leaning towards mdadm or mdadmin as it is easy to
pronounce (for my palate anyway) and has the right sort of meaning.

I think I will keep that name at mdctl until I achieve all the
functionality I want, and then when I release v1.0 I will change the
name to what seems best at the time.

Thanks again for the suggestions and interest.

For the eager, 
  http://www.cse.unsw.edu.au/~neilb/source/mdctl/mdctl-v0.3.tgz
contains my latest source which has most of the functionality in
place, though it doesn't all work quite right yet.
You can create a raid5 array with:

  mdctl --create /dev/md0 --level raid5 --chunk 32 --raid-disks 3 \
/dev/sda /dev/sdb /dev/sdc

and stop it with
  mdctl --stop /dev/md0

and then assemble it with

  mdctl --assemble /dev/md0 /dev/sda /dev/sdb /dev/sdc

I wouldn't actually trust a raid array that you build this way
though.  Some fields in the super block aren't right yet.

I am very interested in comments on the interface and usability.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: mdctl

2001-06-11 Thread Neil Brown

On Friday June 8, [EMAIL PROTECTED] wrote:
 On Fri, 8 Jun 2001, Neil Brown wrote:
 
  If you don't like the name mdctl (I don't), please suggest another.
 
 How about raidctrl?

Possible... though I don't think it is much better.  Longer to type too:-)

I kind of like having the md in there as it is the md driver.
raid is a generic term, and mdctl doesn't work with all raid
(i.e. not firmware raid), only software raid, and in particular, only
the md driver.

But thanks for the suggestion, I will keep it in mind and see if it
grows on me.

NeilBrown


 
 -- 
 MfG / Regards
 Friedrich Lobenstock
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: failure of raid 5 when first disk is unavailable

2001-06-07 Thread Neil Brown

On Thursday June 7, [EMAIL PROTECTED] wrote:
 Hi Neil;
 
 I am hoping you are going to tell me this is already solved,
 but here goes...

Almost :-)

 
 scenario:
   hda4, hdb4, and hdc4 in a raid 5 with no hotspare.
 
 
 With 2.4.3 XFS kernels, it seems that a raid 5 does not come
 up correctly if the first disk is unavailable.  The error message
 that arises in the syslog from the md driver is:
 
 md: could not lock hda4, zero size?  marking faulty
 md: could not import hda4.
 md: autostart hda4 failed!
 

Yep. This happens when you use raidstart.
It doesn't happen if you set the partition type to LINUX_RAID and use
the autodetect functionality.
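
For example (assuming your sfdisk has --change-id; fdisk's "t" command
and typing "fd" does the same job):

   sfdisk --change-id /dev/hda 4 fd    # mark hda4 as "Linux raid autodetect"

and similarly for hdb4 and hdc4.  A kernel with autodetection enabled
will then assemble the array at boot without raidstart.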

raidstart just takes one drive, gives it to the kernel, and says "look
in the superblock for the major/minor of the other devices".

This has several failure modes.

It is partly for this reason that I am writing a replacement md
management tool - mdctl.

I wasn't going to announce it just yet because it is very incomplete,
but you have pushed me into it :-)

 http://www.cse.unsw.edu.au/~neilb/source/mdctl/mdctl-v0.2.tgz

is a current snapshot.  It compiles (for me) and
   mdctl --help

works. mdctl --Assemble is nearly there.

Comments welcome.

In 2.2, there is no other way to start an array than give one device
to the kernel and tell it to look for others.  So mdctl will
find the device numbers of the devices in the array and re-write the
super block if necessary to make the array start.

In 2.4, mdctl can use a 
  SET_ARRAY_INFO / ADD_NEW_DISK* / RUN_ARRAY
sequence to start a new array.
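
In outline the user-space side looks something like this (only a sketch,
untested, and the md_u.h type names are from memory - this is not mdctl
source):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <linux/raid/md_u.h>

int start_array(const char *mdname, char **devices, int ndevices)
{
	int fd = open(mdname, O_RDWR);
	int i;

	if (fd < 0 || ioctl(fd, SET_ARRAY_INFO, 0UL) < 0)
		return -1;		/* no md device, or kernel too old */

	for (i = 0; i < ndevices; i++) {
		struct stat st;
		mdu_disk_info_t info;

		if (stat(devices[i], &st) < 0)
			return -1;
		memset(&info, 0, sizeof(info));
		info.major = major(st.st_rdev);
		info.minor = minor(st.st_rdev);
		/* the kernel reads the superblock off each device */
		if (ioctl(fd, ADD_NEW_DISK, &info) < 0)
			return -1;
	}
	return ioctl(fd, RUN_ARRAY, 0UL);	/* assemble and start */
}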

If you don't like the name mdctl (I don't), please suggest another.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: mdrecoveryd invalid operand error

2001-06-06 Thread Neil Brown

On Wednesday June 6, [EMAIL PROTECTED] wrote:
 In the XFS kernel tree v2.4.3 w/ several patches,
 we were unable to raidhotremove and subsequently
 raidhotadd a spare without a reboot.  It did not
 matter if you had a new or the same hard disk.  We then
 tried the patch Ingo Molnar sent regarding the issue.
 
 http://www.mail-archive.com/linux-raid@vger.kernel.org/msg00551.html
 
 This solved the problem of not doing a reboot and trying
 to switch a hotspare and faulty drive.
 
 In addition, however, we are seeing a kernel panic using
 raid 5 when mdrecoveryd starts when after doing the hotspare
 and faulty drive swap a second time without a reboot.
...
 
 Do you have any suggestions Neil?

Yep.  Upgrade to the latest kernel! (Don't you just hate it when that
turns out to be the answer).

Ingo's patch is half right, but not quite there.

 http://www.cse.unsw.edu.au/~neilb/patches/linux/2.4.5-pre2/patch-A-rdevfix

contains a correct version of the patch.  It is in 2.4.5.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: md= problems with devfs names

2001-06-02 Thread Neil Brown

On Saturday June 2, [EMAIL PROTECTED] wrote:
 
 I've moved from:
   md=4,/dev/sdf5,/dev/sdg5
 to:
   md=4,/dev/scsi/host0/bus0/target30/lun0/part5,\
   /dev/scsi/host0/bus0/target32/lun0/part5
 
 And now get:
   md: Unknown device name,\
   /dev/scsi/host0/bus0/target30/lun0/part5,\
   /dev/scsi/host0/bus0/target32/lun0/part5.
 
: (
 
 md_setup() is displaying the error due to failing on name_to_kdev_t().
 root_dev_setup() calls name_to_kdev_t() with a long devfs name without a
 problem, so that's not the issue directly.

Yes... this is all very ugly.
root_dev_setup also stores the device name in root_device_name.
And then when actually mounting root in fs/super.c::mount_root,
devfs_find_handle is called to map that name into a devfs object.

So maybe md_setup should store names as well, and md_setup_drive
should call devfs_find_handle like mount_root does.

But probably sticking with non-devfs names is easier.
Was there a particular need to change to devfs naming?

NeilBrown

 
 I think md_setup() is being run before the devfs names are fully registered,
 but i have no clue how the execution order
 of __setup() items is determined.
 
 Help?
 
 Dave
 
 
 md_setup() is run VERY early, much earlier than raid_setup().
 dmesg excerpt:
 
 mapped APIC to e000 (fee0)
 mapped IOAPIC to d000 (fec0)
 Kernel command line: devfs=mount raid=noautodetect 
 root=/dev/scsi/host0/bus0/target2/lun0/part7
 
md=4,/dev/scsi/host0/bus0/target30/lun0/part5,/dev/scsi/host0/bus0/target32/lun0/part5
 mem=393216K
 md: Unknown device name,
 /dev/scsi/host0/bus0/target30/lun0/part5,/dev/scsi/host0/bus0/target32/lun0/part5.
 Initializing CPU#0
 Detected 875.429 MHz processor.
 Console: colour VGA+ 80x25
 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



[PATCH] raid1 to use sector numbers in b_blocknr

2001-05-23 Thread Neil Brown


Linus,
  raid1 allocates a new buffer_head when passing a request done
  to an underlying device.
  It currently sets b_blocknr to b_rsector/(b_size>>9) from the
  original buffer_head to parallel other uses of b_blocknr (i.e. it
  being the number of the block).

  However, if raid1 gets a non-aligned request, then the calculation of
  b_blocknr would lose information resulting in potential data
  corruption if the request were resubmitted to a different drive on
  failure. 

  Non-aligned requests aren't currently possible (I believe) but newer
  filesystems are likely to want them soon, and if a raid1 array were
  to be partitioned into partitions that were not page aligned, it
  could happen.

  This patch changes the usage of b_blocknr in raid1.c to store the
  value of b_rsector of the incoming request.

  Also, I remove the third argument to raid1_map which is never used.

NeilBrown

 
--- ./drivers/md/raid1.c2001/05/23 01:18:15 1.1
+++ ./drivers/md/raid1.c2001/05/23 01:18:19 1.2
@@ -298,7 +298,7 @@
md_spin_unlock_irq(conf-device_lock);
 }
 
-static int raid1_map (mddev_t *mddev, kdev_t *rdev, unsigned long size)
+static int raid1_map (mddev_t *mddev, kdev_t *rdev)
 {
raid1_conf_t *conf = mddev_to_conf(mddev);
int i, disks = MD_SB_DISKS;
@@ -602,7 +602,7 @@
 
bh_req = r1_bh-bh_req;
memcpy(bh_req, bh, sizeof(*bh));
-   bh_req-b_blocknr = bh-b_rsector / sectors;
+   bh_req-b_blocknr = bh-b_rsector;
bh_req-b_dev = mirror-dev;
bh_req-b_rdev = mirror-dev;
/*  bh_req-b_rsector = bh-n_rsector; */
@@ -646,7 +646,7 @@
/*
 * prepare mirrored mbh (fields ordered for max mem throughput):
 */
-   mbh-b_blocknr= bh-b_rsector / sectors;
+   mbh-b_blocknr= bh-b_rsector;
mbh-b_dev= conf-mirrors[i].dev;
mbh-b_rdev   = conf-mirrors[i].dev;
mbh-b_rsector= bh-b_rsector;
@@ -1138,7 +1138,6 @@
int disks = MD_SB_DISKS;
struct buffer_head *bhl, *mbh;
raid1_conf_t *conf;
-   int sectors = bh-b_size  9;

conf = mddev_to_conf(mddev);
bhl = raid1_alloc_bh(conf, conf-raid_disks); /* don't 
really need this many */
@@ -1168,7 +1167,7 @@
mbh-b_blocknr= bh-b_blocknr;
mbh-b_dev= conf-mirrors[i].dev;
mbh-b_rdev   = conf-mirrors[i].dev;
-   mbh-b_rsector= bh-b_blocknr * sectors;
+   mbh-b_rsector= bh-b_blocknr;
mbh-b_state  = (1BH_Req) | 
(1BH_Dirty) |
(1BH_Mapped) | (1BH_Lock);
atomic_set(mbh-b_count, 1);
@@ -1195,7 +1194,7 @@
}
} else {
dev = bh-b_dev;
-   raid1_map (mddev, bh-b_dev, bh-b_size  9);
+   raid1_map (mddev, bh-b_dev);
if (bh-b_dev == dev) {
printk (IO_ERROR, partition_name(bh-b_dev), 
bh-b_blocknr);
md_done_sync(mddev, bh-b_size9, 0);
@@ -1203,6 +1202,7 @@
printk (REDIRECT_SECTOR,
partition_name(bh-b_dev), 
bh-b_blocknr);
bh-b_rdev = bh-b_dev;
+   bh-b_rsector = bh-b_blocknr; 
generic_make_request(READ, bh);
}
}
@@ -1211,8 +1211,7 @@
case READ:
case READA:
dev = bh-b_dev;
-   
-   raid1_map (mddev, bh-b_dev, bh-b_size  9);
+   raid1_map (mddev, bh-b_dev);
if (bh-b_dev == dev) {
printk (IO_ERROR, partition_name(bh-b_dev), 
bh-b_blocknr);
raid1_end_bh_io(r1_bh, 0);
@@ -1220,6 +1219,7 @@
printk (REDIRECT_SECTOR,
partition_name(bh-b_dev), bh-b_blocknr);
bh-b_rdev = bh-b_dev;
+   bh-b_rsector = bh-b_blocknr;
generic_make_request (r1_bh-cmd, bh);
}
break;
@@ -1313,6 +1313,7 @@
struct 

[PATCH] raid resync by sectors to allow for 512byte block filesystems

2001-05-17 Thread Neil Brown


Linus,
 The current raid1/raid5 resync code requests resync in units of 1k
 (though the raid personality can round up requests if it likes).
 This interacts badly with filesystems that do IO in 512 byte blocks,
 such as XFS (because raid5 needs to use the same blocksize for IO and
 resync).

 The attached patch changes the resync code to work in units of
 sectors which makes more sense and plays nicely with XFS.

NeilBrown



--- ./drivers/md/md.c   2001/05/17 05:50:51 1.1
+++ ./drivers/md/md.c   2001/05/17 06:11:50 1.2
@@ -2997,7 +2997,7 @@
int sz = 0;
unsigned long max_blocks, resync, res, dt, db, rt;
 
-   resync = mddev-curr_resync - atomic_read(mddev-recovery_active);
+   resync = (mddev-curr_resync - atomic_read(mddev-recovery_active))/2;
max_blocks = mddev-sb-size;
 
/*
@@ -3042,7 +3042,7 @@
 */
dt = ((jiffies - mddev-resync_mark) / HZ);
if (!dt) dt++;
-   db = resync - mddev-resync_mark_cnt;
+   db = resync - (mddev-resync_mark_cnt/2);
rt = (dt * ((max_blocks-resync) / (db/100+1)))/100;

sz += sprintf(page + sz,  finish=%lu.%lumin, rt / 60, (rt % 60)/6);
@@ -3217,7 +3217,7 @@
 
 void md_done_sync(mddev_t *mddev, int blocks, int ok)
 {
-   /* another blocks (1K) blocks have been synced */
+   /* another blocks (512byte) blocks have been synced */
atomic_sub(blocks, mddev-recovery_active);
wake_up(mddev-recovery_wait);
if (!ok) {
@@ -3230,7 +3230,7 @@
 int md_do_sync(mddev_t *mddev, mdp_disk_t *spare)
 {
mddev_t *mddev2;
-   unsigned int max_blocks, currspeed,
+   unsigned int max_sectors, currspeed,
j, window, err, serialize;
kdev_t read_disk = mddev_to_kdev(mddev);
unsigned long mark[SYNC_MARKS];
@@ -3267,7 +3267,7 @@
 
mddev-curr_resync = 1;
 
-   max_blocks = mddev-sb-size;
+   max_sectors = mddev-sb-size1;
 
printk(KERN_INFO md: syncing RAID array md%d\n, mdidx(mddev));
printk(KERN_INFO md: minimum _guaranteed_ reconstruction speed: %d 
KB/sec/disc.\n,
@@ -3291,23 +3291,23 @@
/*
 * Tune reconstruction:
 */
-   window = MAX_READAHEAD*(PAGE_SIZE/1024);
-   printk(KERN_INFO md: using %dk window, over a total of %d 
blocks.\n,window,max_blocks);
+   window = MAX_READAHEAD*(PAGE_SIZE/512);
+   printk(KERN_INFO md: using %dk window, over a total of %d 
+blocks.\n,window/2,max_sectors/2);
 
atomic_set(mddev-recovery_active, 0);
init_waitqueue_head(mddev-recovery_wait);
last_check = 0;
-   for (j = 0; j  max_blocks;) {
-   int blocks;
+   for (j = 0; j  max_sectors;) {
+   int sectors;
 
-   blocks = mddev-pers-sync_request(mddev, j);
+   sectors = mddev-pers-sync_request(mddev, j);
 
-   if (blocks  0) {
-   err = blocks;
+   if (sectors  0) {
+   err = sectors;
goto out;
}
-   atomic_add(blocks, mddev-recovery_active);
-   j += blocks;
+   atomic_add(sectors, mddev-recovery_active);
+   j += sectors;
mddev-curr_resync = j;
 
if (last_check + window  j)
@@ -3325,7 +3325,7 @@
mark_cnt[next] = j - atomic_read(mddev-recovery_active);
last_mark = next;
}
-   
+
 
if (md_signal_pending(current)) {
/*
@@ -3350,7 +3350,7 @@
if (md_need_resched(current))
schedule();
 
-   currspeed = 
(j-mddev-resync_mark_cnt)/((jiffies-mddev-resync_mark)/HZ +1) +1;
+   currspeed = 
+(j-mddev-resync_mark_cnt)/2/((jiffies-mddev-resync_mark)/HZ +1) +1;
 
if (currspeed  sysctl_speed_limit_min) {
current-nice = 19;
--- ./drivers/md/raid5.c2001/05/17 05:50:51 1.1
+++ ./drivers/md/raid5.c2001/05/17 06:11:51 1.2
@@ -886,7 +886,7 @@
}
}
if (syncing) {
-   md_done_sync(conf-mddev, (sh-size10) - sh-sync_redone,0);
+   md_done_sync(conf-mddev, (sh-size9) - sh-sync_redone,0);
clear_bit(STRIPE_SYNCING, sh-state);
syncing = 0;
}   
@@ -1059,7 +1059,7 @@
}
}
if (syncing  locked == 0  test_bit(STRIPE_INSYNC, sh-state)) {
-   md_done_sync(conf-mddev, (sh-size10) - sh-sync_redone,1);
+   md_done_sync(conf-mddev, (sh-size9) - sh-sync_redone,1);
clear_bit(STRIPE_SYNCING, sh-state);
}

@@ -1153,13 +1153,13 @@
return correct_size;
 }
 
-static int raid5_sync_request (mddev_t *mddev, unsigned long block_nr)
+static int 

[PATCH] RAID5 NULL Checking Bug Fix

2001-05-15 Thread Neil Brown

On Wednesday May 16, [EMAIL PROTECTED] wrote:
 
 (more patches to come.  They will go to Linus, Alan, and linux-raid only).

This is the next one, which actually addresses the NULL Checking
Bug.

There are two places in the raid code which allocate memory
without (properly) checking for failure and which are fixed in ac,
both in raid5.c.  One in grow_buffers and one in __check_consistency.

The one in grow_buffers is definitely right and included below in a
slightly different form - fewer characters.
The one in __check_consistency is best fixed by simply removing
__check_consistency.

__check_consistency reads stripes at some offset into the array and
checks that parity is correct.  This is called once, but the result is
ignored.
Presumably this is a hangover from times gone by when the superblock
didn't have proper versioning and there was no auto-rebuild process.
It is now irrelevant and can go.
There is similar code in raid1.c which should also go.
This patch removes said code.

NeilBrown



--- ./drivers/md/raid5.c2001/05/16 05:14:39 1.2
+++ ./drivers/md/raid5.c2001/05/16 05:27:20 1.3
@@ -156,9 +156,9 @@
return 1;
memset(bh, 0, sizeof (struct buffer_head));
init_waitqueue_head(bh-b_wait);
-   page = alloc_page(priority);
-   bh-b_data = page_address(page);
-   if (!bh-b_data) {
+   if ((page = alloc_page(priority)))
+   bh-b_data = page_address(page);
+   else {
kfree(bh);
return 1;
}
@@ -1256,76 +1256,6 @@
printk(raid5: resync finished.\n);
 }
 
-static int __check_consistency (mddev_t *mddev, int row)
-{
-   raid5_conf_t *conf = mddev-private;
-   kdev_t dev;
-   struct buffer_head *bh[MD_SB_DISKS], *tmp = NULL;
-   int i, ret = 0, nr = 0, count;
-   struct buffer_head *bh_ptr[MAX_XOR_BLOCKS];
-
-   if (conf-working_disks != conf-raid_disks)
-   goto out;
-   tmp = kmalloc(sizeof(*tmp), GFP_KERNEL);
-   tmp-b_size = 4096;
-   tmp-b_page = alloc_page(GFP_KERNEL);
-   tmp-b_data = page_address(tmp-b_page);
-   if (!tmp-b_data)
-   goto out;
-   md_clear_page(tmp-b_data);
-   memset(bh, 0, MD_SB_DISKS * sizeof(struct buffer_head *));
-   for (i = 0; i  conf-raid_disks; i++) {
-   dev = conf-disks[i].dev;
-   set_blocksize(dev, 4096);
-   bh[i] = bread(dev, row / 4, 4096);
-   if (!bh[i])
-   break;
-   nr++;
-   }
-   if (nr == conf-raid_disks) {
-   bh_ptr[0] = tmp;
-   count = 1;
-   for (i = 1; i  nr; i++) {
-   bh_ptr[count++] = bh[i];
-   if (count == MAX_XOR_BLOCKS) {
-   xor_block(count, bh_ptr[0]);
-   count = 1;
-   }
-   }
-   if (count != 1) {
-   xor_block(count, bh_ptr[0]);
-   }
-   if (memcmp(tmp-b_data, bh[0]-b_data, 4096))
-   ret = 1;
-   }
-   for (i = 0; i  conf-raid_disks; i++) {
-   dev = conf-disks[i].dev;
-   if (bh[i]) {
-   bforget(bh[i]);
-   bh[i] = NULL;
-   }
-   fsync_dev(dev);
-   invalidate_buffers(dev);
-   }
-   free_page((unsigned long) tmp-b_data);
-out:
-   if (tmp)
-   kfree(tmp);
-   return ret;
-}
-
-static int check_consistency (mddev_t *mddev)
-{
-   if (__check_consistency(mddev, 0))
-/*
- * We are not checking this currently, as it's legitimate to have
- * an inconsistent array, at creation time.
- */
-   return 0;
-
-   return 0;
-}
-
 static int raid5_run (mddev_t *mddev)
 {
raid5_conf_t *conf;
@@ -1483,12 +1413,6 @@
if (conf-working_disks != sb-raid_disks) {
printk(KERN_ALERT raid5: md%d, not all disks are operational -- 
trying to recover array\n, mdidx(mddev));
start_recovery = 1;
-   }
-
-   if (!start_recovery  (sb-state  (1  MD_SB_CLEAN)) 
-   check_consistency(mddev)) {
-   printk(KERN_ERR raid5: detected raid-5 superblock xor inconsistency 
-- running resync\n);
-   sb-state = ~(1  MD_SB_CLEAN);
}
 
{
--- ./drivers/md/raid1.c2001/05/16 05:14:39 1.2
+++ ./drivers/md/raid1.c2001/05/16 05:27:20 1.3
@@ -1448,69 +1448,6 @@
}
 }
 
-/*
- * This will catch the scenario in which one of the mirrors was
- * mounted as a normal device rather than as a part of a raid set.
- *
- * check_consistency is very personality-dependent, eg. RAID5 cannot
- * do this check, it uses another method.
- */
-static int __check_consistency 

[PATCH] - md_error gets simpler

2001-05-15 Thread Neil Brown


Linus,
 This isn't a bug fix, just a tidy up.

 Currently, md_error - which is called when an underlying device detects
 an error - takes a kdev_t to identify which md array is affected.
 It converts this into a mddev_t structure pointer, and in every case,
 the caller already has the desired structure pointer.

 This patch changes md_error and the callers to pass an mddev_t
 instead of a kdev_t

NeilBrown

--- ./include/linux/raid/md.h   2001/05/16 06:08:41 1.1
+++ ./include/linux/raid/md.h   2001/05/16 06:10:02 1.2
@@ -80,7 +80,7 @@
 extern struct gendisk * find_gendisk (kdev_t dev);
 extern int md_notify_reboot(struct notifier_block *this,
unsigned long code, void *x);
-extern int md_error (kdev_t mddev, kdev_t rdev);
+extern int md_error (mddev_t *mddev, kdev_t rdev);
 extern int md_run_setup(void);
 
 extern void md_print_devices (void);
--- ./drivers/md/raid5.c2001/05/16 05:27:20 1.3
+++ ./drivers/md/raid5.c2001/05/16 06:10:02 1.4
@@ -412,7 +412,7 @@
spin_lock_irqsave(conf-device_lock, flags);
}
} else {
-   md_error(mddev_to_kdev(conf-mddev), bh-b_dev);
+   md_error(conf-mddev, bh-b_dev);
clear_bit(BH_Uptodate, bh-b_state);
}
clear_bit(BH_Lock, bh-b_state);
@@ -440,7 +440,7 @@
 
md_spin_lock_irqsave(conf-device_lock, flags);
if (!uptodate)
-   md_error(mddev_to_kdev(conf-mddev), bh-b_dev);
+   md_error(conf-mddev, bh-b_dev);
clear_bit(BH_Lock, bh-b_state);
set_bit(STRIPE_HANDLE, sh-state);
__release_stripe(conf, sh);
--- ./drivers/md/md.c   2001/05/16 06:08:41 1.1
+++ ./drivers/md/md.c   2001/05/16 06:10:03 1.2
@@ -2464,7 +2464,7 @@
int ret;
 
fsync_dev(mddev_to_kdev(mddev));
-   ret = md_error(mddev_to_kdev(mddev), dev);
+   ret = md_error(mddev, dev);
return ret;
 }
 
@@ -2938,13 +2938,11 @@
 }
 
 
-int md_error (kdev_t dev, kdev_t rdev)
+int md_error (mddev_t *mddev, kdev_t rdev)
 {
-   mddev_t *mddev;
mdk_rdev_t * rrdev;
int rc;
 
-   mddev = kdev_to_mddev(dev);
 /* printk(md_error dev:(%d:%d), rdev:(%d:%d), (caller: 
%p,%p,%p,%p).\n,MAJOR(dev),MINOR(dev),MAJOR(rdev),MINOR(rdev), 
__builtin_return_address(0),__builtin_return_address(1),__builtin_return_address(2),__builtin_return_address(3));
  */
if (!mddev) {
--- ./drivers/md/raid1.c2001/05/16 05:27:20 1.3
+++ ./drivers/md/raid1.c2001/05/16 06:10:03 1.4
@@ -388,7 +388,7 @@
 * this branch is our 'one mirror IO has finished' event handler:
 */
if (!uptodate)
-   md_error (mddev_to_kdev(r1_bh-mddev), bh-b_dev);
+   md_error (r1_bh-mddev, bh-b_dev);
else
/*
 * Set R1BH_Uptodate in our master buffer_head, so that
@@ -1426,7 +1426,7 @@
 * We don't do much here, just schedule handling by raid1d
 */
if (!uptodate)
-   md_error (mddev_to_kdev(r1_bh-mddev), bh-b_dev);
+   md_error (r1_bh-mddev, bh-b_dev);
else
set_bit(R1BH_Uptodate, r1_bh-state);
raid1_reschedule_retry(r1_bh);
@@ -1437,7 +1437,7 @@
struct raid1_bh * r1_bh = (struct raid1_bh *)(bh-b_private);

if (!uptodate)
-   md_error (mddev_to_kdev(r1_bh-mddev), bh-b_dev);
+   md_error (r1_bh-mddev, bh-b_dev);
if (atomic_dec_and_test(r1_bh-remaining)) {
mddev_t *mddev = r1_bh-mddev;
unsigned long sect = bh-b_blocknr * (bh-b_size9);
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: spare-disk in a RAID1 set? Conflicting answers...

2001-04-29 Thread Neil Brown

On Saturday April 28, [EMAIL PROTECTED] wrote:
   Question: can you have one or more spare-disk entries in /etc/raidtab when 
 running a RAID1 set?
 
   First answer: the Linux Software-RAID HOWTO says yes, and gives an example 
 of this in the section on RAID1 config in raidtab.
 
   Second answer: the manpage for raidtab says no, stating that spare-disk is 
 only valid for RAID4 and RAID5.
 
   Hm.. Which is it?

You can certainly have spare discs in raid1 arrays.

NeilBrown

 
   I'm running Mandrake 8.0, which is a 2.4.3 kernel.  I haven't tried to 
 actually use a spare-disk entry yet, because I'm still waiting for the 
 third disk for my RAID1 set to get here, but I thought I'd ask to see if 
 anybody knows for sure.  If not, I'll experiment with it once my third disk 
 gets here and report back.
 
   Thanks!
 
 - Al
 
 
 ---
 | voice: 503.247.9256
   Lots of folks confuse bad management  | email: [EMAIL PROTECTED]
  with destiny.  | cell: 503.709.0028
 | email to my cell:
  - Kin Hubbard  |  [EMAIL PROTECTED]
 ---
 
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: Strange performance results in RAID5

2001-03-28 Thread Neil Brown

On Thursday March 29, [EMAIL PROTECTED] wrote:
 Hi,
 
 I have been doing some performance checks on my RAID 5 system.

Good.

 
 The system is
 
 5 Seagate Cheetahs X15
 Linux 2.4.2
 
 I am using IOtest 3.0 on /dev/md0
 My chunk size is 1M...
 
 When I do random reads of 64K blobs using one process, I get 100 
 reads/sec, which is the same as doing random reads on one disk. So I was 
 quite happy with that.
 
 My next test was to do random reads using ten processes, I expected 500 
 reads/sec, however, I only got 250 reads/sec.
 
 This to me doesn't seem right??? Does anyone know why this is the
 case?

A few possibilities:

   1/ you didn't say how fast your SCSI bus is.  I guess if it is
   reasonably new it would be at least 80MB/sec which should allow
   500 * 64K/s, but it wouldn't have to be too old to not allow that,
   and I don't like to assume things that aren't stated.

   2/ You could be being slowed down by the stripe cache - it only
   allows 256 concurrent 4k accesses.  Try increasing NR_STRIPES at the
   top of drivers/md/raid5.c - say to 2048.  See if that makes a
   difference.

   3/ Also, try applying

 http://www.cse.unsw.edu.au/~neilb/patches/linux/2.4.3-pre6/patch-F-raid5readbypass

   This patch speeds up large sequential reads, at a possible small cost
   to random read-modify-writes (I haven't measured any problems, but
   haven't had the time to explore the performance thoroughly).
   What it does is read directly into the filesystems buffer instead
   of into the stripe cache and then memcpy into the filesys buffer.

   4/ I'm assuming you are doing direct IO to /dev/md0.
   Try making and mounting a filesystem on /dev/md0 first. This will
   switch the device blocksize to 4K (if you have a 4k block size
   filesystem).  The larger block size improves performance
   substantially.   I always do I/O tests to a filesystem, not to the
   block device, because it makes a difference and it is a filesystem
   that I want to use (though I realise that you may not).
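
   For that last point, something like:

	mke2fs -b 4096 /dev/md0
	mount /dev/md0 /mnt/test
	# then run the IO test against a large file under /mnt/test

   (mke2fs shown just as an example of a 4k-block-size filesystem).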

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Error Injector?

2001-03-21 Thread Neil Brown

On Wednesday March 21, [EMAIL PROTECTED] wrote:
 
 My question is based upon prior experience working for Stratus Computer.  At
 Stratus it was impractical to go beat the disk drives with a hammer to cause
 them to fail - rather we would simply use a utility to cause the disk driver
 to begin to get "errors" from the drives.  This would then exercise the 
 recovery mechanism - taking a drive off line and bringing another up to
 take its place.  This facility is also present in Veritas Volume Manager test
 suites to exercise the code.

raidsetfaulty 
should do what you want.  It is part of the latest raidtools-0.90.
If you don't have it, get the source from
www.kernel.org/pub/linux/daemons/...
(it might be devel rather than daemons, I'm not sure).
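
Usage is along these lines (device names are only examples):

   raidsetfaulty /dev/md0 /dev/sdc1    # pretend sdc1 has failed
   raidhotremove /dev/md0 /dev/sdc1
   raidhotadd    /dev/md0 /dev/sdc1    # bring it (or a spare) back in

which exercises the md failure handling and rebuild without the hammer.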

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: disk fails in raid5 but not in raid0

2001-03-19 Thread Neil Brown

On Monday March 19, [EMAIL PROTECTED] wrote:
 Hi,
 
 I have a RAID setup, 3 Compaq 4Gb drives running of an Adaptec2940UW. 
 Kernel 2.2.18 with RAID-patches etc.
 
 I have been trying out various options, doing some stress-testing etc.,
 and I have now arrived at the following situation that I cannot explain:
 
 when running the 3 drives in a RAID5 config, one of the drives (always the
 same one) will always fail in during heavy IO or during a resync phase. It
 appears to produce one IO error (judging from messages in the log), upon
 which it is promptly removed from the array.
 I can then hotremove the failing drive, then hotadd it - and resync starts, 
 and quite often completes. This scenario is consistently repeatable.

During the initial resync phase, the data blocks are read and the
parity blocks are written -- across all drives.
During a rebuild-after-failure, data and parity are read from the
"good" drives and data-or-parity is written to the spare drive.

This could lead to different patterns of concurrent access.  In
particular, during the resync that you say often completes, the
questionable drive is only being written to.  During the resync that
usually fails, the questionable drive is often being read concurrently
with other drives.

 
 So, it would seem that this one drive has a hardware problem. So I ran badblocks
 with write-check on it, couple of times - came out 100% clean.
 I then built a RAID0 array instead - and started driving lots of IO on it - 
 it's still running - not a problem. Filled up the array, still no probs.
 
 So, except when the drive is in a RAID5 config, it seems ok. 

Well, raid5 would do about 30% more IO when writing.  It certainly
sounds odd, but it could be some combinatorial thing..

 
 Any suggestions ? I would like to confirm whether or whether not the
 drive has a problem. 

Try re-arranging the drives on the scsi chain.  If the questionable
one is currently furthest from the host-adapter, make it closest.  See
if that has any effect.
It could well be cabling, or terminators or something.  Or it could be
the drive.

NeilBrown

 
 thanks,
 Per Jessen
 
 
 
 
 
 regards,
 Per Jessen
 
 
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Problem migrating RAID1 from 2.2.x to 2.4.2

2001-03-19 Thread Neil Brown

On Monday March 19, [EMAIL PROTECTED] wrote:
 I'm having trouble running a RAID1 root/boot mirror under 2.4.2.  Works
 fine on 2.2.14 though.
 
 I'm running RH 6.2 with stock 2.2.14 kernel.  Running RAID1 on a pair of
 9.1 UW SCSI Barracudas as root/boot/lilo.  md0 is / and md1 is 256M swap,
 also a 2 drive mirror. I built the RAID1 at install time using the Redhat GUI.
 
 This configuration works flawlessly.  However, I've recently compiled the
 2.4.2 kernel, with no module support; RAID1 static.  When 2.4.2 boots, I
 get an "Kernel panic: VFS: Unable to mount root fs on 09:00".
 
 Here's the RAID driver output when booting 2.4.2:
 
 autodetecting RAID arrays
 (read) sda5's sb offset: 8618752 [events: 0022]
 (read) sdb5's sb offset: 8618752 [events: 0022]
 autorun ...
 considering sdb5 ...
   adding sdb5 ...
   adding sda5 ...
 created md0
 bindsda5,1
 bindsdb5,2
 running: sdb5sda5
 now!
 sdb5's event counter: 0022
 sda5's event counter: 0022
 do_md_run() returned -22
 md0 stopped.
 
 Note: This RAID1 mirror works great under 2.2.14.  Under 2.4.2 I get the
 "returned -22" - What does this mean?

 -22 == EINVAL

It looks very much like raid1 *isn't* compiled into your kernel.
Can you show us your .config file?

Also /proc/mdstat when booted under 2.2 might help.
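
For a quick check (assuming the source tree you built from is in the
usual place):

   grep -E 'CONFIG_(BLK_DEV_MD|MD_RAID)' /usr/src/linux/.config
   cat /proc/mdstat

You want CONFIG_BLK_DEV_MD=y and CONFIG_MD_RAID1=y, and "raid1" listed
in the Personalities line of /proc/mdstat.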

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Proposed RAID5 design changes.

2001-03-16 Thread Neil Brown


(I've taken Alan and Linus off the Cc list. I'm sure they have plenty
to read, and may even be on linux-raid anyway).

On Thursday March 15, [EMAIL PROTECTED] wrote:
 I'm not too happy with the linux RAID5 implementation. In my
 opinion, a number of changes need to be made, but I'm not sure how to
 make them or get them accepted into the official distribution if I did
 make the changes.

I've been doing a fair bit of development with RAID5 lately and Linus
seems happy to accept patches from me, and I am happy to work with you
(or anyone else) to make improvements and then submit them to Linus.

There was a paper in 
   2000 USENIX Annual Technical Conference
titled
   Towards Availability Benchmarks: A Case Study of Software RAID
   Systems
by
   Aaron Brown and David A. Patterson of UCB.

They built a neat rig for testing fault tolerance and fault handling
in raid systems and compared Linux, Solaris, and WinNT.

Their particular comment about Linux was that it seemed to evict
drives on any excuse, just as you observe.  Apparently the other
systems tried much harder to keep drives in the working set if
possible.
It is certainly worth a read if you are interested in this.

My feeling about retrying after failed IO is that it should be done at
a lower level.  Once the SCSI or IDE level tells us that there is a
READ error, or a WRITE error, we should believe them.
Now it appears that this isn't true: at least not for all drivers.
So while I would not be strongly against putting that sort of re-try
logic at the RAID level, I think it would be worth the effort to find
out why it isn't being done at a lower level.

As for re-writing after a failed read, that certainly makes sense, and
probably wouldn't be too hard.
You would introduce into the "struct stripe_head" a way to mark a
drive as "read-failed".
Then on a read error, you mark that drive as read-failed in that
stripe only and schedule a retry.
If the retry succeeds, you then schedule a write, and if that
works, you just continue on happily.

You would need to make sure that you aren't too generous: once you
have had some number of read errors on a given drive you really should
fail that drive anyway.
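
In rough pseudo-code (names like read_errors, MAX_READ_ERRORS and
R5_ReadError are invented for illustration - none of this exists in the
current driver):

	/* in handle_stripe(), when the read on drive 'i' reports failure */
	if (++conf->disks[i].read_errors > MAX_READ_ERRORS) {
		/* too many strikes - give up on the drive as we do now */
		md_error(conf->mddev, conf->disks[i].dev);
	} else {
		/* failed in this stripe only; schedule a retry */
		set_bit(R5_ReadError, &sh->dev_state[i]);
		set_bit(STRIPE_HANDLE, &sh->state);
	}
	/* if the retried read succeeds, handle_stripe() would then write
	 * the recomputed block back to the drive and clear R5_ReadError */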

 3) Drives should not be kicked out of the array unless they are having
really persistent problems. I've an idea on how to define 'really
persistent' but it requires a bit of math to explain, so I'll only
go into it if someone is interested.

I'd certainly be interested in reading your math.

 
 Then there are two changes that might improve recovery performance:
 
 4) If the drive being kicked out is not totally inoperable and there is
a spare drive to replace it, try to copy the data from the failing
drive to the spare rather than reconstructing the data from all the
other disks. Fall back to full reconstruction if the error rate gets
too high.

That would actually be fairly easy to do.  Once you get the data
structures right so that the concept of a "partially failed" drive can
be clearly represented, it should be a cinch.

 
 5) When doing (4) use the SCSI 'copy' command if the drives are on the
same bus, and the host adapter and driver supports 'copy'. However,
this should be done with caution. 'copy' is not generally used and
any number of undetected firmware bugs might make it unreliable.
An additional category may need to be added to the device black list
to flag devices that can not do 'copy' reliably.

I'm very against this sort of idea.  Currently the raid code is
blissfully unaware of the underlying technology: it could be scsi,
ide, ramdisc, netdisk or anything else and RAID just doesn't care.
This I believe is one of the strengths of software RAID - it is
cleanly layered.
Firmware (==hardware) raid controllers often try to "know" a lot about
the underlying drive - even to the extent of getting the drives  to do
the XOR themselves I believe.  This undoubtedly makes the code more
complex, and can lead to real problems if you have firmware-mismatches
(and we have had a few of those).

Stick with "read" and "write" I think.  Everybody understands what
they mean so it is much more likely to work.
And really, our rebuild performance isn't that bad.  The other interesting result
for Linux in that paper is that rebuild made almost no impact on
performance with Linux, while it did for Solaris and NT (but Linux did
rebuild much more slowly).

So if you want to do this, dive right in and have a go.
I am certainly happy to answer any questions, review any code, and
forward anything that looks good to Linus.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: RaidHotAdd and reconstruction

2001-03-04 Thread Neil Brown

On Sunday March 4, [EMAIL PROTECTED] wrote:
 Hi folks,
 
 I have a two-disk RAID 1 test array that I was playing with. I then
 decided to hot add a third disk using ``raidhotadd''. The disk was added
 to the array, but as far as I could see, the RAID software did not start
 a reconstruction of that newly added disk. A skimmed through the driver
 code a bit, and could not really locate the point where the
 reconstruction was initiated. Am I missing something?
 
The third disk that you added became a hot spare.
You cannot add an extra active drive to a RAID array without using
mkraid.

In your case, you could edit /etc/raidtab to list the third drive as a
"failed-disk" instead of a "raid-disk", and set "nr-raid-disks" to
3.

Then run mkraid.  It shouldn't destroy any data, but the raid system
should automatically start building data onto the new drive.
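
That is, something along these lines (the device names are made up -
use whatever your two current mirrors and the new disk really are):

   raiddev /dev/md0
           raid-level              1
           nr-raid-disks           3
           nr-spare-disks          0
           persistent-superblock   1
           device                  /dev/sda1
           raid-disk               0
           device                  /dev/sdb1
           raid-disk               1
           device                  /dev/sdc1
           failed-disk             2

followed by "mkraid /dev/md0".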

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: [lvm-devel] Re: partitions for RAID volumes?

2001-02-27 Thread Neil Brown

On Monday February 26, [EMAIL PROTECTED] wrote:
 
 Actually, the LVM metadata is somewhat poorly layed out in this respect.
 The metadata is at the start of the device, and on occasion is not even
 sector aligned, AFAICS.  Also, the PE structures, while large powers of
 2 in size, are not guaranteed to be other than sector aligned.  They
 aligned with the END of the partition/device, and not the start.  I think
 under Linux, partitions are at least 1k multiples in size so the PEs will
 at least be 1k aligned...

MD/RAID volumes are always a multiple of 64k.  The amount of a single
device that is used will be rounded down using MD_NEW_SIZE_BLOCKS, defined as:

#define MD_RESERVED_BYTES   (64 * 1024)
#define MD_RESERVED_SECTORS (MD_RESERVED_BYTES / 512)
#define MD_RESERVED_BLOCKS  (MD_RESERVED_BYTES / BLOCK_SIZE)

#define MD_NEW_SIZE_SECTORS(x)  ((x & ~(MD_RESERVED_SECTORS - 1)) - MD_RESERVED_SECTORS)
#define MD_NEW_SIZE_BLOCKS(x)   ((x & ~(MD_RESERVED_BLOCKS - 1)) - MD_RESERVED_BLOCKS)

And the whole array will be the sum of 1 or more of these sizes.
So if each PE is indeed  sized "from 8KB to 512MB in powers of 2 and
unit KB", then all accesses should be properly aligned, so you
shouldn't have any problems (and if you apply the patch and get no
errors, then you will be doubly sure).

I thought a bit more about the consequences of unaligned accesses and
I think it is most significant when rebuilding parity.

RAID5 assumes that two different stripes with different sector
addresses will not overlap.
If all accesses are properly aligned, then this will be true.  Also if
all accesses are misaligned by the same amount (e.g. 4K accesses at
(4n+1)K offsets) then everything should work well too.
However, raid5 resync always aligns accesses so if the current
stripe-cache size were 4K, all sync accesses would be at (4n)K
offsets.
If there were (4n+1)K accesses happening at the same time, they would
not synchronise with the resync accesses and you could get corruption.

But it sounds like LVM is safe from this problem.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Going from 'old' (kernel v2.2.x) to 'new' (kernel v2.4.x) raidsystem

2001-02-26 Thread Neil Brown

On  February 26, [EMAIL PROTECTED] wrote:
 I'm currently running a standard v2.2.17 kernel w/ the 'accompanying'
 raid system (linear).
 
 Having the following /etc/mdtab file:
 /dev/md0  linear,4k,0,75f3bcd8/dev/sdb1 /dev/sdc1 /dev/sdd10 /dev/sde1 
/dev/sdf1 /dev/sdg1 
 
 And converted this to a newer /etc/raidtab:
 raiddev /dev/md0
   raid-level  linear
   nr-raid-disks   6
   persistent-superblock   1

"old" style linear arrays don't have a super block, so this should
read:
persistent-superblock   0

Given this, you should be able to run mkraid with complete safety as
it doesn't actually write anything to any drive.
You might be more comfortable running "raid0run" instead of "mkraid".
It is the same program, but when called as raid0run, it checks that
your configuration matches an old style raid0/linear setup.



   device  /dev/sdb1
   raid-disk   0
   device  /dev/sdc1
   raid-disk   1
   device  /dev/sdd10
   raid-disk   2
   device  /dev/sde1
   raid-disk   3
   device  /dev/sdf1
   raid-disk   4
   device  /dev/sdg1
   raid-disk   5
 
 And this is what cfdisk tells me about the partitions:
 sdb1Primary  Linux raid autodetect 6310.74
 sdc1Primary  Linux raid autodetect 1059.07
 sdd10   Logical  Linux raid autodetect 2549.84
 sde1Primary  Linux raid autodetect 9138.29
 sdf1Primary  Linux raid autodetect18350.60
 sdg1Primary  Linux raid autodetect16697.32
 

Autodetect cannot work with old-style arrays that don't have
superblocks. If you want autodetect, you will need to copy the data
somewhere, create a new array, and copy it all back.

NeilBrown

 
 But when I start the new kernel, it won't start the raidsystem...
 I tried the 'mkraid --upgrade' command, but that says something about
 no superblock stuff... Can't remember exactly what it says, but...
 
 And I'm to afraid to just execute 'mkraid' to create the array. I have
 over 50Gb of data that I can't backup somewhere...
 
 
 What can I do to keep the old data, but convert the array to the new
 raid system? 
 
 -- 
  Turbo __ _ Debian GNU Unix _IS_ user friendly - it's just 
  ^/ /(_)_ __  _   ___  __  selective about who its friends are 
  / / | | '_ \| | | \ \/ /   Debian Certified Linux Developer  
   _ /// / /__| | | | | |_| |Turbo Fredriksson   [EMAIL PROTECTED]
   \\\/  \/_|_| |_|\__,_/_/\_\ Stockholm/Sweden
 -- 
 Iran domestic disruption Treasury Panama assassination cracking
 genetic Albanian jihad president Noriega AK-47 Khaddafi ammonium DES
 [See http://www.aclu.org/echelonwatch/index.html for more about this]
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Urgent Problem: moving a raid

2001-02-25 Thread Neil Brown

On Sunday February 25, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
   OK, this time I really want to know how this should be handled.
  
  Well. it "should" be handled by re-writing various bits of raid code
  to make it all work more easily, but without doing that it "could" be
  handled by marking the partitions as holding raid components (0XFE I
  think) and then booting a kernel with AUTODETECT_RAID enabled. This
  approach ignores the device info in the superblock and finds
  everything properly.
 
 I do not use partitions, the whole /dev/hdb and /dev/hdd are the RAID
 drives (mainly because fdisk was unhappy handling the 60GB drives). Is
 there a way to do the above marking in this situation? How?

No, without partitions, that idea falls through.

With 2.4, you could boot with

   md=1,/dev/whatever,/dev/otherone

and it should build the array from the named drives.
There are ioctls available to do the same thing from user space, but
no user-level code has been released to use it yet.
The following patch, when applied to raidtools-0.90 should make
raidstart do the right thing, but it is a while since I wrote this
code and I only did minimal testing.

NeilBrown

--- ./raidlib.c 2000/05/19 03:42:47 1.1
+++ ./raidlib.c 2000/05/19 06:53:04
@@ -149,6 +149,24 @@
return 0;
 }
 
+static int do_newstart (int fd, md_cfg_entry_t *cfg)
+{
+   int i;
+   if (ioctl (fd, SET_ARRAY_INFO, 0UL)== -1)
+   return -1;
+   /* Ok, we have a new enough kernel (2.3.99pre9?) */
+   for (i=0; i<cfg->array.param.nr_disks ; i++) {
+   struct stat s;
+   md_disk_info_t info;
+   stat(cfg->device_name[i], &s);
+   memset(&info, 0, sizeof(info));
+   info.major = major(s.st_rdev);
+   info.minor = minor(s.st_rdev);
+   ioctl(fd, ADD_NEW_DISK, (unsigned long)&info);
+   }
+   return (ioctl(fd, RUN_ARRAY, 0UL)!= 0);
+}
+
 int do_raidstart_rw (int fd, char *dev)
 {
int func = RESTART_ARRAY_RW;
@@ -380,10 +398,12 @@
{
struct stat s;
 
-   stat (cfg->device_name[0], &s);
-
	fd = open_or_die(cfg->md_name);
-   if (do_mdstart (fd, cfg->md_name, s.st_rdev)) rc++;
+   if (do_newstart (fd, cfg)) {
+   stat (cfg->device_name[0], &s);
+
+   if (do_mdstart (fd, cfg->md_name, s.st_rdev)) rc++;
+   }
break;
}
 

   
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: MD reverting to old Raid type

2001-02-25 Thread Neil Brown

On Sunday February 25, [EMAIL PROTECTED] wrote:
 Linux 2.4.1/RAIDtools2 0.90
 
 I have 4 ide disks which have identical partition layouts.
 RAID is working successfully, its even booting RAID1.
 
 I created a RAID5 set on a set of 4 partitions, which works OK. 
 I then destroyed that set and updated it so it was 2x RAID0
 partitions (so I can mirror them into a RAID10).
 
 The problem is when I raidstop, then raidstart either of the new RAID0
 mds it reverts back to the RAID5 (I originally noticed it when I rebooted).
snip
 
 
 Any idea?

In the raidtab file where you describe the raid0 arrays, make sure to
say:

  persistent-superblock = 1

(or whatever the correct syntax is).  The default is 0 (== no) for
back compat I assume, and so your raid5 superblock doesn't get
overwritten.
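
i.e. each raid0 entry should look roughly like this (the devices are
only placeholders for whichever partitions you are actually using):

   raiddev /dev/md0
           raid-level              0
           nr-raid-disks           2
           chunk-size              32
           persistent-superblock   1
           device                  /dev/hda5
           raid-disk               0
           device                  /dev/hdc5
           raid-disk               1

With that, mkraid writes a raid0 superblock over the old raid5 one, and
raidstart (or a reboot) will see the right thing.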

NeilBrown

 
 Cheers, Suad
 --
 
 
 
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Newbie questions

2001-02-21 Thread Neil Brown

On Wednesday February 21, [EMAIL PROTECTED] wrote:
 Hello,
 
 This is my first time playing with software raid so sorry if I sound dumb.
 What I have is a remote device that only has one hard drive.  There is no
 ability for a second.  Can I use the raidtools package to setup a raid-1
 mirror on two partitions on the same drive?  For example /dev/hda1 and
 /dev/hda2 consist of the raid 1 set, /dev/hda3 swap, and /dev/hda4 for
 the rest of the OS.  I know raid is normally used with multiple drives,
 but is this possible?

Yes, it is possible, but would it help?
If your drive fails, then you lose the data anyway.
I guess this would protect against bad sectors in one part of the
drive, but my experience is that once you get a bad sector or two your
drive isn't long for this world.

Also, write speed would be appalling as the head would be zipping back
and forth between the two partitions.

However, the best answer is "try it", and see if it does what you
want.
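
If you do try it, the raidtab is nothing special - roughly (sketch
only, using the partitions you describe):

   raiddev /dev/md0
           raid-level              1
           nr-raid-disks           2
           nr-spare-disks          0
           persistent-superblock   1
           device                  /dev/hda1
           raid-disk               0
           device                  /dev/hda2
           raid-disk               1

then mkraid /dev/md0 and make a filesystem on it as usual.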


 
 P.S.  Could you please respond to [EMAIL PROTECTED]  I am not on the
 list and could not find any info on how to join.
 
echo help | mail [EMAIL PROTECTED]

should get you started.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Status of raid.

2001-02-09 Thread Neil Brown

On Friday February 9, [EMAIL PROTECTED] wrote:
 Greetings,
 
 I'm getting ready to put kernel 2.4.1 on my server at home.  I have some
 questions about the status of RAID in 2.4.1.  Sorry to be dense but I
 couldn't glean the answers to these questions from my search of the
 mailing list.
 
 1. It appears that as of 2.4.1 RAID is finally part of the standard
 kernel.  Is this correct?

Yes, though you will need to wait for 2.4.2 if you want to compile md
as a module.

 2. Which raidtools package do I use and where can I get it?  Or is it,
 too, enclosed with the kernel?

The same ones you would use with a patched 2.2, i.e. 0.90.

 3. Does the RAID in 2.4.1 have the read-balancing patch?
 

Yes, that patch is in.

NeilBrown


 --
   / C. R. (Charles) Oldham | NCA Commission on Accreditation and \
  / Director of Technology  |   School Improvement \
 / [EMAIL PROTECTED]  | V:480-965-8703  F:480-965-9423\
 
 
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: mkraid problems

2001-01-04 Thread Neil Brown

On Thursday January 4, [EMAIL PROTECTED] wrote:
 OK, I followed the instructions from linuxraid.org and
 it seems my kernel is now installed because now it
 doesn't boot.  Error message states that the root fs
 cannot be mounted on 08:03 then it halts.  What did I
 miss now?
 
 Chris

08:03 is /dev/sda3.  Is that what you would expect to be booting off -
the first scsi disc?   Is the scsi controller driver compiled into the
kernel properly?  Do the boot messages show that the scsi controller
was found?

NeilBrown

 
 
 --- Alvin Oga [EMAIL PROTECTED] wrote:
  
  hi ya chris...
  
  did you follow the instructions on www.linuxraid.org
  ???
  
  the generic raid patch to generic 2.2.18 fails to
  patch
  properly.. try to follow the steps at
  linuxraid.org
  
  that was a very good site...
  
  have fun
  alvin
  http://www.linux-1U.net ...  1U Raid5 
  
  On Wed, 3 Jan 2001, Chris Winczewski wrote:
  
   Here is /etc/raidtab and /proc/mdstat
   
   chris
   
   raidtab:
   # Sample raid-5 configuration
   raiddev   /dev/md0
   raid-level5
   nr-raid-disks 3
   chunk-size32
   
   # Parity placement algorithm
   
   #parity-algorithm left-asymmetric
   parity-algorithm  left-symmetric
   #parity-algorithm right-asymmetric
   #parity-algorithm right-symmetric
   
   # Spare disks for hot reconstruction
   nr-spare-disks0
   persistent-superblock 1
   
   device/dev/sdb1
   raid-disk 0
   
   device/dev/sdc1
   raid-disk 1
   
   device/dev/sdd1
   raid-disk 2
   
   
   mdstat:
   Personalities : [4 raid5]
   read_ahead not set
   md0 : inactive
   md1 : inactive
   md2 : inactive
   md3 : inactive
   
   
   --- Neil Brown [EMAIL PROTECTED] wrote:
On Wednesday January 3,
  [EMAIL PROTECTED]
wrote:
 mkraid aborts with no usefull error mssg on
  screen
or
 in the syslog.  My /etc/raidtab is set up
correctly
 and I am using raidtools2 with kernel 2.2.18
  with
raid
 patch installed.  Any ideas?
 
 Chris
 
Please include a copy of 
  /etc/raidtab
  /proc/mdstat

NeilBrown
   
   
   __
   Do You Yahoo!?
   Yahoo! Photos - Share your holiday photos online!
   http://photos.yahoo.com/
   -
   To unsubscribe from this list: send the line
  "unsubscribe linux-raid" in
   the body of a message to [EMAIL PROTECTED]
   
  
 
 
 __
 Do You Yahoo!?
 Yahoo! Photos - Share your holiday photos online!
 http://photos.yahoo.com/
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - raid5 in 2.4.0-test13 - substantial rewrite with substantial performance increase

2000-12-20 Thread Neil Brown


Linus
  here is a rather large patch for raid5 in 2.4.0-test13-pre3.
  It is a substantial re-write of the stripe-cache handling code,
  which is the heart of the raid5 module.

  I have been sitting on this for a while so that others can test it
  (a few have) and so that I would have had quite a lot of experience
  with it myself.
  I am now happy that it is ready for inclusion in 2.4.0-testX.  I
  hope you will be too.

  What it does:

- processing of raid5 requests can require several stages
  (pre-read, calc parity, write).  To accommodate this, request
  handling is based on a simple state machine.
  Prior to this patch, state was explicitly recorded - there were
  different phases "READY", "READOLD", "WRITE" etc.
  With this patch, the state is made implicit in the b_state of
  the buffers in the cache.  This makes the processing code
  (handle_stripe) more flexable, and it is easier to add requests
  to a stripe at any stage of processing.
- With the above change, we no longer need to wait for a stripe to
  "complete" before adding a new request.  We at most need to wait
  for a spinlock to be released.  This allows more parallelism and
  provides throughput speeds many times the current speed.

- Without this patch, two buffers are allocated on each stripe in
  the cache for each underlying device in the array.  This is
  wasteful.  With the patch, only one buffer is needed per stripe,
  per device.

- This patch creates a couple of linked lists of stripes, one for
  stripes that are inactive, and one for stripe that need to be
  processed by raid5d.  This obviates the need to search the hash
  table for the stripes of interest in raid5d or when looking
  for a free stripe.

  There is more work to be done to bring raid5 performance up to the
  level of 2.2+mingos-patches, but this is a first, large, step on the
  way. 

NeilBrown


(2000 line patch included in mail to Linus, but removed from mail to
linux-raid.
If you want it, try
   http://www.cse.unsw.edu.au/~neilb/patch/linux/2.4.0-test13-pre3/patch-A-raid5
)
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - raid1 next drive selection.

2000-12-10 Thread Neil Brown


Linus (et al)

 The raid1 code has a concept of finding a "next available drive".  It
 uses this for read balancing.
 Currently, this is implemented via a simple linked list that links
 the working drives together.

 However, there is no locking to make sure that the list does not get
 modified by two processors at the same time, and hence corrupted
 (though it is changed so  rarely that that is unlikely).
 Also, when choosing a drive to read from for rebuilding a new spare,
 the "last_used" drive is used, even though that might be a drive which
 failed recently.  i.e. there is no check that the "last_used" drive is
 actually still valid.  I managed to exploit this to get the kernel
 into a tight spin.

 This patch discards the linked list and just walks the array ignoring
 failed drives.  It also makes sure that "last_used" is always
 validated before being used.
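
A rough sketch of that selection loop (illustrative names only; the real
change is in the patch below):

struct mirror_info { int operational; int write_only; };

/* Walk backwards through the array from 'start', skipping failed or
 * write-only drives; fall back to 'start' if nothing better is found. */
static int next_operational(struct mirror_info *mirrors, int raid_disks, int start)
{
    int disk = start;

    do {
        if (disk <= 0)
            disk = raid_disks;
        disk--;
        if (mirrors[disk].operational && !mirrors[disk].write_only)
            return disk;
    } while (disk != start);

    return start;
}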

 patch against 2.4.0-test12-pre8

NeilBrown

--- ./include/linux/raid/raid1.h2000/12/10 22:38:20 1.1
+++ ./include/linux/raid/raid1.h2000/12/10 22:38:25 1.2
@@ -7,7 +7,6 @@
int number;
int raid_disk;
kdev_t  dev;
-   int next;
int sect_limit;
int head_position;
 
--- ./drivers/md/raid1.c2000/12/10 22:36:34 1.2
+++ ./drivers/md/raid1.c2000/12/10 22:38:25 1.3
@@ -463,16 +463,12 @@
if (conf-resync_mirrors)
goto rb_out;

-   if (conf-working_disks  2) {
-   int i = 0;
-   
-   while( !conf-mirrors[new_disk].operational 
-   (i  MD_SB_DISKS) ) {
-   new_disk = conf-mirrors[new_disk].next;
-   i++;
-   }
-   
-   if (i = MD_SB_DISKS) {
+
+   /* make sure that disk is operational */
+   while( !conf-mirrors[new_disk].operational) {
+   if (new_disk = 0) new_disk = conf-raid_disks;
+   new_disk--;
+   if (new_disk == disk) {
/*
 * This means no working disk was found
 * Nothing much to do, lets not change anything
@@ -480,11 +476,13 @@
 */

new_disk = conf-last_used;
+
+   goto rb_out;
}
-   
-   goto rb_out;
}
-
+   disk = new_disk;
+   /* now disk == new_disk == starting point for search */
+   
/*
 * Don't touch anything for sequential reads.
 */
@@ -501,16 +499,16 @@

if (conf-sect_count = conf-mirrors[new_disk].sect_limit) {
conf-sect_count = 0;
-   
-   while( new_disk != conf-mirrors[new_disk].next ) {
-   if ((conf-mirrors[new_disk].write_only) ||
-   (!conf-mirrors[new_disk].operational) )
-   continue;
-   
-   new_disk = conf-mirrors[new_disk].next;
-   break;
-   }
-   
+
+   do {
+   if (new_disk=0)
+   new_disk = conf-raid_disks;
+   new_disk--;
+   if (new_disk == disk)
+   break;
+   } while ((conf-mirrors[new_disk].write_only) ||
+(!conf-mirrors[new_disk].operational));
+
goto rb_out;
}

@@ -519,8 +517,10 @@

/* Find the disk which is closest */

-   while( conf-mirrors[disk].next != conf-last_used ) {
-   disk = conf-mirrors[disk].next;
+   do {
+   if (disk = 0)
+   disk = conf-raid_disks;
+   disk--;

if ((conf-mirrors[disk].write_only) ||
(!conf-mirrors[disk].operational))
@@ -534,7 +534,7 @@
current_distance = new_distance;
new_disk = disk;
}
-   }
+   } while (disk != conf-last_used);
 
 rb_out:
conf-mirrors[new_disk].head_position = this_sector + sectors;
@@ -702,16 +702,6 @@
return sz;
 }
 
-static void unlink_disk (raid1_conf_t *conf, int target)
-{
-   int disks = MD_SB_DISKS;
-   int i;
-
-   for (i = 0; i  disks; i++)
-   if (conf-mirrors[i].next == target)
-   conf-mirrors[i].next = conf-mirrors[target].next;
-}
-
 #define LAST_DISK KERN_ALERT \
 "raid1: only one disk left and IO error.\n"
 
@@ -735,7 +725,6 @@
mdp_super_t *sb = mddev-sb;
 
mirror-operational = 0;
-   unlink_disk(conf, failed);
mark_disk_faulty(sb-disks+mirror-number);
mark_disk_nonsync(sb-disks+mirror-number);

PATCH - md device reference counting

2000-12-10 Thread Neil Brown


Linus (et al),

  An md device needs to know if it is in-use so that it doesn't allow
  raidstop while still mounted.  Previously it did this by looking for
  a superblock on the device.  This is a bit inelegant and doesn't
  generalise.

  With this patch, it tracks opens and closes (get and release) and
  does not allow raidstop while there is any active access.

  This leaves open the possibility of syncing out the superblocks on
  the last close, which might happen in a later patch.

  One interesting gotcha in this patch is that the START_ARRAY ioctl
  (used by raidstart) can potentially start a completely different
  array, as it decides which array to start based on a value in the
  raid superblock.
 
  To get the reference counts right, I needed to tell the code which
  array I think I am starting.  If it actually starts that one, it sets
  the initial reference count to 1, otherwise it sets it to 0.
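
A minimal user-space sketch of that counting scheme (hypothetical toy_*
names; the real code keeps the count in the mddev structure):

#include <stdatomic.h>

struct toy_mddev { atomic_int active; };

static void toy_open(struct toy_mddev *m)    { atomic_fetch_add(&m->active, 1); }
static void toy_release(struct toy_mddev *m) { atomic_fetch_sub(&m->active, 1); }

/* Refuse to stop while anyone other than the stopping opener holds the device. */
static int toy_stop(struct toy_mddev *m)
{
    if (atomic_load(&m->active) > 1)
        return -1;              /* -EBUSY in the kernel */
    /* ... tear the array down ... */
    return 0;
}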


 patch against 2.4.0-test12-pre8

NeilBrown



--- ./include/linux/raid/md_k.h 2000/12/10 22:54:16 1.1
+++ ./include/linux/raid/md_k.h 2000/12/10 23:21:26 1.2
@@ -206,6 +206,7 @@
struct semaphorereconfig_sem;
struct semaphorerecovery_sem;
struct semaphoreresync_sem;
+   atomic_tactive;
 
atomic_trecovery_active; /* blocks scheduled, but not 
written */
md_wait_queue_head_trecovery_wait;
--- ./drivers/md/md.c   2000/12/10 22:37:18 1.3
+++ ./drivers/md/md.c   2000/12/10 23:21:26 1.4
@@ -203,6 +203,7 @@
init_MUTEX(mddev-resync_sem);
MD_INIT_LIST_HEAD(mddev-disks);
MD_INIT_LIST_HEAD(mddev-all_mddevs);
+   atomic_set(mddev-active, 0);
 
/*
 * The 'base' mddev is the one with data NULL.
@@ -1718,12 +1719,20 @@
 
 #define STILL_MOUNTED KERN_WARNING \
 "md: md%d still mounted.\n"
+#defineSTILL_IN_USE \
+"md: md%d still in use.\n"
 
 static int do_md_stop (mddev_t * mddev, int ro)
 {
int err = 0, resync_interrupted = 0;
kdev_t dev = mddev_to_kdev(mddev);
 
+   if (atomic_read(mddev-active)1) {
+   printk(STILL_IN_USE, mdidx(mddev));
+   OUT(-EBUSY);
+   }
+ 
+   /* this shouldn't be needed as above would have fired */
if (!ro  get_super(dev)) {
printk (STILL_MOUNTED, mdidx(mddev));
OUT(-EBUSY);
@@ -1859,8 +1868,10 @@
  * the 'same_array' list. Then order this list based on superblock
  * update time (freshest comes first), kick out 'old' disks and
  * compare superblocks. If everything's fine then run it.
+ *
+ * If "unit" is allocated, then bump its reference count
  */
-static void autorun_devices (void)
+static void autorun_devices (kdev_t countdev)
 {
struct md_list_head candidates;
struct md_list_head *tmp;
@@ -1902,6 +1913,12 @@
continue;
}
mddev = alloc_mddev(md_kdev);
+   if (mddev == NULL) {
+   printk("md: cannot allocate memory for md drive.\n");
+   break;
+   }
+   if (md_kdev == countdev)
+   atomic_inc(mddev-active);
printk("created md%d\n", mdidx(mddev));
ITERATE_RDEV_GENERIC(candidates,pending,rdev,tmp) {
bind_rdev_to_array(rdev, mddev);
@@ -1945,7 +1962,7 @@
 #define AUTORUNNING KERN_INFO \
 "md: auto-running md%d.\n"
 
-static int autostart_array (kdev_t startdev)
+static int autostart_array (kdev_t startdev, kdev_t countdev)
 {
int err = -EINVAL, i;
mdp_super_t *sb = NULL;
@@ -2002,7 +2019,7 @@
/*
 * possibly return codes
 */
-   autorun_devices();
+   autorun_devices(countdev);
return 0;
 
 abort:
@@ -2077,7 +2094,7 @@
md_list_add(rdev-pending, pending_raid_disks);
}
 
-   autorun_devices();
+   autorun_devices(-1);
}
 
dev_cnt = -1; /* make sure further calls to md_autodetect_dev are ignored */
@@ -2607,6 +2624,8 @@
err = -ENOMEM;
goto abort;
}
+   atomic_inc(mddev-active);
+
/*
 * alloc_mddev() should possibly self-lock.
 */
@@ -2640,7 +2659,7 @@
/*
 * possibly make it lock the array ...
 */
-   err = autostart_array((kdev_t)arg);
+   err = autostart_array((kdev_t)arg, dev);
if (err) {
printk("autostart %s failed!\n",
partition_name((kdev_t)arg));
@@ -2820,14 +2839,26 @@
 static int md_open (struct inode *inode, struct file *file)
 {
/*

linus

2000-12-10 Thread Neil Brown


Linus (et al),

 The raid code wants to be the sole accessor of any devices that are
 combined into the array, i.e. it wants to lock those devices against
 other use.

 It currently tries to do this by creating an inode that appears to
 be associated with that device.
 This no longer has any useful effect (and I don't think it has for a
 while, though I haven't dug into the history).

 I have changed the lock_rdev code to simply do a blkdev_get when
 attaching the device, and a blkdev_put when releasing it.  This
 at least makes sure that if the device is in a module it won't be
 unloaded.
 Any further exclusive access control will need to go into the
 blkdev_{get,put} routines at some stage I think.
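
The idea, reduced to a toy reference count (illustrative names; the real
calls are blkdev_get/blkdev_put as in the patch below):

struct toy_handle { int refs; };
struct toy_rdev   { struct toy_handle *bdev; };

static struct toy_handle *toy_get(struct toy_handle *h) { h->refs++; return h; }
static void               toy_put(struct toy_handle *h) { h->refs--; }

/* While the device is part of the array we hold a handle on it, so it
 * cannot go away (or have its module unloaded) underneath us. */
static void toy_lock_rdev(struct toy_rdev *rdev, struct toy_handle *dev)
{
    rdev->bdev = toy_get(dev);     /* like blkdev_get() when attaching */
}

static void toy_unlock_rdev(struct toy_rdev *rdev)
{
    toy_put(rdev->bdev);           /* like blkdev_put() when releasing */
    rdev->bdev = 0;
}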

 patch against 2.4.0-test12-pre8

NeilBrown


--- ./include/linux/raid/md_k.h 2000/12/10 23:21:26 1.2
+++ ./include/linux/raid/md_k.h 2000/12/10 23:33:07 1.3
@@ -165,8 +165,7 @@
mddev_t *mddev; /* RAID array if running */
unsigned long last_events;  /* IO event timestamp */
 
-   struct inode *inode;/* Lock inode */
-   struct file filp;   /* Lock file */
+   struct block_device *bdev;  /* block device handle */
 
mdp_super_t *sb;
unsigned long sb_offset;
--- ./drivers/md/md.c   2000/12/10 23:21:26 1.4
+++ ./drivers/md/md.c   2000/12/10 23:33:08 1.5
@@ -657,32 +657,25 @@
 static int lock_rdev (mdk_rdev_t *rdev)
 {
int err = 0;
+   struct block_device *bdev;
 
-   /*
-* First insert a dummy inode.
-*/
-   if (rdev-inode)
-   MD_BUG();
-   rdev-inode = get_empty_inode();
-   if (!rdev-inode)
+   bdev = bdget(rdev-dev);
+   if (bdev == NULL)
return -ENOMEM;
-   /*
-* we dont care about any other fields
-*/
-   rdev-inode-i_dev = rdev-inode-i_rdev = rdev-dev;
-   insert_inode_hash(rdev-inode);
-
-   memset(rdev-filp, 0, sizeof(rdev-filp));
-   rdev-filp.f_mode = 3; /* read write */
+   err = blkdev_get(bdev, FMODE_READ|FMODE_WRITE, 0, BDEV_FILE);
+   if (!err) {
+   rdev-bdev = bdev;
+   }
return err;
 }
 
 static void unlock_rdev (mdk_rdev_t *rdev)
 {
-   if (!rdev-inode)
+   if (!rdev-bdev)
MD_BUG();
-   iput(rdev-inode);
-   rdev-inode = NULL;
+   blkdev_put(rdev-bdev, BDEV_FILE);
+   bdput(rdev-bdev);
+   rdev-bdev = NULL;
 }
 
 static void export_rdev (mdk_rdev_t * rdev)
@@ -1150,7 +1143,7 @@
 
 abort_free:
if (rdev-sb) {
-   if (rdev-inode)
+   if (rdev-bdev)
unlock_rdev(rdev);
free_disk_sb(rdev);
}
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: [OOPS] raidsetfaulty - raidhotremove - raidhotadd

2000-12-06 Thread Neil Brown

On Wednesday December 6, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  
  Could you try this patch and see how it goes?
 
 Same result!
 

Ok... must be something else... I tried again to reproduce it, and
this time I succeeded.
The problem happens when you try to access the last 128k of a raid1
array that has been reconstructed since the last reboot.

The reconstruction creates a sliding 3-part window which is 3*128k
wide.

The leading pane ("pending") may have some outstanding I/O requests,
but no new requests will be added.
The middle pane ("ready") has no outstanding I/O requests, and gets no
new I/O requests, but does get new rebuild requests.
The trailing pane ("active") has outstanding rebuild requests, but no
new I/O requests will be added.

This window slides forward through the address space keeping IO and
reconstruction quite separate.

However, the reconstruction process finishes with "ready" still
covering the tail end of the address space.  "active" has fallen off
the end, and "pending" has collapsed down to an empty pane, but "ready"
is still there.

When rebuilding after an unclean shutdown, this gets cleaned up
properly, but when rebuilding onto a spare, it doesn't.
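
To make the window concrete, a rough sketch of the bookkeeping (hypothetical
names; the real fields live in raid1's conf structure):

/* Three panes, lowest sector first:
 *   [start_active,  start_ready)   "active"  - rebuild I/O outstanding
 *   [start_ready,   start_pending) "ready"   - rebuild requests being added
 *   [start_pending, start_future)  "pending" - old normal I/O may still finish
 * Normal I/O below start_active or at/after start_future proceeds at once. */
struct toy_window {
    unsigned long start_active;
    unsigned long start_ready;
    unsigned long start_pending;
    unsigned long start_future;
};

/* A new normal request must wait while its sector falls inside the window. */
static int toy_must_wait(const struct toy_window *w, unsigned long rsector)
{
    return rsector >= w->start_active && rsector < w->start_future;
}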

The attached patch, which can also be found at:

 http://www.cse.unsw.edu.au/~neilb/patches/linux/2.4.0-test12-pre6/patch-G-raid1rebuild

fixed the problem.  It should apply to any recent 2.4.0-test kernel.

Please try it and confirm that it works.

Thanks,

NeilBrown


--- ./drivers/md/raid1.c2000/12/06 22:34:27 1.3
+++ ./drivers/md/raid1.c2000/12/06 22:37:04 1.4
@@ -798,6 +798,32 @@
}
 }
 
+static void close_sync(raid1_conf_t *conf)
+{
+   mddev_t *mddev = conf->mddev;
+   /* If reconstruction was interrupted, we need to close the "active" and "pending"
+    * holes.
+    * we know that there are no active rebuild requests, os cnt_active == cnt_ready ==0
+    */
+   /* this is really needed when recovery stops too... */
+   spin_lock_irq(&conf->segment_lock);
+   conf->start_active = conf->start_pending;
+   conf->start_ready = conf->start_pending;
+   wait_event_lock_irq(conf->wait_ready, !conf->cnt_pending, conf->segment_lock);
+   conf->start_active =conf->start_ready = conf->start_pending = conf->start_future;
+   conf->start_future = mddev->sb->size+1;
+   conf->cnt_pending = conf->cnt_future;
+   conf->cnt_future = 0;
+   conf->phase = conf->phase ^1;
+   wait_event_lock_irq(conf->wait_ready, !conf->cnt_pending, conf->segment_lock);
+   conf->start_active = conf->start_ready = conf->start_pending = conf->start_future = 0;
+   conf->phase = 0;
+   conf->cnt_future = conf->cnt_done;;
+   conf->cnt_done = 0;
+   spin_unlock_irq(&conf->segment_lock);
+   wake_up(&conf->wait_done);
+}
+
 static int raid1_diskop(mddev_t *mddev, mdp_disk_t **d, int state)
 {
int err = 0;
@@ -910,6 +936,7 @@
 * Deactivate a spare disk:
 */
case DISKOP_SPARE_INACTIVE:
+   close_sync(conf);
sdisk = conf-mirrors + spare_disk;
sdisk-operational = 0;
sdisk-write_only = 0;
@@ -922,7 +949,7 @@
 * property)
 */
case DISKOP_SPARE_ACTIVE:
-
+   close_sync(conf);
sdisk = conf-mirrors + spare_disk;
fdisk = conf-mirrors + failed_disk;
 
@@ -1213,27 +1240,7 @@
conf-resync_mirrors = 0;
}
 
-   /* If reconstruction was interrupted, we need to close the "active" and "pending"
-    * holes.
-    * we know that there are no active rebuild requests, os cnt_active == cnt_ready ==0
-    */
-   /* this is really needed when recovery stops too... */
-   spin_lock_irq(&conf->segment_lock);
-   conf->start_active = conf->start_pending;
-   conf->start_ready = conf->start_pending;
-   wait_event_lock_irq(conf->wait_ready, !conf->cnt_pending, conf->segment_lock);
-   conf->start_active =conf->start_ready = conf->start_pending = conf->start_future;
-   conf->start_future = mddev->sb->size+1;
-   conf->cnt_pending = conf->cnt_future;
-   conf->cnt_future = 0;
-   conf->phase = conf->phase ^1;
-   wait_event_lock_irq(conf->wait_ready, !conf->cnt_pending, conf->segment_lock);
-   conf->start_active = conf->start_ready = conf->start_pending = conf->start_future = 0;
-   conf->phase = 0;
-   conf->cnt_future = conf->cnt_done;;
-   conf->cnt_done = 0;
-   spin_unlock_irq(&conf->segment_lock);
-   wake_up(&conf->wait_done);
+   close_sync(conf);
 
up(mddev-recovery_sem);
raid1_shrink_buffers(conf);
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Ex2FS unable to read superblock

2000-12-03 Thread Neil Brown

On Sunday December 3, [EMAIL PROTECTED] wrote:
 I'm new to the raid under linux world, and had a question.  Sorry if several
 posts have been made by me previously, I had some trouble subscribing to the
 list...
 
  I successfully installed redhat 6.2 with raid 0 for two drives on a sun
 ultra 1.  However i'm trying to rebuild the kernel, and thought i'd play
 with 2.4test11 since it has the raid code built in, but to no avail.  while
 it will auto detect the raid drives fine from what i can tell, the kernel
 always panics with "EX2FS Unable To Read Superblock"  any thoughts on what i
 might be doing wrong that is causing this error.  sorry if this has been
 brought up before
 
 dave
 
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]

Details might help.
e.g. a copy of the boot logs when booting under 2.2.whatever and it
working, and similar logs with it booting under 2.4.0test11 and it not
working, though they may not be as easy to get if your root
filesystem doesn't come up.

There were issues in the 2.2 raid code on Sparc hardware which may
have been resolved by redhat, and may have been resolved differently
in 2.4.
Look for a line like:

  md.c: sizeof(mdp_super_t) = 4096

is the number 4096 in both cases?  If not, then that is probably the
problem.  It is quite possible that raid in 2.4 on sparc is not
upwards compatible with raid in redhat6.2 on sparc.

NeilBrown

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: autodetect question

2000-12-01 Thread Neil Brown

On Friday December 1, [EMAIL PROTECTED] wrote:
 If I have all of MD as a module and autodetect raid enabled, do the MD
 drives that the machine has get detected and setup
 1) at boot
 2) at module load
 or
 3) it doesn't

3.  It doesn't.
  Rationale: by the time you are loading a module, you have enough
  user space running to do the auto-detect stuff in user space.
  The simple fact that no-one has written autodetect code for user
  space yet is not a kernel issue.  I will when I need it, unless
  someone beats me to it.

 
 
 This is another question.  Is it possible to change the code so that
 autodetect works when the whole disk is part of the raid instead of a
 partition under it?  (ie: check drives that the kernel couldn't find a
 partition table on)
 
http://bible.gospelcom.net/cgi-bin/bible?passage=1COR+10:2

1 Corinthians 10

23 "Everything is permissible"--but not everything is
   beneficial. "Everything is permissible"--but not everything is
   constructive.  

Yes you could, but I don't think that you should.
If you want to boot off a raid array of whole-drives, then enable
CONFIG_MD_BOOT and boot with 
 md=0,/dev/hda,/dev/hdb
or similar.
If you want this for a non-boot drive, configure it from user-space.

NeilBrown

coming soon: partitioning for md devices:
  md=0,/dev/hda,/dev/hdb root=/dev/md0p1 swap=/dev/md0p2

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: FW: how to upgrade raid disks from redhat 4.1 to redhat 6.2

2000-12-01 Thread Neil Brown

On Friday December 1, [EMAIL PROTECTED] wrote:
  -Original Message-
  From: Carly Yang [mailto:[EMAIL PROTECTED]]
  Sent: Friday, December 01, 2000 2:42 PM
  To: [EMAIL PROTECTED]
  Subject: how to upgrade raid disks from redhat 4.1 to redhat 6.2
  
  
  Dear Gregory
  I have a system which runs on redhat 4.1 with two scsi hard 
  disks making a RAID0 partition. I added commands in 
  /etc/rc.d/rc.local as the following:
  /sbin/mdadd /dev/md0 /dev/sda1 /dev/sdb1
  /sbin/mdrun -p0 /dev/md0
  e2fsck -y /dev/md0
  mount -t ext2 /dev/md0 /home
   
  So I can access the raid disk.
  Recently I upgrade the system to redhat 6.2, I made the 
  raidtab in /etc/ as following:
  raiddev /dev/md0
  raid-level            0
  nr-raid-disks         2
  persistent-superblock 0
  chunk-size            8
   
  device                /dev/sda1
  raid-disk             0
  device                /dev/sdb1
  raid-disk             1
   
  I run "mkraid --upgrade /dev/md0" to upgrade raid partion to 
  new system. But it report error as :
  Cannot upgrade magic-less superblock on /dev/sda1 ...

I think you want raid0run.  Check the man page and see if it works for
you.

NeilBrown


  mkraid: aborted, see syslog and /proc/mdstat for potential clues.
  run "cat /proc/mdstat" get "personalities:
  read-ahead not set
  unused devices: <none>
   
  I run "mkraid" in mandrake 7.1 and get the same result, I 
  don't know how to make a raid partition upgrade now. Could 
  tell how to do that? 
  I read your  Linux-RAID-FAQ, I think you can give me some 
  good advice.
   
  Yours Sincerely
   
  Carl
   
  
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - md - initialisation cleanup

2000-11-29 Thread Neil Brown


Linus,
 here is a patch for test12 which cleans up the initialisation of raid
 personalities.  I didn't include it in the previous raid init cleanup
 because I hadn't figured out the inner mysteries of link order
 completely.  The linux-kbuild list helped there.

 This patch arranges that each personality auto-initialises, and
 makes sure that all personalities are initialised before md.o gets
 initialised. 

 An earlier form of this (which didn't get the Makefile quite right)
 went into test11ac??.

NeilBrown


--- ./drivers/md/Makefile   2000/11/29 05:46:23 1.3
+++ ./drivers/md/Makefile   2000/11/29 06:04:25 1.4
@@ -16,10 +16,13 @@
 obj-n  :=
 obj-   :=
 
-# NOTE: xor.o must link *before* md.o so that auto-detect
-# of raid5 arrays works (and doesn't Oops).  Fortunately
-# they are both export-objs, so setting the order here
-# works.
+# Note: link order is important.  All raid personalities
+# and xor.o must come before md.o, as they each initialise 
+# themselves, and md.o may use the personalities when it 
+# auto-initialised.
+# The use of MIX_OBJS allows link order to be maintained even
+# though some are export-objs and some aren't.
+
 obj-$(CONFIG_MD_LINEAR)+= linear.o
 obj-$(CONFIG_MD_RAID0) += raid0.o
 obj-$(CONFIG_MD_RAID1) += raid1.o
@@ -28,10 +31,11 @@
 obj-$(CONFIG_BLK_DEV_LVM)  += lvm-mod.o
 
 # Translate to Rules.make lists.
-O_OBJS := $(filter-out $(export-objs), $(obj-y))
-OX_OBJS:= $(filter $(export-objs), $(obj-y))
-M_OBJS := $(sort $(filter-out $(export-objs), $(obj-m)))
-MX_OBJS:= $(sort $(filter  $(export-objs), $(obj-m)))
+active-objs:= $(sort $(obj-y) $(obj-m))
+
+O_OBJS := $(obj-y)
+M_OBJS := $(obj-m)
+MIX_OBJS   := $(filter $(export-objs), $(active-objs))
 
 include $(TOPDIR)/Rules.make
 
--- ./drivers/md/md.c   2000/11/29 04:55:47 1.4
+++ ./drivers/md/md.c   2000/11/29 06:04:25 1.5
@@ -3576,12 +3576,6 @@
create_proc_read_entry("mdstat", 0, NULL, md_status_read_proc, NULL);
 #endif
 }
-void hsm_init (void);
-void translucent_init (void);
-void linear_init (void);
-void raid0_init (void);
-void raid1_init (void);
-void raid5_init (void);
 
 int md__init md_init (void)
 {
@@ -3617,18 +3611,6 @@
md_register_reboot_notifier(md_notifier);
raid_table_header = register_sysctl_table(raid_root_table, 1);
 
-#ifdef CONFIG_MD_LINEAR
-   linear_init ();
-#endif
-#ifdef CONFIG_MD_RAID0
-   raid0_init ();
-#endif
-#ifdef CONFIG_MD_RAID1
-   raid1_init ();
-#endif
-#ifdef CONFIG_MD_RAID5
-   raid5_init ();
-#endif
md_geninit();
return (0);
 }
--- ./drivers/md/raid5.c2000/11/29 04:16:29 1.2
+++ ./drivers/md/raid5.c2000/11/29 06:04:25 1.3
@@ -2352,19 +2352,16 @@
sync_request:   raid5_sync_request
 };
 
-int raid5_init (void)
+static int md__init raid5_init (void)
 {
return register_md_personality (RAID5, raid5_personality);
 }
 
-#ifdef MODULE
-int init_module (void)
-{
-   return raid5_init();
-}
-
-void cleanup_module (void)
+static void raid5_exit (void)
 {
unregister_md_personality (RAID5);
 }
-#endif
+
+module_init(raid5_init);
+module_exit(raid5_exit);
+
--- ./drivers/md/linear.c   2000/11/29 05:45:04 1.1
+++ ./drivers/md/linear.c   2000/11/29 06:04:25 1.2
@@ -190,24 +190,16 @@
status: linear_status,
 };
 
-#ifndef MODULE
-
-void md__init linear_init (void)
-{
-   register_md_personality (LINEAR, linear_personality);
-}
-
-#else
-
-int init_module (void)
+static int md__init linear_init (void)
 {
-   return (register_md_personality (LINEAR, linear_personality));
+   return register_md_personality (LINEAR, linear_personality);
 }
 
-void cleanup_module (void)
+static void linear_exit (void)
 {
unregister_md_personality (LINEAR);
 }
 
-#endif
 
+module_init(linear_init);
+module_exit(linear_exit);
--- ./drivers/md/raid0.c2000/11/29 05:45:04 1.1
+++ ./drivers/md/raid0.c2000/11/29 06:04:25 1.2
@@ -333,24 +333,17 @@
status: raid0_status,
 };
 
-#ifndef MODULE
-
-void raid0_init (void)
-{
-   register_md_personality (RAID0, raid0_personality);
-}
-
-#else
-
-int init_module (void)
+static int md__init raid0_init (void)
 {
-   return (register_md_personality (RAID0, raid0_personality));
+   return register_md_personality (RAID0, raid0_personality);
 }
 
-void cleanup_module (void)
+static void raid0_exit (void)
 {
unregister_md_personality (RAID0);
 }
 
-#endif
+module_init(raid0_init);
+module_exit(raid0_exit);
+
 
--- ./drivers/md/raid1.c2000/11/29 05:45:04 1.1
+++ ./drivers/md/raid1.c2000/11/29 06:04:25 1.2
@@ -1882,19 +1882,16 @@
sync_request:   raid1_sync_request
 };
 
-int raid1_init (void)
+static int md__init raid1_init (void)
 {
return register_md_personality (RAID1, 

Re: raid1 resync problem ? (fwd)

2000-11-28 Thread Neil Brown

On Tuesday November 28, [EMAIL PROTECTED] wrote:
 Hi, 
 
 I'm forwarding the message to you guys because I got no answer from Ingo
 
 Thanks

I would suggest always CCing to [EMAIL PROTECTED]  I have
taken the liberty of Ccing this reply there.

 
 -- Forwarded message --
 Date: Sat, 25 Nov 2000 14:21:28 -0200 (BRST)
 From: Marcelo Tosatti [EMAIL PROTECTED]
 To: Ingo Molnar [EMAIL PROTECTED]
 Subject: raid1 resync problem ? 
 
 
 Hi Ingo, 
 
 While reading raid1 code from 2.4 kernel, I've found the following part on
 raid1_make_request function:
 
 ...
 spin_lock_irq(&conf->segment_lock);
 wait_event_lock_irq(conf->wait_done,
 bh->b_rsector < conf->start_active ||
 bh->b_rsector >= conf->start_future,
 conf->segment_lock);
 if (bh->b_rsector < conf->start_active)
 conf->cnt_done++;
 else {
 conf->cnt_future++;
 if (conf->phase)
 set_bit(R1BH_SyncPhase, &r1_bh->state);
 }
 spin_unlock_irq(&conf->segment_lock);
 ...
 
 
 If I understood correctly, bh->b_rsector is used to know if the sector
 number of the request being processed is not inside the resync range. 
 
 In case it is, it sleeps waiting for the resync daemon. Otherwise, it can
 send the operation to the lower level block device(s). 
 
 The problem is that the code does not check for the request length to know
 if the last sector of the request is smaller than conf->start_active. 
 
 For example, if we have conf->start_active = 1000, a write request with 8
 sectors and bh->b_rsector = 905 is allowed to be done. 3 blocks (1001,
 1002 and 1003) of this request are inside the resync range. 

The reason is subtle, but this cannot happen.
resync is always done in full pages. So (on intel) start_active will
always be a multiple of 8.  Also, b_size can be at most one page
(i.e. 4096 == 8 sectors) and b_rsector will be aligned to a multiple
of b_size.  Given this, if rsector < start_active, you can be certain
that rsector+(b_size>>9) <= start_active, so there isn't a problem and
your change is not necessary.   Adding a comment to the code to
explain this subtlety might be sensible though...
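
A worked check of that argument as a stand-alone sketch (plain C, not kernel
code; the asserts spell out the assumptions stated above):

#include <assert.h>

static void check(unsigned long start_active, unsigned long b_rsector,
                  unsigned long b_size)
{
    unsigned long nsect = b_size >> 9;   /* sectors per buffer, a power of two */

    assert(nsect >= 1 && nsect <= 8 && 8 % nsect == 0);
    assert(start_active % 8 == 0);       /* resync works in whole pages      */
    assert(b_rsector % nsect == 0);      /* buffer is aligned to its size    */

    /* both values are multiples of nsect, so "less than" implies
     * "at least nsect less than" */
    if (b_rsector < start_active)
        assert(b_rsector + nsect <= start_active);
}

int main(void)
{
    check(8000, 7992, 4096);   /* request ends exactly at the window edge */
    check(8000, 7936, 1024);   /* smaller buffer, still cannot straddle   */
    return 0;
}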

NeilBrown


 
 If I haven't missed anything, we can easily fix it using the last sector
 (bh->b_rsector + (bh->b_size >> 9)) instead of the first sector when
 comparing with conf->start_active.
 
 Waiting for your comments. 
 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: raid1 resync problem ? (fwd)

2000-11-28 Thread Neil Brown

On Tuesday November 28, [EMAIL PROTECTED] wrote:
 
 snip
 
   If I understood correctly, bh-b_rsector is used to know if the sector
   number of the request being processed is not inside the resync range. 
   
   In case it is, it sleeps waiting for the resync daemon. Otherwise, it can
   send the operation to the lower level block device(s). 
   
   The problem is that the code does not check for the request length to know
   if the last sector of the request is smaller than conf-start_active. 
   
   For example, if we have conf-start_active = 1000, a write request with 8
   sectors and bh-b_rsector = 905 is allowed to be done. 3 blocks (1001,
   1002 and 1003) of this request are inside the resync range. 
  
  The reason is subtle, but this cannot happen.
  resync is always done in full pages. So (on intel) start_active will
  always be a multiple of 8.  Also, b_size can be at most one page
  (i.e. 4096 == 8 sectors) and b_rsector will be aligned to a multiple
  of b_size.  Given this, if rsector < start_active, you can be certain
  that rsector+(b_size>>9) <= start_active, so there isn't a problem and
  your change is not necessary.   Adding a comment to the code to
  explain this subtlety might be sensible though...
 
 This becomes a problem with kiobuf requests (I have a patch to make raid1
 code kiobuf-aware).
 
 With kiobufs, it's possible (right now) to have requests up to 64kb, so
 the current code is problematic. 
 

In that case, your change sounds quite reasonable and should be
included in your patch to make raid1 kiobuf-aware.

Is there a URL to this patch? Can I look at it?

NeilBrown

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: we are finding that parity writes are half of all writes when writing 50mb files

2000-11-28 Thread Neil Brown

On Tuesday November 28, [EMAIL PROTECTED] wrote:
 On Tue, Nov 28, 2000 at 10:50:06AM +1100, Neil Brown wrote:
 However, there is only one "unplug-all-devices"(*) call in the API
 that a reader or write can make.  It is not possible to unplug a
 particular device, or better still, to unplug a particular request.
 
 This isn't totally true. When we run out of requests on a certain
 device and we must startup the I/O to release some of them, we unplug
 only the _single_ device and we don't unplug-all-devices anymore.
 
 So during writes to disk (without using O_SYNC that isn't supported
 by 2.4.0-test11 anyways :) you never unplug-all-devices, but you only
 unplug finegrined at the harddisk level.

Thanks for these comments.  They helped me think more clearly about what
was going on, and as a result I have raid5 working even faster still,
though not quite as fast as I hope...

The raid5 device has a "stripe cache" where the stripes play a similar
role to the requests used in the elevator code (__make_request).
i.e. they gather together buffer_heads for requests that are most
efficiently processed at the same time.

When I run out of stripes, I need to wait for one to become free, but
first I need to unplug any underlying devices to make sure that
something *will* become free soon.  When I unplug those devices, I
have to call "run_task_queue(tq_disk)" (because that is the only
interface), and this unplugs the raid5 device too.  This substantially
reduces the effectiveness of the plugging that I had implemented.

To get around this artifact that I unplug whenever I need a stripe, I
changed the "get-free-stripe" code so that if it has to wait for a
stripe, it waits for 16 stripes.  This means that we should be able to
get up to 16 stripes all plugged together.
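
A rough sketch of that batching policy (hypothetical names and a bare
counter; the real code sleeps on a waitqueue inside raid5.c):

#define STRIPE_BATCH 16

struct toy_cache { int nr_free; };

/* called when a stripe is returned to the inactive list */
static void toy_release_stripe(struct toy_cache *c)
{
    c->nr_free++;
}

/* 1 when a waiter may proceed, 0 while it should keep waiting: once the
 * cache has run dry, insist on a whole batch so that the requests queued
 * behind the plug get a chance to be merged. */
static int toy_enough_free(const struct toy_cache *c, int ran_dry)
{
    return c->nr_free >= (ran_dry ? STRIPE_BATCH : 1);
}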

This has helped a lot and I am now getting dbench throughputs on 4K and
8K chunk sizes (still waiting for the rest of the results) that are
better than 2.2 ever gave me.
It still isn't as good as I hoped:
With 4K chunks the 3-drive throughput is significantly better than the
2-drive throughput.  With 8K it is now slightly less (instead of much
less).  But in 2.2, 3 drives with 8K chunks is better than 2 drives with
8K chunks.

What I really want to be able to do when I need a stripe but don't
have a free one is:

   1/ unplug any underlying devices
   2/ if 50% of my stripes are plugged, unplug this device.

(or some mild variation of that).  However with the current interface,
I cannot.

Still, I suspect I can squeeze a bit more out with the current
interface, and it will be enough for 2.4.  It will be fun working to
make the interfaces just right for 2.5

NeilBrown

 
 That isn't true for reads of course, for reads it's the highlevel FS/VM layer
 that unplugs the queue and it only knows about the run_task_queue(tq_disk) but
 Jens has patches to fix it too :).

I'll have to have a look... but not today.

 
 Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - md/Makefile - link order

2000-11-28 Thread Neil Brown


Linus,
 A couple of versions of this patch went into Alan's tree, but weren't
 quite right.  This one is minimal, but works.

 The problem is that with the tidy up of xor.o, it auto-initialises
 itself, instead of being called by raid.o, and so needs to be linked
 *before* md.o, as the initialiser for md.o may start up a raid5
 device that needs xor.

 This patch simply puts xor before md.  I would like to tidy this up
 further and have all the raid flavours auto-initialise, but there are
 issues that I have to clarify with the kbuild people before I do
 that.

 After compiling with this patch, 
   % objdump -t vmlinux | grep initcall.init
 contains:
c03345dc l O .initcall.init 0004 __initcall_calibrate_xor_block
c03345e0 l O .initcall.init 0004 __initcall_md_init
c03345e4 l O .initcall.init 0004 __initcall_md_run_setup

 in that order which convinces me that it gets the order right.

NeilBrown

--- ./drivers/md/Makefile   2000/11/29 03:46:13 1.1
+++ ./drivers/md/Makefile   2000/11/29 04:05:27 1.2
@@ -16,12 +16,16 @@
 obj-n  :=
 obj-   :=
 
-obj-$(CONFIG_BLK_DEV_MD)   += md.o
+# NOTE: xor.o must link *before* md.o so that auto-detect
+# of raid5 arrays works (and doesn't Oops).  Fortunately
+# they are both export-objs, so setting the order here
+# works.
 obj-$(CONFIG_MD_LINEAR)+= linear.o
 obj-$(CONFIG_MD_RAID0) += raid0.o
 obj-$(CONFIG_MD_RAID1) += raid1.o
 obj-$(CONFIG_MD_RAID5) += raid5.o xor.o
+obj-$(CONFIG_BLK_DEV_MD)   += md.o
 obj-$(CONFIG_BLK_DEV_LVM)  += lvm-mod.o
 
 # Translate to Rules.make lists.
 O_OBJS := $(filter-out $(export-objs), $(obj-y))
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - raid5.c - bad calculation

2000-11-28 Thread Neil Brown


Linus, 
 I sent this patch to Alan a little while ago, but after ac4, so I
 don't know if it went into his tree.

 There is a bit of code at the front of raid5_sync_request which
 calculates which block is the parity block for a given stripe.
 However, to convert from a block number (1K units) to a sector number
 it does <<2 instead of <<1 or *2, which leads to the wrong results.
 This can lead to data corruption, hanging, or an Oops.

 This patch fixes it (and allows my raid5 testing to run happily to
 completion). 
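
For reference, the intended conversion is just a doubling; a tiny
self-contained check of that arithmetic (plain C, not kernel code):

#include <assert.h>

int main(void)
{
    unsigned long block_nr = 100;                      /* 100 KB into the device   */
    assert((block_nr << 1) == block_nr * 2);           /* 200 sectors of 512 bytes */
    assert((block_nr << 1) * 512 == block_nr * 1024);  /* same byte offset         */
    return 0;
}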

NeilBrown


--- ./drivers/md/raid5.c2000/11/29 04:15:54 1.1
+++ ./drivers/md/raid5.c2000/11/29 04:16:29 1.2
@@ -1516,8 +1516,8 @@
raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
struct stripe_head *sh;
int sectors_per_chunk = conf->chunk_size >> 9;
-   unsigned long stripe = (block_nr<<2)/sectors_per_chunk;
-   int chunk_offset = (block_nr<<2) % sectors_per_chunk;
+   unsigned long stripe = (block_nr<<1)/sectors_per_chunk;
+   int chunk_offset = (block_nr<<1) % sectors_per_chunk;
int dd_idx, pd_idx;
unsigned long first_sector;
int raid_disks = conf-raid_disks;
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - md_boot - ifdef fix

2000-11-28 Thread Neil Brown



Linus,
 The are currently two ways to get md/raid devices configured at boot
 time.
 AUTODETECT_RAID finds bits of raid arrays from partition types and
automagically connects them together
 MD_BOOT allows bits of raid arrays to be explicitly described on the
boot line.

 Currently, MD_BOOT is not effective unless AUTODETECT_RAID is also
 enabled as both are implemented by md_run_setup, and md_run_setup is
 only called ifdef CONFIG_AUTODETECT_RAID.
 This patch fixes this irregularity.

NeilBrown

(patch against test12-pre2, as were the previous few, but I forget to mention).


--- ./drivers/md/md.c   2000/11/29 04:22:13 1.2
+++ ./drivers/md/md.c   2000/11/29 04:49:29 1.3
@@ -3853,7 +3853,7 @@
 #endif
 
 __initcall(md_init);
-#ifdef CONFIG_AUTODETECT_RAID
+#if defined(CONFIG_AUTODETECT_RAID) || defined(CONFIG_MD_BOOT)
 __initcall(md_run_setup);
 #endif
 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - md - MAX_REAL yields to MD_SB_DISKS

2000-11-28 Thread Neil Brown


Linus,
 md currently has two #defines which give a limit to the number of
 devices that can be in a given raid array:

  MAX_REAL (==12) dates back to the time before we had persistent
 superblocks, and mostly affects raid0
 
  MD_SB_DISKS (==27) is a characteristic of the newer persistent
 superblocks and says how many devices can be described in a
 superblock.

 Having the two is inconsistent and needlessly limits raid0 arrays.
 This patch replaces MAX_REAL in the few places that it occurs with
 MD_SB_DISKS. 

 Thanks to Gary Murakami [EMAIL PROTECTED] for prodding me to
 make this patch.

NeilBrown



--- ./include/linux/raid/md_k.h 2000/11/29 04:54:32 1.1
+++ ./include/linux/raid/md_k.h 2000/11/29 04:55:47 1.2
@@ -59,7 +59,6 @@
 #error MD doesnt handle bigger kdev yet
 #endif
 
-#define MAX_REAL 12/* Max number of disks per md dev */
 #define MAX_MD_DEVS  (1MINORBITS)/* Max number of md dev */
 
 /*
--- ./include/linux/raid/raid0.h2000/11/29 04:54:32 1.1
+++ ./include/linux/raid/raid0.h2000/11/29 04:55:47 1.2
@@ -9,7 +9,7 @@
unsigned long dev_offset;   /* Zone offset in real dev */
unsigned long size; /* Zone size */
int nb_dev; /* # of devices attached to the zone */
-   mdk_rdev_t *dev[MAX_REAL]; /* Devices attached to the zone */
+   mdk_rdev_t *dev[MD_SB_DISKS]; /* Devices attached to the zone */
 };
 
 struct raid0_hash
--- ./drivers/md/md.c   2000/11/29 04:49:29 1.3
+++ ./drivers/md/md.c   2000/11/29 04:55:47 1.4
@@ -3587,9 +3587,9 @@
 {
static char * name = "mdrecoveryd";

-   printk (KERN_INFO "md driver %d.%d.%d MAX_MD_DEVS=%d, MAX_REAL=%d\n",
+   printk (KERN_INFO "md driver %d.%d.%d MAX_MD_DEVS=%d, MD_SB_DISKS=%d\n",
MD_MAJOR_VERSION, MD_MINOR_VERSION,
-   MD_PATCHLEVEL_VERSION, MAX_MD_DEVS, MAX_REAL);
+   MD_PATCHLEVEL_VERSION, MAX_MD_DEVS, MD_SB_DISKS);
 
if (devfs_register_blkdev (MAJOR_NR, "md", md_fops))
{
@@ -3639,7 +3639,7 @@
unsigned long set;
int pers[MAX_MD_BOOT_DEVS];
int chunk[MAX_MD_BOOT_DEVS];
-   kdev_t devices[MAX_MD_BOOT_DEVS][MAX_REAL];
+   kdev_t devices[MAX_MD_BOOT_DEVS][MD_SB_DISKS];
 } md_setup_args md__initdata = { 0, };
 
 /*
@@ -3713,7 +3713,7 @@
pername="super-block";
}
devnames = str;
-   for (; iMAX_REAL  str; i++) {
+   for (; iMD_SB_DISKS  str; i++) {
if ((device = name_to_kdev_t(str))) {
md_setup_args.devices[minor][i] = device;
} else {
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - md.c - confusing message corrected

2000-11-28 Thread Neil Brown


Linus, 
  This is a resend of a patch that probably got lost a week or so ago.
  (It is also more grammatically correct).

  If md.c has two raid arrays that need to be resynced, and they share
  a physical device, then the two resyncs are serialised.  However the
  message printed says something about "overlapping" which confuses
  and worries people needlessly.  
  This patch improves the message.

NeilBrown


--- ./drivers/md/md.c   2000/11/29 04:21:37 1.1
+++ ./drivers/md/md.c   2000/11/29 04:22:13 1.2
@@ -3279,7 +3279,7 @@
if (mddev2 == mddev)
continue;
if (mddev2-curr_resync  match_mddev_units(mddev,mddev2)) {
-   printk(KERN_INFO "md: serializing resync, md%d has overlapping physical units with md%d!\n", mdidx(mddev), mdidx(mddev2));
+   printk(KERN_INFO "md: serializing resync, md%d shares one or more physical units with md%d!\n", mdidx(mddev), mdidx(mddev2));
serialize = 1;
break;
}
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Any experience with LSI SYM 53c1010 scsi controller??

2000-11-27 Thread Neil Brown



Hi,
 I am considering using an ASUS CUR-DLS mother board in a new
 NFS/RAID server, and wonder if anyone was any experience to report
 either with it, or with the Ultra-160 dual buss scsi controller that
 it has - the LSI SYM 53c1010.

 From what I can find in the kernel source, and from lsi logic's home
 page it is supported, but I would love to hear from someone who has
 used it.

 Thanks,
NeilBrown

http://www.asus.com.tw/products/Motherboard/pentiumpro/cur-dls/index.html
ftp://ftp.lsil.com/HostAdapterDrivers/linux/c8xx-driver/Readme

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: [BUG] reconstruction doesn't start

2000-11-27 Thread Neil Brown

On Monday November 27, [EMAIL PROTECTED] wrote:
 
 When md2 is finished then md1 is resynced. Shouldn't they do
 resync at the same time?
 
 I never saw "md: serializing resync,..." what I supected to get because
 md0 and md1 share the same physical disks.
 
 My findings:
The md driver in 2.4.0-test11-ac4 does ALL raid-1 resyncs serialized!!! 

Close.  All *reconstructions* are serialised.  All *resyncs* are not
synchronised.
Here *reconstruction* means when a failed disk has been replaced and data
and parity are reconstructed onto it.
*resync* means that after an unclean shutdown the parity is checked
and corrected if necessary.

This is an artifact of how the code was written.  It is not something
that "should be".  It is merely something that "is".

It is on my todo list to fix this, but not very high.

NeilBrown

 
 Can someone replicate this with their system?
 
 MfG / Regards
 Friedrich Lobenstock
 
 
 
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: compatability between patched 2.2 and 2.4?

2000-11-07 Thread Neil Brown

On Tuesday November 7, [EMAIL PROTECTED] wrote:
 
 I have a question regarding the diffrences between the 2.2+RAID-patch
 kernels and the 2.4-test kernels - I was wondering if there are any
 diffrences between them.
 
 For example, if I build systems with a 2.2.17+RAID and later install 2.4
 kernels on them will the trasition be seemless as far as RAID goes?
 
 Thx, marc

Transition should be fine - unless you are using a sparc.

But then for sparc, the 2.2 patch didn't really work properly unless
you hacked the superblock layout, so you probably know what you are
doing anyway.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH: raid1 - assorted bug fixes

2000-11-06 Thread Neil Brown



Linus,

 The following patch addresses a small number of bugs in raid1.c in
 2.4.0-test10.

 1/ A number of routines that are called from interrupt context used
 spin_lock_irq / spin_unlock_irq
   instead of the more appropriate
 spin_lock_irqsave( ,flags)   /  spin_unlock_irqrestore( ,flags)

   This can, and did, lead to deadlocks on an SMP system.

 2/ b_rsector and b_rdev are used in a couple of cases *after*
generic_make_request has been called.  If the underlying device
was, for example, RAID0, these fields would no longer have the
assumed values.  I have changed these cases to use b_blocknr
(scaled) and b_dev. 

This bug could affect correctness if raid1 is used over raid0 or
raid-linear or LVM.

 3/ In two cases, b_blocknr is calculated by *multiplying* b_rsector
by the sector-per-block count instead of *dividing* it.

This bug could affect correctness when restarting a read request
after a drive failure.
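
A quick stand-alone check of the conversion in point 3 (plain C, not kernel
code): with 4K buffers there are 8 sectors per block, so the block number is
the sector number divided by 8, not multiplied.

#include <assert.h>

int main(void)
{
    unsigned long b_rsector = 800;            /* sector number on the device */
    unsigned long sectors   = 4096 / 512;     /* sectors per block            */

    assert(b_rsector / sectors == 100);       /* correct block number         */
    assert(b_rsector * sectors == 6400);      /* what the buggy code computed */
    return 0;
}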

NeilBrown

--- ./drivers/md/raid1.c2000/11/07 02:14:25 1.1
+++ ./drivers/md/raid1.c2000/11/07 02:15:21 1.2
@@ -91,7 +91,8 @@
 
 static inline void raid1_free_bh(raid1_conf_t *conf, struct buffer_head *bh)
 {
-   md_spin_lock_irq(conf-device_lock);
+   unsigned long flags;
+   spin_lock_irqsave(conf-device_lock, flags);
while (bh) {
struct buffer_head *t = bh;
bh=bh-b_next;
@@ -103,7 +104,7 @@
conf-freebh_cnt++;
}
}
-   md_spin_unlock_irq(conf-device_lock);
+   spin_unlock_irqrestore(conf-device_lock, flags);
wake_up(conf-wait_buffer);
 }
 
@@ -182,10 +183,11 @@
r1_bh-mirror_bh_list = NULL;
 
if (test_bit(R1BH_PreAlloc, r1_bh-state)) {
-   md_spin_lock_irq(conf-device_lock);
+   unsigned long flags;
+   spin_lock_irqsave(conf-device_lock, flags);
r1_bh-next_r1 = conf-freer1;
conf-freer1 = r1_bh;
-   md_spin_unlock_irq(conf-device_lock);
+   spin_unlock_irqrestore(conf-device_lock, flags);
} else {
kfree(r1_bh);
}
@@ -229,14 +231,15 @@
 
 static inline void raid1_free_buf(struct raid1_bh *r1_bh)
 {
+   unsigned long flags;
struct buffer_head *bh = r1_bh-mirror_bh_list;
raid1_conf_t *conf = mddev_to_conf(r1_bh-mddev);
r1_bh-mirror_bh_list = NULL;

-   md_spin_lock_irq(conf-device_lock);
+   spin_lock_irqsave(conf-device_lock, flags);
r1_bh-next_r1 = conf-freebuf;
conf-freebuf = r1_bh;
-   md_spin_unlock_irq(conf-device_lock);
+   spin_unlock_irqrestore(conf-device_lock, flags);
raid1_free_bh(conf, bh);
 }
 
@@ -371,7 +374,7 @@
 {
struct buffer_head *bh = r1_bh-master_bh;
 
-   io_request_done(bh-b_rsector, mddev_to_conf(r1_bh-mddev),
+   io_request_done(bh-b_blocknr*(bh-b_size9), mddev_to_conf(r1_bh-mddev),
test_bit(R1BH_SyncPhase, r1_bh-state));
 
bh-b_end_io(bh, uptodate);
@@ -599,7 +602,7 @@
 
bh_req = r1_bh-bh_req;
memcpy(bh_req, bh, sizeof(*bh));
-   bh_req-b_blocknr = bh-b_rsector * sectors;
+   bh_req-b_blocknr = bh-b_rsector / sectors;
bh_req-b_dev = mirror-dev;
bh_req-b_rdev = mirror-dev;
/*  bh_req-b_rsector = bh-n_rsector; */
@@ -643,7 +646,7 @@
/*
 * prepare mirrored mbh (fields ordered for max mem throughput):
 */
-   mbh-b_blocknr= bh-b_rsector * sectors;
+   mbh-b_blocknr= bh-b_rsector / sectors;
mbh-b_dev= conf-mirrors[i].dev;
mbh-b_rdev   = conf-mirrors[i].dev;
mbh-b_rsector= bh-b_rsector;
@@ -1181,7 +1184,7 @@
struct buffer_head *bh1 = mbh;
mbh = mbh-b_next;
generic_make_request(WRITE, bh1);
-   md_sync_acct(bh1-b_rdev, bh1-b_size/512);
+   md_sync_acct(bh1-b_dev, bh1-b_size/512);
}
} else {
dev = bh-b_dev;
@@ -1406,7 +1409,7 @@
init_waitqueue_head(bh-b_wait);
 
generic_make_request(READ, bh);
-   md_sync_acct(bh-b_rdev, bh-b_size/512);
+   md_sync_acct(bh-b_dev, bh-b_size/512);
 
return (bsize  10);
 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Linux 2.4.0-test8 and swap/journaling fs on raid

2000-09-27 Thread Neil Brown

On Wednesday September 27, [EMAIL PROTECTED] wrote:
 I was just wondering if the issues with swap on a raid device and with using a
 journaling fs on a raid device had been fixed in the latest 2.4.0-test
 kernels?

Yes.  md in 2.4 doesn't do interesting things with the buffer cache,
so swap and journaling filesystems should have no issues with it.


 
 I've gone too soft over the last few years to read the raid code myself :-)
 
 Thanks in advance,
 
 Craig
 
 PS I might be able to make sense of the code, but my wife would kill me if I
 spent any more time on the computer.

So print it out and read it that way :-)  Following cross references
is a bit slow though...  maybe a source browser on a palm pilot so you
can still do it in the family room...

(At this point 5 people chime in and tell me about 3 different
whizz-bang packages which print C source with line numbers and cross
references and )

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/


