Re: Spare disk could not sleep / standby
On Tuesday March 8, [EMAIL PROTECTED] wrote: Neil Brown wrote: It is writes, but don't be scared. It is just super-block updates. In 2.6, the superblock is marked 'clean' whenever there is a period of about 20ms of no write activity. This increases the chance that a resync won't be needed after a crash. (unfortunately) the superblocks on the spares need to be updated too.

Ack, one of the cool things that a linux md array can do that others can't is imho that the disks can spin down when inactive. Granted, it's mostly for home users who want their desktop RAID to be quiet when it's not in use, and their basement multi-terabyte facility to use a minimum of power when idling, but anyway. Is there any particular reason to update the superblocks every 20 msecs when they're already marked clean?

It doesn't (well, shouldn't, and I don't think it does). Before the first write, they are all marked 'active'. Then after 20ms with no write, they are all marked 'clean'. Then before the next write they are all marked 'active'. As the event count needs to be updated every time the superblock is modified, the event count will be updated for every active-to-clean or clean-to-active transition. All the drives in an array must have the same value for the event count, so the spares need to be updated even though they, themselves, aren't exactly 'active' or 'clean'.

NeilBrown

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Spare disk could not sleep / standby
On Tuesday March 8, [EMAIL PROTECTED] wrote: Neil Brown wrote: Then after 20ms with no write, they are all marked 'clean'. Then before the next write they are all marked 'active'. As the event count needs to be updated every time the superblock is modified, the event count will be updated for every active-to-clean or clean-to-active transition.

So.. Sorry if I'm a bit slow here.. But what you're saying is: The kernel marks the partition clean when all writes have been flushed to disk. This change is propagated through MD, and when it is, it causes the event counter to rise, thus causing a write, thus marking the superblock active. 20 msecs later, the same scenario repeats itself. Is my perception of the situation correct?

No. Writing the superblock does not cause the array to be marked active. If the array is idle, the individual drives will be idle.

Seems like a design flaw to me, but then again, I'm biased towards hating this behaviour since I really like being able to put inactive RAIDs to sleep..

Hmmm... maybe I misunderstood your problem. I thought you were just talking about a spare not being idle when you thought it should be. Are you saying that your whole array is idle, but still seeing writes? That would have to be something non-md-specific, I think.

NeilBrown
Re: Creating RAID1 with missing - mdadm 1.90
On Saturday March 5, [EMAIL PROTECTED] wrote: What might the proper [or functional] syntax be to do this? I'm running 2.6.10-1.766-FC3, and mdadm 1.90.

It would help if you told us what you tried, as then we could possibly give a more focused answer. However:

mdadm --create /dev/md1 --level=raid1 --raid-devices=2 /dev/sda3 missing

might be the sort of thing you want.

NeilBrown

Thanks for the time. b-
Re: Raid-6 hang on write.
On Tuesday March 1, [EMAIL PROTECTED] wrote: Neil Brown wrote: Could you please confirm if there is a problem with 2.6.11-rc4-bk4-bk10 as reported, and whether it seems to be the same problem.

Ok.. are we all ready? I had applied your development patches to all my vanilla 2.6.11-rc4-* kernels. Thus they all exhibited the same problem in the same way as -mm1. *Smacks forehead against wall repeatedly*

Thanks for following through with this so we know exactly where the problem is ... and isn't. And admitting your careless mistake in public is a great example to all the rest of us who are too shy to do so - thanks :-)

Oh well, at least we now know about a bug in the -mm patches.

Yes, and it is very helpful to know. Thanks again.

NeilBrown
Re: Joys of spare disks!
On Wednesday March 2, [EMAIL PROTECTED] wrote: Is there any sound reason why this is not feasible? Is it just that someone needs to write the code to implement it? Exactly (just needs to be implemented). NeilBrown
Re: Raid-6 hang on write.
On Friday February 25, [EMAIL PROTECTED] wrote: Turning on debugging in raid6main.c and md.c makes it much harder to hit. So I'm assuming something timing related. raid6d -- md_check_recovery -- generic_make_request -- make_request -- get_active_stripe

Yes, there is a real problem here. I'll see if I can figure out the best way to remedy it... However I think you reported this problem against a non -mm kernel, and the path from md_check_recovery to generic_make_request only exists in -mm. Could you please confirm if there is a problem with 2.6.11-rc4-bk4-bk10 as reported, and whether it seems to be the same problem. Thanks,

NeilBrown
Re: [PATCH md 0 of 9] Introduction
On Friday February 18, [EMAIL PROTECTED] wrote: Would you recommend applying this package http://neilb.web.cse.unsw.edu.au/~neilb/patches/linux-devel/2.6/2005-02-18-00/patch-all-2005-02-18-00 to a 2.6.10 kernel?

No. I don't think it would apply. That patch is mostly experimental stuff. Only apply it if you want to experiment with the bitmap resync code.

NeilBrown
Re: [PATCH md 9 of 9] Optimise reconstruction when re-adding a recently failed drive.
On Thursday February 17, [EMAIL PROTECTED] wrote: NeilBrown wrote: When an array is degraded, bits in the intent-bitmap are never cleared. So if a recently failed drive is re-added, we only need to reconstruct the blocks that are still reflected in the bitmap. This patch adds support for this re-adding.

Hi there - If I understand this correctly, this means that: 1) if I had a raid1 mirror (for example) that has had no writes to it since a resync 2) a drive fails out, and some writes occur 3) when I re-add the drive, only the areas where the writes occurred would be re-synced? I can think of a bunch of peripheral questions around this scenario, and bad sectors / bad sector clearing, but I may not be understanding the basic idea, so I wanted to ask first.

You seem to understand the basic idea. I believe one of the motivators for this code (I didn't originate it) is when a raid1 has one device locally and one device over a network connection. If the network connection breaks, that device has to be thrown out. But when it comes back, we don't want to resync the whole array over the network. This functionality helps there (though there are a few other things needed before that scenario can work smoothly). You would only re-add a device if you thought it was OK. i.e. if it was a connection problem rather than a media problem, or if you had resolved any media issues.

NeilBrown
Re: 2.6.11-rc4 md loops on missing drives
On Tuesday February 15, [EMAIL PROTECTED] wrote: G'day all, I'm not really sure how it's supposed to cope with losing more disks than planned, but filling the syslog with nastiness is not very polite.

Thanks for the bug report. There are actually a few problems relating to resync/recovery when an array (raid 5 or 6) has lost too many devices. This patch should fix them.

NeilBrown

Make raid5 and raid6 robust against failure during recovery. Two problems are fixed here.

1/ If the array is known to require a resync (parity update), but there are too many failed devices, the resync cannot complete but will be retried indefinitely.

2/ If the array has too many failed drives to be usable and a spare is available, reconstruction will be attempted, but cannot work. This also is retried indefinitely.

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c        |   12 ++++++------
 ./drivers/md/raid5.c     |   13 +++++++++++++
 ./drivers/md/raid6main.c |   12 ++++++++++++
 3 files changed, 31 insertions(+), 6 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2005-02-16 11:25:25.0 +1100
+++ ./drivers/md/md.c	2005-02-16 11:25:31.0 +1100
@@ -3655,18 +3655,18 @@ void md_check_recovery(mddev_t *mddev)
 		/* no recovery is running.
 		 * remove any failed drives, then
-		 * add spares if possible
+		 * add spares if possible.
+		 * Spares are also removed and re-added, to allow
+		 * the personality to fail the re-add.
 		 */
-		ITERATE_RDEV(mddev,rdev,rtmp) {
+		ITERATE_RDEV(mddev,rdev,rtmp)
 			if (rdev->raid_disk >= 0 &&
-			    rdev->faulty &&
+			    (rdev->faulty || ! rdev->in_sync) &&
 			    atomic_read(&rdev->nr_pending)==0) {
 				if (mddev->pers->hot_remove_disk(mddev, rdev->raid_disk)==0)
 					rdev->raid_disk = -1;
 			}
-			if (!rdev->faulty && rdev->raid_disk >= 0 && !rdev->in_sync)
-				spares++;
-		}
+
 		if (mddev->degraded) {
 			ITERATE_RDEV(mddev,rdev,rtmp)
 				if (rdev->raid_disk < 0

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2005-02-16 11:25:25.0 +1100
+++ ./drivers/md/raid5.c	2005-02-16 11:25:31.0 +1100
@@ -1491,6 +1491,15 @@ static int sync_request (mddev_t *mddev,
 		unplug_slaves(mddev);
 		return 0;
 	}
+	/* if there is 1 or more failed drives and we are trying
+	 * to resync, then assert that we are finished, because there is
+	 * nothing we can do.
+	 */
+	if (mddev->degraded >= 1 && test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+		int rv = (mddev->size << 1) - sector_nr;
+		md_done_sync(mddev, rv, 1);
+		return rv;
+	}
 	x = sector_nr;
 	chunk_offset = sector_div(x, sectors_per_chunk);
@@ -1882,6 +1891,10 @@ static int raid5_add_disk(mddev_t *mddev
 	int disk;
 	struct disk_info *p;
 
+	if (mddev->degraded > 1)
+		/* no point adding a device */
+		return 0;
+
 	/*
 	 * find the disk ...
 	 */

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~	2005-02-16 11:25:25.0 +1100
+++ ./drivers/md/raid6main.c	2005-02-16 11:25:31.0 +1100
@@ -1650,6 +1650,15 @@ static int sync_request (mddev_t *mddev,
 		unplug_slaves(mddev);
 		return 0;
 	}
+	/* if there are 2 or more failed drives and we are trying
+	 * to resync, then assert that we are finished, because there is
+	 * nothing we can do.
+	 */
+	if (mddev->degraded >= 2 && test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
+		int rv = (mddev->size << 1) - sector_nr;
+		md_done_sync(mddev, rv, 1);
+		return rv;
+	}
 	x = sector_nr;
 	chunk_offset = sector_div(x, sectors_per_chunk);
@@ -2048,6 +2057,9 @@ static int raid6_add_disk(mddev_t *mddev
 	int disk;
 	struct disk_info *p;
 
+	if (mddev->degraded > 2)
+		/* no point adding a device */
+		return 0;
 	/*
 	 * find the disk ...
 	 */
Re: Problem with Openmosix
On Monday February 14, [EMAIL PROTECTED] wrote: Hi, Neil... Hi.

I used the MD driver two years ago with Debian, and it ran perfectly. Great!

The machine boots the new kernel and runs Ok... but... if I (or another process) make a change/write to the raid md system, the computer crashes with the message: hdh: Drive not ready for command. (hdh is the mirror raid1 for the hdf disk).

I cannot help thinking that maybe the Drive is not ready for the command. i.e. it isn't an md problem. It isn't an openmosix problem. It is a drive hardware problem, or maybe an IDE controller problem. Can you try a different drive? Can you try just putting a filesystem on that drive alone and see if it works?

NeilBrown
RE: [Bugme-new] [Bug 4211] New: md configuration destroys disk GPT label
On Monday February 14, [EMAIL PROTECTED] wrote: Maybe I am confused, but if you use the whole disk, I would expect the whole disk could be over-written! What am I missing?

I second that. Once you do anything to a whole disk, whether make an md array out of it, or mkfs it or anything else, you can kiss any partitioning goodbye.

Maybe what you want to do is make an md array and then partition that. In 2.6 you can do that directly. In 2.4 you would need to use LVM to partition the array.

NeilBrown
ANNOUNCE: mdadm 1.9.0 - A tool for managing Soft RAID under Linux
I am pleased to announce the availability of mdadm version 1.9.0

It is available at http://www.cse.unsw.edu.au/~neilb/source/mdadm/ and http://www.{countrycode}.kernel.org/pub/linux/utils/raid/mdadm/ as a source tar-ball and (at the first site) as an SRPM, and as an RPM for i386.

mdadm is a tool for creating, managing and monitoring device arrays using the md driver in Linux, also known as Software RAID arrays.

Release 1.9.0 adds:
- Fix rpm build problem (stray %)
- Minor manpage updates
- Change dirty status to active as it was confusing people.
- --assemble --auto recognises 'standard' names and insists on using the appropriate major/minor numbers for them.
- Remove underscore from partition names, so partitions of foo are foo1, foo2 etc (unchanged) and partitions of f00 are f00p1, f00p2 etc rather than f00_p1...
- Use major, minor, makedev macros instead of MAJOR, MINOR, MKDEV so that large device numbers work on 2.6 (providing you have glibc 2.3.3 or later).
- Add some missing closes of open file descriptors.
- Reread /proc/partitions for every array assembled when using it to find devices, rather than only once.
- Make mdadm -Ss stop stacked devices properly, by reversing the order in which arrays are stopped.
- Improve some error messages.
- Allow device name to appear before first option, so e.g. mdadm /dev/md0 -A /dev/sd[ab] works.
- Assume '-Q' if just a device is given, rather than being silent.

This is based on 1.8.0 and *not* on 1.8.1, which was meant to be a pre-release for the upcoming 2.0.0. The next prerelease will have a more obvious name.

Development of mdadm is sponsored by [EMAIL PROTECTED]: The School of Computer Science and Engineering at The University of New South Wales

NeilBrown 04 February 2005
Re: ANNOUNCE: mdadm 1.9.0 - A tool for managing Soft RAID under Linux
On Friday February 4, [EMAIL PROTECTED] wrote: Neil Brown wrote: Release 1.9.0 adds: ... - --assemble --auto recognises 'standard' names and insists on using the appropriate major/minor numbers for them.

Is this the problem I encountered when I added auto=md to my mdadm.conf file?

Probably.

It caused all sorts of problems - which were recoverable, fortunately. I ended up putting a '/sbin/MAKEDEV md' into /etc/rc.sysinit just before the call to mdadm, but that creates all the md devices, not just those that are needed. Will this new version allow me to remove this line in rc.sysinit again and put the 'auto=md' back into mdadm.conf?

I think so, yes. It is certainly worth a try and I would appreciate success or failure reports.

NeilBrown
Re: Change preferred minor number of an md device?
On Monday January 31, [EMAIL PROTECTED] wrote: Hi to all, md gurus! Is there a way to edit the preferred minor of a stopped device?

mdadm --assemble /dev/md0 --update=super-minor /dev/ will assemble the array and update the preferred minor to 0 (from /dev/md0). However this won't work for you as you already have a /dev/md0 running...

Alternatively, is there a way to create a raid1 device specifying the preferred minor number md0, but activating it provisionally as a different minor, say md5? An md0 is already running, so mdadm --create /dev/md0 fails... I have to dump my /dev/md0 to a different disk (/dev/md5), but when I boot from the new disk, I want the kernel to automatically detect the device as /dev/md0.

If you are running 2.6, then you just need to assemble it as /dev/md0 once and that will automatically update the superblock. You could do this with kernel parameters of raid=noautodetect md=0,/dev/firstdrive,/dev/seconddrive

NeilBrown
Re: /dev/md* Device Files
On Wednesday January 26, [EMAIL PROTECTED] wrote: A useful trick I discovered yesterday: Add --auto to your mdadm commandline and it will create the device for you if it is missing :) Well, it seems that this machine is using the udev scheme for managing device files. I didn't realize this as udev is new to me, but I probably should have mentioned the kernel version (2.6.8) I was using. So I need to research udev and how one causes devices to be created, etc. Beware udev has an understanding of how device files are meant to work which is quite different from how md actually works. udev thinks that devices should appear in /dev after the device is actually known to exist in the kernel. md needs a device to exist in /dev before the kernel can be told that it exists. This is one of the reasons that --auto was added to mdadm - to bypass udev. NeilBrown
RE: Software RAID 0+1 with mdadm.
On Wednesday January 26, [EMAIL PROTECTED] wrote: This bug that's fixed in 1.9.0, is it a bug in creating the array? i.e. do we need to use 1.9.0 to create the array. I'm looking to do the same but my bootdisk currently only has 1.7.something on it. Do I need to make a custom bootcd with 1.9.0 on it?

This issue that will be fixed in 1.9.0 has nothing to do with creating the array. It is only relevant for stacked arrays (e.g. a raid0 made out of 2 or more raid1 arrays), and only if you are using mdadm --assemble --scan (or similar) to assemble your arrays, and you specify the devices to scan in mdadm.conf as DEVICE partitions (i.e. don't list actual devices, just say to get them from the list of known partitions). So, no: no need for a custom bootcd.

NeilBrown
Re: Software RAID 0+1 with mdadm.
On Tuesday January 25, [EMAIL PROTECTED] wrote: Been trying for days to get a software RAID 0+1 setup. This is on SuSe 9.2 with kernel 2.6.8-24.11-smp x86_64. I am trying to set up a RAID 0+1 with 4 250gb SATA drives. I do the following:

mdadm --create /dev/md1 --level=0 --chunk=4 --raid-devices=2 /dev/sdb1 /dev/sdc1
mdadm --create /dev/md2 --level=0 --chunk=4 --raid-devices=2 /dev/sdd1 /dev/sde1
mdadm --create /dev/md0 --level=1 --chunk=4 --raid-devices=2 /dev/md1 /dev/md2

This all works fine and I can mkreiserfs /dev/md0 and mount it. If I then reboot, /dev/md1 and /dev/md2 will show up in /proc/mdstat but not /dev/md0. So I created an /etc/mdadm.conf like so to see if this would work:

DEVICE partitions
DEVICE /dev/md*
ARRAY /dev/md2 level=raid0 num-devices=2 UUID=5e6efe7d:6f5de80b:82ef7843:148cd518 devices=/dev/sdd1,/dev/sde1
ARRAY /dev/md1 level=raid0 num-devices=2 UUID=e81e74f9:1cf84f87:7747c1c9:b3f08a81 devices=/dev/sdb1,/dev/sdc1
ARRAY /dev/md0 level=raid1 num-devices=2 devices=/dev/md2,/dev/md1

Everything seems ok after boot. But again no /dev/md0 in /proc/mdstat. But then if I do a mdadm --assemble --scan it will then load /dev/md0.

My guess is that you are (or SuSE is) relying on autodetect to assemble the arrays. Autodetect cannot assemble an array made of other arrays, just an array made of partitions. If you disable the autodetect stuff and make sure mdadm --assemble --scan is in a boot-script somewhere, it should just work.

Also, you don't really want the devices=/dev/sdd1... entries in mdadm.conf. They tell mdadm to require the devices to have those names. If you add or remove scsi drives at all, the names can change. Just rely on the UUID.

Also do I need to create partitions? Or can I set up the whole drives as the array?

You don't need partitions.

I have since upgraded to mdadm 1.8 and set up a RAID10. However I need something that is production worthy. Is a RAID10 something I could rely on as well? Also under a RAID10, how do you tell it which drives you want mirrored?

raid10 is 2.6 only, but should be quite stable. You cannot tell it which drives to mirror because you shouldn't care. You just give it a bunch of identical drives and let it put the data where it wants. If you really want to care (and I cannot imagine why you would - all drives in a raid10 are likely to get similar load) then you have to build it by hand - a raid0 of multiple raid1s.

NeilBrown
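Putting Neil's two suggestions together, the poster's mdadm.conf could be reduced to something like the sketch below. This is illustrative only: the md1/md2 UUIDs are the ones quoted in the original message, while the md0 UUID was never shown, so the value here is a placeholder to be replaced with the UUID reported by mdadm --detail /dev/md0:

```
DEVICE partitions
DEVICE /dev/md*
ARRAY /dev/md1 level=raid0 num-devices=2 UUID=e81e74f9:1cf84f87:7747c1c9:b3f08a81
ARRAY /dev/md2 level=raid0 num-devices=2 UUID=5e6efe7d:6f5de80b:82ef7843:148cd518
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx
```

With the devices= entries removed, mdadm identifies each member by its superblock UUID, so renumbered scsi devices no longer break assembly.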
Re: migrating raid-1 to different drive geometry ?
On Monday January 24, [EMAIL PROTECTED] wrote: how can the existing raid setup be moved to the new disks without data loss ? I guess it must be something like this: 1) physically remove first old drive 2) physically add first new drive 3) re-create partitions on new drive 4) run raidhotadd for each partition 5) wait until all partitions synced 6) repeat with second drive

Sounds good.

the big question is: since the drive geometry will definitely be different between the old 60GB and new 80GB drive(s), how do the new partitions have to be created on the new drive ? - do they have to have exactly the same amount of blocks ?

No.

- may they be bigger ?

Yes (they cannot be smaller). However making the partitions bigger will not make the arrays bigger. If you are using a recent 2.6 kernel and mdadm 1.8.0, you can grow the array with mdadm --grow /dev/mdX --size=max

You will then need to convince the filesystem in the array to make use of the extra space. Many filesystems do support such growth. Some even support on-line growth.

NeilBrown
Re: migrating raid-1 to different drive geometry ?
On Tuesday January 25, [EMAIL PROTECTED] wrote: Neil Brown wrote: If you are using a recent 2.6 kernel and mdadm 1.8.0, you can grow the array with mdadm --grow /dev/mdX --size=max Neil, Is this just for RAID1? OR will it work for RAID5 too? --grow --size=max should work for raid 1,5,6. NeilBrown
Re: raid 01 vs 10
On Monday July 9, [EMAIL PROTECTED] wrote: I was wondering what people thought of using raid 0+1 (a mirrored array of raid0 stripes) vs. raid 1+0 (a raid0 array of mirrored disks). It seems that either can sustain at least one drive failure and the performance should be similar. Are there strong reasons for using one over the other?

All other things being equal, raid 1+0 is usually better. It can withstand a greater variety of 2 disc failures, and the separate arrays can rebuild in parallel after an unclean shutdown, thus returning you to full redundancy more quickly.

But sometimes other things are not equal. If you don't have uniform drive sizes, you might want to raid0 assorted drives together to create two similar sized sets to raid1. I recall once someone suggesting that with certain cabling geometries it was better to use 0+1 in cases of cable failure, but I cannot remember, or work out, how that might have been.

NeilBrown
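Neil's point that 1+0 withstands a greater variety of two-disc failures can be checked by brute force. The Python sketch below assumes a four-drive array with mirror pairs (0,1) and (2,3); the layout and helper names are illustrative, not md code:

```python
from itertools import combinations

def survives_10(failed):
    # raid1+0: a stripe over mirror pairs (0,1) and (2,3);
    # every mirror pair needs at least one surviving disk
    pairs = [(0, 1), (2, 3)]
    return all(any(d not in failed for d in p) for p in pairs)

def survives_01(failed):
    # raid0+1: a mirror of stripes (0,1) and (2,3);
    # at least one stripe must be fully intact
    stripes = [(0, 1), (2, 3)]
    return any(all(d not in failed for d in s) for s in stripes)

two_disk = list(combinations(range(4), 2))
ok10 = sum(survives_10(set(f)) for f in two_disk)
ok01 = sum(survives_01(set(f)) for f in two_disk)
print(f"raid1+0 survives {ok10} of {len(two_disk)} two-disk failures")  # 4 of 6
print(f"raid0+1 survives {ok01} of {len(two_disk)} two-disk failures")  # 2 of 6
```

With four drives, 1+0 survives four of the six possible double failures, 0+1 only two, which matches the "greater variety of 2 disc failures" claim.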
PATCH/RFC - partitioning of md devices
Linus, I wonder if you would consider applying, or commenting on, this patch. It adds support for partitioning md devices. In particular, a new major device is created (name==mdp, number assigned dynamically) which provides for 15 partitions on each of the first 16 md devices. I understand that a more uniform approach to partitioning might get introduced in 2.5, but this seems the best approach for 2.4.

This is particularly useful if you want to have a mirrored boot drive, rather than two drives with lots of mirrored partitions. It is also useful for supporting what I call winware raid, which is the raid-controller equivalent of winmodems - minimal hardware and most of the support done in software.

Among the things that this patch does are:

1/ Tidy up some terminology. Currently there is a one-to-one mapping between minor numbers and raid arrays or units, so the term minor is used when referring either to the real minor number or to a unit. This patch introduces the term unit to identify which particular array is being referred to, and keeps minor just for when a minor device number is really implied.

2/ When reporting the geometry of a partitioned raid1 array, the geometry of the underlying device is reported. For all other arrays the 2x4xLARGE geometry is maintained.

3/ The hardsectsize of partitions in a RAID5 array is set to the PAGESIZE because raid5 doesn't cope well with receiving requests with different blocksizes.

4/ The new device reports a name of md (via hd_struct->major_name) so partitions look like mda3 or md/disc0/part3, but registers the name mdp so that /proc/devices shows the major number next to mdp.

5/ Provides ioctls for re-reading the partition table and setting partition table information.
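The unit/partition split the patch builds on (16 minors per array: one whole device plus 15 partitions) can be sketched in a few lines of Python. This is illustrative only; MDP_MINOR_SHIFT = 4 is taken from the patch, but the helper name is made up:

```python
MDP_MINOR_SHIFT = 4  # from the patch: 16 minors (whole device + 15 partitions) per unit

def mdp_unit_and_part(minor):
    """Split an mdp minor number into (array unit, partition number).

    Partition 0 is the whole device and 1..15 are partitions,
    mirroring the patch's MINOR(dev) >> MDP_MINOR_SHIFT logic.
    """
    return minor >> MDP_MINOR_SHIFT, minor & ((1 << MDP_MINOR_SHIFT) - 1)

# e.g. minor 19 belongs to unit 1, partition 3 (like "md/disc1/part3")
print(mdp_unit_and_part(19))  # (1, 3)
```

This is also why MAX_MDP_DEVS is (1<<(MINORBITS-MDP_MINOR_SHIFT)): four of the minor bits are spent on the partition number.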
--- ./include/linux/raid/md.h	2001/07/01 22:59:38	1.1
+++ ./include/linux/raid/md.h	2001/07/01 22:59:47	1.2
@@ -61,8 +61,11 @@
 extern int md_size[MAX_MD_DEVS];
 extern struct hd_struct md_hd_struct[MAX_MD_DEVS];
 
-extern void add_mddev_mapping (mddev_t *mddev, kdev_t dev, void *data);
-extern void del_mddev_mapping (mddev_t *mddev, kdev_t dev);
+extern int mdp_size[MAX_MDP_DEVS<<MDP_MINOR_SHIFT];
+extern struct hd_struct mdp_hd_struct[MAX_MDP_DEVS<<MDP_MINOR_SHIFT];
+
+extern void add_mddev_mapping (mddev_t *mddev, int unit, void *data);
+extern void del_mddev_mapping (mddev_t *mddev, int unit);
 extern char * partition_name (kdev_t dev);
 extern int register_md_personality (int p_num, mdk_personality_t *p);
 extern int unregister_md_personality (int p_num);
--- ./include/linux/raid/md_k.h	2001/07/01 22:59:38	1.1
+++ ./include/linux/raid/md_k.h	2001/07/01 22:59:47	1.2
@@ -15,6 +15,7 @@
 #ifndef _MD_K_H
 #define _MD_K_H
 
+
 #define MD_RESERVED       0UL
 #define LINEAR            1UL
 #define STRIPED           2UL
@@ -60,7 +61,10 @@
 #error MD doesnt handle bigger kdev yet
 #endif
 
+#define MDP_MINOR_SHIFT 4
+
 #define MAX_MD_DEVS  (1<<MINORBITS)	/* Max number of md dev */
+#define MAX_MDP_DEVS (1<<(MINORBITS-MDP_MINOR_SHIFT)) /* Max number of md dev */
 
 /*
  * Maps a kdev to an mddev/subdev. How 'data' is handled is up to
@@ -73,11 +77,17 @@
 extern dev_mapping_t mddev_map [MAX_MD_DEVS];
 
+extern int mdp_major;
 static inline mddev_t * kdev_to_mddev (kdev_t dev)
 {
-	if (MAJOR(dev) != MD_MAJOR)
+	int unit=0;
+	if (MAJOR(dev) == MD_MAJOR)
+		unit = MINOR(dev);
+	else if (MAJOR(dev) == mdp_major)
+		unit = MINOR(dev) >> MDP_MINOR_SHIFT;
+	else
 		BUG();
-	return mddev_map[MINOR(dev)].mddev;
+	return mddev_map[unit].mddev;
 }
 
 /*
@@ -191,7 +201,7 @@
 {
 	void				*private;
 	mdk_personality_t		*pers;
-	int				__minor;
+	int				__unit;
 	mdp_super_t			*sb;
 	int				nb_dev;
 	struct md_list_head		disks;
@@ -248,13 +258,34 @@
  */
 static inline int mdidx (mddev_t * mddev)
 {
-	return mddev->__minor;
+	return mddev->__unit;
+}
+
+static inline int mdminor (mddev_t *mddev)
+{
+	return mdidx(mddev);
+}
+
+static inline int mdpminor (mddev_t *mddev)
+{
+	return mdidx(mddev) << MDP_MINOR_SHIFT;
+}
+
+static inline kdev_t md_kdev (mddev_t *mddev)
+{
+	return MKDEV(MD_MAJOR, mdminor(mddev));
 }
 
-static inline kdev_t mddev_to_kdev(mddev_t * mddev)
+static inline kdev_t mdp_kdev (mddev_t *mddev, int part)
 {
-	return MKDEV(MD_MAJOR, mdidx(mddev));
+	return MKDEV(mdp_major, mdpminor(mddev)+part);
 }
+
+#define foreach_part(tmp,mddev)	\
+	if (mdidx(mddev) < MAX_MDP_DEVS)	\
+		for(tmp = mdpminor(mddev);	\
+		    tmp < mdpminor(mddev)+(1<<MDP_MINOR_SHIFT);	\
+		    tmp++)
 extern
Re: Failed disk triggers raid5.c bug?
On Monday June 25, [EMAIL PROTECTED] wrote: Is there any way for the RAID code to be smarter when deciding about those event counters? Does it have any chance (theoretically) to _know_ that it shouldn't use the drive with event count 28? My current thinking is that once a raid array becomes unusable - in the case of raid5, this means two failures - the array should immediately be marked read-only, including the superblocks. Then if you ever manage to get enough drives together to form a working array, it will start working again, and if not, it won't really matter whether the superblock was updated to not. And even if that can't be done automatically, what about a small utility for the admin where he can give some advise to support the RAID code on those decisions? Will mdctl have this functionality? That would be great! mdctl --assemble will have a --force option to tell it to ignore event numbers and assemble the array anyway. This could result in data corruption if you include an old disc, but would be able to get you out of a tight spot. Ofcourse, once the above change goes into the kernel it shouldn't be necessary. Hm, does the RAID code disable a drive on _every_ error condition? Isn't there a distinction between, let's say, soft errors and hard errors? (I have to admit I don't know the internals of Linux device drivers enough to answer that question) Shouldn't the RAID code leave a drive which reports soft errors in the array and disable drives with hard errors only? A Linux block device doesn't report soft errors. There is either success or failure. The driver for the disc drive should retry any soft errors and only report an error up through the block-device layer when it is definately hard. Arguably the RAID layer should catch read errors and try to get the data from elsewhere and then re-write over the failed read, just incase it was a single block error. But a write error should always be fatal and fail the drive. 
I cannot think of any other reasonable approach. In that case, the filesystem might have been corrupt, but the array would have been re-synced automatically, wouldn't it? Yes, and it would have been if it hadn't collapsed in a heap while trying :-( But why did the filesystem ask for a block that was out of range? This is the part that I cannot fathom. It would seem as though the filesystem got corrupt somehow. Maybe an indirect block got replaced with garbage, and ext2fs believed the indirect block and went seeking way off the end of the array. But I don't know how the corruption happened. Perhaps the read errors from the drive triggered that problem? They shouldn't do, but seeing I don't know where the corruption came from, and I'm not even 100% confident that there was corruption, maybe they could. The closest I can come to a workable scenario is that maybe some parity block had the wrong data. Normally this wouldn't be noticed, but when you have a failed drive you have to use the parity to calculate the value of a missing block, and bad parity would make this block bad. But I cannot imagine how you would get a bad parity block. After any unclean shutdown the parity should be recalculated. NeilBrown - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED]
Re: Mounting very old style raid on a recent machine?
On Tuesday June 26, [EMAIL PROTECTED] wrote: Hi, I currently have to salvage data from an ancient box that looks like it ran kernel 2.0.35. However, the system on that disk is corrupted and won't boot any more (at least not on today's hardware). It looks like main data is on a RAID. /etc/mdtab: |/dev/md0 linear /dev/hda3 /dev/hdb2 Can I access that RAID from a current system running kernel 2.2 or 2.4? Do I have to build a new 2.0 kernel? What type of raidtools do I need to activate that RAID? You should be able to access this just fine from a 2.2 kernel using raidtools-0.41 from http://www.kernel.org/pub/linux/daemons/raid/ If you need to use 2.4, then you should still be able to access it using raidtools 0.90 from http://www.kernel.org/pub/linux/daemons/raid/alpha/ in this case you would need an /etc/raidtab like:

raiddev /dev/md0
	raid-level		linear
	nr-raid-disks		2
	persistent-superblock	0
	device			/dev/hda3
	raid-disk		0
	device			/dev/hdb2
	raid-disk		1

Note that this is pure theory. I have never actually done it myself. It should be quite safe to experiment. You are unlikely to corrupt anything if you don't do anything outrageously silly like telling it that it is a raid1 or raid5 array. Note: the persistent-superblock 0 is fairly important. These older arrays did not have any raid-superblock on the device. You want to make sure you don't accidentally write one and so corrupt data. I would go for a 2.2 kernel, raidtools 0.41 and the command: mdadd -r -pl /dev/md0 /dev/hda3 /dev/hdb2 NeilBrown Any hints will be appreciated. Greetings Marc -- - Marc Haber | I don't trust Computers.
They lose things. -- Winona Ryder | Mailadresse im Header | Karlsruhe, Germany | Fon: +49 721 966 32 15 | Fax: +49 721 966 31 29 | Nordisch by Nature | How to make an American Quilt
Re: PATCH - raid5 performance improvement - 3 of 3
On Sunday June 24, [EMAIL PROTECTED] wrote: Hi, We used to (long ago, 2.2.x), whenever we got a write request for some buffer, search the buffer cache to see if additional buffers which belong to that particular stripe are dirty, and then schedule them for writing as well, in an attempt to write full stripes. That resulted in a huge sequential write performance improvement. If such an approach is still possible today, it is preferable to delaying the writes It is not still possible, at least not usefully so. In fact, it is also true that it is probably not preferable. Since about 2.3.7, filesystem data has not, by-and-large, been stored in the buffer cache. It is only stored in the page cache. So were raid5 to go looking in the buffer cache it would be unlikely to find anything. But there are other problems. The cache snooping only works if the direct client of raid5 is a filesystem that stores data in the buffer cache. If the filesystem is an indirect client, via LVM for example, or even via a RAID0 array, then raid5 would not be able to look in the right buffer cache, and so would find nothing. This was the case in 2.2. If you tried an LVM over RAID5 in 2.2, you wouldn't get good write speed. You also would probably get data corruption while the array was re-syncing, but that is a separate issue. The current solution is much more robust. It cares nothing about the way the raid5 array is used. Also, while the handling of stripes is delayed, I don't believe that this would actually show as a measurable increase in latency. The effect is really to have requests spend more time on a higher level queue, and less time on a lower level queue. The total time on queues should normally be the same or less (due to improved throughput) or only very slightly more in pathological cases.
NeilBrown for the partial buffer while hoping that the rest of the buffers in the stripe would come as well, since it both eliminates the additional delay, and doesn't depend on the order in which the buffers are flushed from the much bigger memory buffers to the smaller stripe cache. I think the ideal solution would be to have the filesystem write data in two stages, much like Unix apps can. As soon as a buffer is dirtied (or more accurately, as soon as the filesystem is happy for the data to be written), it is passed on with a WRITE_AHEAD request. The driver is free to do what it likes, including ignore this. Later, at a time corresponding to fsync or close maybe, or when memory is tight, the filesystem can send the buffer down with a WRITE request which says please write this *now*. RAID5 could then gather all the write_ahead requests into a hash table (not unlike the old buffer cache), and easily find full stripes for writing. But that is not going to happen in 2.4. NeilBrown Cheers, Gadi
Re: Failed disk triggers raid5.c bug?
On Sunday June 24, [EMAIL PROTECTED] wrote: Hi! Neil Brown wrote: On Thursday June 14, [EMAIL PROTECTED] wrote: Dear All I've just had a disk (sdc) fail in my raid5 array (sdb sdc sdd), Great! A real live hardware failure! It is always more satisfying to watch one of those than to have to simulate them all the time!! Unless of course they are fatal... not the case here it seems. Well, here comes a _real_ fatal one... And a very detailed report it was, thanks. I'm not sure that you want to know this, but it looks like you might have been able to recover your data, though it is only a might.

Jun 19 09:18:23 wien kernel: (read) scsi/host0/bus0/target0/lun0/part4's sb offset: 16860096 [events: 0024]
Jun 19 09:18:23 wien kernel: (read) scsi/host0/bus0/target1/lun0/part4's sb offset: 16860096 [events: 0024]
Jun 19 09:18:23 wien kernel: (read) scsi/host0/bus0/target2/lun0/part4's sb offset: 16860096 [events: 0023]
Jun 19 09:18:23 wien kernel: (read) scsi/host0/bus0/target3/lun0/part4's sb offset: 16860096 [events: 0028]

The reason that this array couldn't restart was that the 4th drive had the highest event count and it was alone in this. It didn't even have any valid data!! Had you unplugged this drive and booted, it would have tried to assemble an array out of the first two (event count 24). This might have worked (though it might not, read on). Alternately, you could have created a raidtab which said that the third drive was failed, and then run mkraid... mdctl, when it is finished, should be able to make this all much easier. But what went wrong? I don't know the whole story but: - On the first error, the drive was disabled and reconstruction was started. - On the second error, the reconstruction was inappropriately interrupted. This is an error that I will have to fix in 2.4. However it isn't really a fatal error.
- Things were then going fairly OK, though noisy, until:

Jun 19 09:10:07 wien kernel: attempt to access beyond end of device
Jun 19 09:10:07 wien kernel: 08:04: rw=0, want=1788379804, limit=16860217
Jun 19 09:10:07 wien kernel: dev 09:00 blksize=1024 blocknr=447094950 sector=-718207696 size=4096 count=1

For some reason, it tried to access well beyond the end of one of the underlying drives. This caused that drive to fail. This relates to the subsequent message:

Jun 19 09:10:07 wien kernel: raid5: restarting stripe 3576759600

which strongly suggests that the filesystem actually asked the raid5 array for a block that was well out of range. In 2.4, this will be caught before the request gets to raid5. In 2.2 it isn't. The request goes on to raid5, raid5 blindly passes a bad request down to the disc. The disc reports an error, and raid5 thinks the disc has failed, rather than realise that it never should have made such a silly request. But why did the filesystem ask for a block that was out of range? This is the part that I cannot fathom. It would seem as though the filesystem got corrupt somehow. Maybe an indirect block got replaced with garbage, and ext2fs believed the indirect block and went seeking way off the end of the array. But I don't know how the corruption happened. Had you known enough to restart the array from the two apparently working drives, and then run fsck, it might have fixed things enough to keep going. Or it might not, depending on how much corruption there was. So, Summary of problems: 1/ md responds to a failure on a known-failed drive inappropriately. This shouldn't be fatal but needs fixing. 2/ md isn't thoughtful enough about updating the event counter on superblocks and can easily leave an array in an unbuildable state. This needs to be fixed. It's on my list... 3/ raid5 responds to a request for an out-of-bounds device address by passing on out-of-bounds device addresses to the drives, and then thinking that those drives are failed.
This is fixed in 2.4. 4/ Something caused some sort of filesystem corruption. I don't know what. NeilBrown
PATCH - md initialisation to accept devfs names
Linus,
 it is possible to start an md array from the boot command line with, e.g.

    md=0,/dev/something,/dev/somethingelse

However only names recognised by name_to_kdev_t work here. devfs based names do not work. To fix this, the following patch moves the name lookup from __setup time to __init time so that the devfs routines can be called. This patch is largely due to Dave Cinege, though I have made a few improvements (particularly removing the devices array from md_setup_args). The #ifdef MODULE that this patch removes is wholly within another #ifdef MODULE and so is totally pointless.

NeilBrown

--- ./drivers/md/md.c	2001/06/21 00:51:42	1.2
+++ ./drivers/md/md.c	2001/06/21 00:53:09	1.3
@@ -3638,7 +3638,7 @@
 	char device_set [MAX_MD_DEVS];
 	int pers[MAX_MD_DEVS];
 	int chunk[MAX_MD_DEVS];
-	kdev_t devices[MAX_MD_DEVS][MD_SB_DISKS];
+	char *device_names[MAX_MD_DEVS];
 } md_setup_args md__initdata;

 /*
@@ -3657,14 +3657,15 @@
  * md=n,device-list	reads a RAID superblock from the devices
  *		elements in device-list are read by name_to_kdev_t so can be
  *		a hex number or something like /dev/hda1 /dev/sdb
+ * 2001-06-03: Dave Cinege <[EMAIL PROTECTED]>
+ *		Shifted name_to_kdev_t() and related operations to md_set_drive()
+ *		for later execution. Rewrote section to make devfs compatible.
  */
-#ifndef MODULE
-extern kdev_t name_to_kdev_t(char *line) md__init;
 static int md__init md_setup(char *str)
 {
-	int minor, level, factor, fault, i=0;
-	kdev_t device;
-	char *devnames, *pername = "";
+	int minor, level, factor, fault;
+	char *pername = "";
+	char *str1 = str;

 	if (get_option(&str, &minor) != 2) {	/* MD Number */
 		printk("md: Too few arguments supplied to md=.\n");
@@ -3673,9 +3674,8 @@
 	if (minor >= MAX_MD_DEVS) {
 		printk ("md: Minor device number too high.\n");
 		return 0;
-	} else if (md_setup_args.device_set[minor]) {
-		printk ("md: Warning - md=%d,... has been specified twice;\n"
-			"    will discard the first definition.\n", minor);
+	} else if (md_setup_args.device_names[minor]) {
+		printk ("md: md=%d, Specified more then once. Replacing previous definition.\n", minor);
 	}
 	switch (get_option(&str, &level)) {	/* RAID Personality */
 	case 2: /* could be 0 or -1.. */
@@ -3706,53 +3706,72 @@
 	}
 	/* FALL THROUGH */
 	case 1: /* the first device is numeric */
-		md_setup_args.devices[minor][i++] = level;
+		str = str1;
 		/* FALL THROUGH */
 	case 0:
 		md_setup_args.pers[minor] = 0;
 		pername="super-block";
 	}
-	devnames = str;
-	for (; i<MD_SB_DISKS && str; i++) {
-		if ((device = name_to_kdev_t(str))) {
-			md_setup_args.devices[minor][i] = device;
-		} else {
-			printk ("md: Unknown device name, %s.\n", str);
-			return 0;
-		}
-		if ((str = strchr(str, ',')) != NULL)
-			str++;
-	}
-	if (!i) {
-		printk ("md: No devices specified for md%d?\n", minor);
-		return 0;
-	}
-
 	printk ("md: Will configure md%d (%s) from %s, below.\n",
-		minor, pername, devnames);
-	md_setup_args.devices[minor][i] = (kdev_t) 0;
-	md_setup_args.device_set[minor] = 1;
+		minor, pername, str);
+	md_setup_args.device_names[minor] = str;
+
 	return 1;
 }
-#endif /* !MODULE */

+extern kdev_t name_to_kdev_t(char *line) md__init;
 void md__init md_setup_drive(void)
 {
 	int minor, i;
 	kdev_t dev;
 	mddev_t *mddev;
+	kdev_t devices[MD_SB_DISKS+1];

 	for (minor = 0; minor < MAX_MD_DEVS; minor++) {
+		int err = 0;
+		char *devname;
 		mdu_disk_info_t dinfo;
-		int err = 0;

-		if (!md_setup_args.device_set[minor])
+		if ((devname = md_setup_args.device_names[minor]) == 0)
 			continue;
+
+		for (i = 0; i < MD_SB_DISKS && devname != 0; i++) {
+			char *p;
+			void *handle;
+
+			if ((p = strchr(devname, ',')) != NULL)
+				*p++ = 0;
+
+			dev = name_to_kdev_t(devname);
+			handle = devfs_find_handle(NULL, devname, MAJOR (dev), MINOR (dev),
+						   DEVFS_SPECIAL_BLK, 1);
+			if (handle != 0) {
+				unsigned major, minor;
+				devfs_get_maj_min(handle, &major, &minor);
PATCH - tag all printk's in md.c
Linus,
 This patch makes sure that all the printks in md.c print a message starting with md: or md%d:. The next step (not today) will be to reduce a lot of them to KERN_INFO or similar as md is really quite noisy. Also, two printks in raid1.c get prefixed with raid1:. This patch is partly due to Dave Cinege. While preparing this I noticed that write_disk_sb sometimes returns 1 for error, sometimes -1, and the return value is added into a cumulative error variable. So now it always returns 1. Also md_update_sb reports on each superblock (one per disk) on separate lines, but worries about inserting commas and ending with a full stop. I have removed the commas and full stop - vestiges of shorter device names I suspect.

NeilBrown

--- ./drivers/md/md.c	2001/06/21 00:53:09	1.3
+++ ./drivers/md/md.c	2001/06/21 00:53:39	1.4
@@ -634,7 +634,7 @@
 	md_list_add(&rdev->same_set, &mddev->disks);
 	rdev->mddev = mddev;
 	mddev->nb_dev++;
-	printk("bind<%s,%d>\n", partition_name(rdev->dev), mddev->nb_dev);
+	printk("md: bind<%s,%d>\n", partition_name(rdev->dev), mddev->nb_dev);
 }

 static void unbind_rdev_from_array (mdk_rdev_t * rdev)
@@ -646,7 +646,7 @@
 	md_list_del(&rdev->same_set);
 	MD_INIT_LIST_HEAD(&rdev->same_set);
 	rdev->mddev->nb_dev--;
-	printk("unbind<%s,%d>\n", partition_name(rdev->dev),
+	printk("md: unbind<%s,%d>\n", partition_name(rdev->dev),
 		rdev->mddev->nb_dev);
 	rdev->mddev = NULL;
 }
@@ -686,7 +686,7 @@

 static void export_rdev (mdk_rdev_t * rdev)
 {
-	printk("export_rdev(%s)\n",partition_name(rdev->dev));
+	printk("md: export_rdev(%s)\n",partition_name(rdev->dev));
 	if (rdev->mddev)
 		MD_BUG();
 	unlock_rdev(rdev);
@@ -694,7 +694,7 @@
 	md_list_del(&rdev->all);
 	MD_INIT_LIST_HEAD(&rdev->all);
 	if (rdev->pending.next != &rdev->pending) {
-		printk("(%s was pending)\n",partition_name(rdev->dev));
+		printk("md: (%s was pending)\n",partition_name(rdev->dev));
 		md_list_del(&rdev->pending);
 		MD_INIT_LIST_HEAD(&rdev->pending);
 	}
@@ -777,14 +777,14 @@
 {
 	int i;

-	printk("  SB: (V:%d.%d.%d) ID:<%08x.%08x.%08x.%08x> CT:%08x\n",
+	printk("md:  SB: (V:%d.%d.%d) ID:<%08x.%08x.%08x.%08x> CT:%08x\n",
 		sb->major_version, sb->minor_version, sb->patch_version,
 		sb->set_uuid0, sb->set_uuid1, sb->set_uuid2, sb->set_uuid3,
 		sb->ctime);
-	printk("     L%d S%08d ND:%d RD:%d md%d LO:%d CS:%d\n", sb->level,
+	printk("md:     L%d S%08d ND:%d RD:%d md%d LO:%d CS:%d\n", sb->level,
 		sb->size, sb->nr_disks, sb->raid_disks, sb->md_minor,
 		sb->layout, sb->chunk_size);
-	printk("     UT:%08x ST:%d AD:%d WD:%d FD:%d SD:%d CSUM:%08x E:%08lx\n",
+	printk("md:     UT:%08x ST:%d AD:%d WD:%d FD:%d SD:%d CSUM:%08x E:%08lx\n",
 		sb->utime, sb->state, sb->active_disks, sb->working_disks,
 		sb->failed_disks, sb->spare_disks, sb->sb_csum, (unsigned long)sb->events_lo);
@@ -793,24 +793,24 @@
 		mdp_disk_t *desc;

 		desc = sb->disks + i;
-		printk("     D %2d: ", i);
+		printk("md:     D %2d: ", i);
 		print_desc(desc);
 	}
-	printk("     THIS: ");
+	printk("md:     THIS: ");
 	print_desc(&sb->this_disk);
 }

 static void print_rdev(mdk_rdev_t *rdev)
 {
-	printk(" rdev %s: O:%s, SZ:%08ld F:%d DN:%d ",
+	printk("md: rdev %s: O:%s, SZ:%08ld F:%d DN:%d ",
 		partition_name(rdev->dev), partition_name(rdev->old_dev),
 		rdev->size, rdev->faulty, rdev->desc_nr);
 	if (rdev->sb) {
-		printk("rdev superblock:\n");
+		printk("md: rdev superblock:\n");
 		print_sb(rdev->sb);
 	} else
-		printk("no rdev superblock!\n");
+		printk("md: no rdev superblock!\n");
 }

 void md_print_devices (void)
@@ -820,9 +820,9 @@
 	mddev_t *mddev;

 	printk("\n");
-	printk("	**********************************\n");
-	printk("	* <COMPLETE RAID STATE PRINTOUT> *\n");
-	printk("	**********************************\n");
+	printk("md:	**********************************\n");
+	printk("md:	* <COMPLETE RAID STATE PRINTOUT> *\n");
+	printk("md:	**********************************\n");
 	ITERATE_MDDEV(mddev,tmp) {
 		printk("md%d: ", mdidx(mddev));
@@ -838,7 +838,7 @@
 		ITERATE_RDEV(mddev,rdev,tmp2)
 			print_rdev(rdev);
 	}
-	printk("**********************************\n");
+	printk("md:	**********************************\n");
 	printk("\n");
 }
@@ -917,15 +917,15 @@
 	if (!rdev->sb) {
 		MD_BUG();
-		return -1;
+		return 1;
 	}
 	if (rdev->faulty) {
PATCH - raid5 performance improvement - 3 of 3
Linus, and fellow RAIDers, This is the third in my three patch series for improving RAID5 throughput. This one substantially lifts write throughput by leveraging the opportunities for write gathering provided by the first patch. With RAID5, it is much more efficient to write a whole stripe full of data at a time as this avoids the need to pre-read any old data or parity from the discs. Without this patch, when a write request arrives, raid5 will immediately start a couple of pre-reads so that it will be able to write that block and update the parity. By the time that the old data and parity arrive it is quite possible that write requests for all the other blocks in the stripe will have been submitted, and the old data and parity will not be needed. This patch uses concepts similar to queue plugging to delay write requests slightly to improve the chance that many or even all of the data blocks in a stripe will have outstanding write requests before processing is started. To do this it maintains a queue of stripes that seem to require pre-reading. Stripes are only released from this queue when there are no other pre-read requests active, and then only if the raid5 device is not currently plugged. As I mentioned earlier, my testing shows substantial improvements from these three patches for both sequential (bonnie) and random (dbench) access patterns. I would be particularly interested if anyone else does any different testing, preferably comparing 2.2.19+patches with 2.4.5 and then with 2.4.5 plus these patches. I know of one problem area, being sequential writes to a 3 disc array. If anyone can find any other access patterns that still perform below 2.2.19 levels, I would really like to know about them.
NeilBrown

--- ./include/linux/raid/raid5.h	2001/06/21 01:01:46	1.3
+++ ./include/linux/raid/raid5.h	2001/06/21 01:04:05	1.4
@@ -158,6 +158,32 @@
 #define	STRIPE_HANDLE		2
 #define	STRIPE_SYNCING		3
 #define	STRIPE_INSYNC		4
+#define	STRIPE_PREREAD_ACTIVE	5
+#define	STRIPE_DELAYED		6
+
+/*
+ * Plugging:
+ *
+ * To improve write throughput, we need to delay the handling of some
+ * stripes until there has been a chance that several write requests
+ * for the one stripe have all been collected.
+ * In particular, any write request that would require pre-reading
+ * is put on a delayed queue until there are no stripes currently
+ * in a pre-read phase.  Further, if the delayed queue is empty when
+ * a stripe is put on it then we plug the queue and do not process it
+ * until an unplug call is made (the tq_disk list is run).
+ *
+ * When preread is initiated on a stripe, we set PREREAD_ACTIVE and add
+ * it to the count of prereading stripes.
+ * When write is initiated, or the stripe refcnt == 0 (just in case) we
+ * clear the PREREAD_ACTIVE flag and decrement the count.
+ * Whenever the delayed queue is empty and the device is not plugged, we
+ * move any strips from delayed to handle and clear the DELAYED flag and
+ * set PREREAD_ACTIVE.
+ * In stripe_handle, if we find pre-reading is necessary, we do it if
+ * PREREAD_ACTIVE is set, else we set DELAYED which will send it to the
+ * delayed queue.
+ * HANDLE gets cleared if stripe_handle leaves nothing locked.
+ */

 struct disk_info {
 	kdev_t	dev;
@@ -182,6 +208,8 @@
 	int			max_nr_stripes;

 	struct list_head	handle_list; /* stripes needing handling */
+	struct list_head	delayed_list; /* stripes that have plugged requests */
+	atomic_t		preread_active_stripes; /* stripes with scheduled io */
 	/*
 	 * Free stripes pool
 	 */
@@ -192,6 +220,9 @@
 					 * waiting for 25% to be free
 					 */
 	md_spinlock_t		device_lock;
+
+	int			plugged;
+	struct tq_struct	plug_tq;
 };

 typedef struct raid5_private_data raid5_conf_t;
--- ./drivers/md/raid5.c	2001/06/21 01:01:46	1.3
+++ ./drivers/md/raid5.c	2001/06/21 01:04:05	1.4
@@ -31,6 +31,7 @@
  */

 #define NR_STRIPES		256
+#define	IO_THRESHOLD		1
 #define HASH_PAGES		1
 #define HASH_PAGES_ORDER	0
 #define NR_HASH			(HASH_PAGES * PAGE_SIZE / sizeof(struct stripe_head *))
@@ -65,11 +66,17 @@
 		BUG();
 	if (atomic_read(&conf->active_stripes)==0)
 		BUG();
-	if (test_bit(STRIPE_HANDLE, &sh->state)) {
+	if (test_bit(STRIPE_DELAYED, &sh->state))
+		list_add_tail(&sh->lru, &conf->delayed_list);
+	else if (test_bit(STRIPE_HANDLE, &sh->state)) {
 		list_add_tail(&sh->lru, &conf->handle_list);
PATCH
Linus,
 There is a buggy BUG in the raid5 code. If a request on an underlying device reports an error, raid5 finds out which device that was and marks it as failed. This is fine. If another request on the same device reports an error, raid5 fails to find that device in its table (because though it is there, it is not operational), and so it thinks something is wrong and calls MD_BUG() - which is very noisy, though not actually harmful (except to the confidence of the sysadmin). This patch changes the test so that a failure on a drive that is known but not operational will be expected and not a BUG.

NeilBrown

--- ./drivers/md/raid5.c	2001/06/21 01:04:05	1.4
+++ ./drivers/md/raid5.c	2001/06/21 01:04:41	1.5
@@ -486,22 +486,24 @@
 	PRINTK("raid5_error called\n");
 	conf->resync_parity = 0;
 	for (i = 0, disk = conf->disks; i < conf->raid_disks; i++, disk++) {
-		if (disk->dev == dev && disk->operational) {
-			disk->operational = 0;
-			mark_disk_faulty(sb->disks+disk->number);
-			mark_disk_nonsync(sb->disks+disk->number);
-			mark_disk_inactive(sb->disks+disk->number);
-			sb->active_disks--;
-			sb->working_disks--;
-			sb->failed_disks++;
-			mddev->sb_dirty = 1;
-			conf->working_disks--;
-			conf->failed_disks++;
-			md_wakeup_thread(conf->thread);
-			printk (KERN_ALERT
-				"raid5: Disk failure on %s, disabling device."
-				" Operation continuing on %d devices\n",
-				partition_name (dev), conf->working_disks);
+		if (disk->dev == dev) {
+			if (disk->operational) {
+				disk->operational = 0;
+				mark_disk_faulty(sb->disks+disk->number);
+				mark_disk_nonsync(sb->disks+disk->number);
+				mark_disk_inactive(sb->disks+disk->number);
+				sb->active_disks--;
+				sb->working_disks--;
+				sb->failed_disks++;
+				mddev->sb_dirty = 1;
+				conf->working_disks--;
+				conf->failed_disks++;
+				md_wakeup_thread(conf->thread);
+				printk (KERN_ALERT
+					"raid5: Disk failure on %s, disabling device."
+					" Operation continuing on %d devices\n",
+					partition_name (dev), conf->working_disks);
+			}
 			return 0;
 		}
 	}
Re: du discrepancies?
On Friday June 15, [EMAIL PROTECTED] wrote: There appears to be a discrepancy between the true state of affairs on my RAID partitions and what df reports;

[root /]# sfdisk -l /dev/hda
Disk /dev/hda: 38792 cylinders, 16 heads, 63 sectors/track
Units = cylinders of 516096 bytes, blocks of 1024 bytes, counting from 0

   Device Boot Start    End  #cyls  #blocks  Id  System
/dev/hda1           0+  1523  1524-  768095+  fd  Linux raid autodetect
/dev/hda2        1524   1845   322   162288    5  Extended
/dev/hda3        1846   2252   407   205128   fd  Linux raid autodetect
/dev/hda4        2253  38791 36539  18415656  fd  Linux raid autodetect
/dev/hda5        1524+  1584    61-   30743+  83  Linux
/dev/hda6        1585+  1845   261-  131543+  82  Linux swap

[root /]# df
Filesystem  1k-blocks     Used  Available  Use%  Mounted on
/dev/md1       755920   666748      50772   93%  /      WRONG
/dev/md3       198313    13405     174656    7%  /var   WRONG
/dev/md4     18126088   118288   17087024    1%  /home  WRONG

These figures are clearly wrong. Can anyone suggest where I should start looking for an explanation? How can figures be wrong? They are just figures. What do you think is wrong about them?? Anyway, for a more useful response... I assume that md[134] are RAID1 arrays, with one mirror on hda. Let's take md1, made in part of hda1. hda1 has 768095 1K blocks. md/raid rounds down to a multiple of 64K, and then removes the last 64K for the raid super block, leaving 768000 1K blocks. ext2fs uses some of this for metadata, and reports the rest as the available space. The overhead space comprises the superblocks, the block group descriptors, the inode bitmaps, the block bitmaps, and the inode tables. This seems to add up to 12080K on this filesystem, about 1.6%. NeilBrown
Re: mdctl - names and code
Thank you for all the suggestions for names for mdctl. We have: raidctl, raidctrl, swraidctl, mdtools, mdutils, mdmanage, mdmgr, mdmd :-), mdcfg, mdconfig, mdadmin. Mike Black suggested that it is valuable for tools that are related to start with a common prefix so that command completion can be used to find them. I think that is very true but, in this case, irrelevant. mdctl (or whatever it gets called) will be one tool that does everything. I might arrange - for backward compatibility - that if you call it with a name like raidhotadd it will default to the hot-add functionality, but I don't expect that to be normal usage. I have previously said that I am not very fond of raidctl as raid is a bit too generic. swraidctl is better but harder to pronounce. I actually rather like md. It has a pleasing similarity to mt. Also man 1 md would document the user interface, and man 5 md would document the device driver. This is elegant. But maybe a bit too terse. I'm currently leaning towards mdadm or mdadmin as it is easy to pronounce (for my palate anyway) and has the right sort of meaning. I think I will keep the name at mdctl until I achieve all the functionality I want, and then when I release v1.0 I will change the name to what seems best at the time. Thanks again for the suggestions and interest. For the eager, http://www.cse.unsw.edu.au/~neilb/source/mdctl/mdctl-v0.3.tgz contains my latest source which has most of the functionality in place, though it doesn't all work quite right yet. You can create a raid5 array with:

    mdctl --create /dev/md0 --level raid5 --chunk 32 --raid-disks 3 \
          /dev/sda /dev/sdb /dev/sdc

and stop it with

    mdctl --stop /dev/md0

and then assemble it with

    mdctl --assemble /dev/md0 /dev/sda /dev/sdb /dev/sdc

I wouldn't actually trust a raid array that you build this way though. Some fields in the super block aren't right yet. I am very interested in comments on the interface and usability.
NeilBrown
Re: mdctl
On Friday June 8, [EMAIL PROTECTED] wrote: On Fri, 8 Jun 2001, Neil Brown wrote: If you don't like the name mdctl (I don't), please suggest another. How about raidctrl? Possible... though I don't think it is much better. Longer to type too :-) I kind of like having the md in there as it is the md driver. raid is a generic term, and mdctl doesn't work with all raid (i.e. not firmware raid), only software raid, and in particular, only the md driver. But thanks for the suggestion, I will keep it in mind and see if it grows on me. NeilBrown -- MfG / Regards Friedrich Lobenstock
Re: failure of raid 5 when first disk is unavailable
On Thursday June 7, [EMAIL PROTECTED] wrote: Hi Neil; I am hoping you are going to tell me this is already solved, but here goes... Almost :-) scenario: hda4, hdb4, and hdc4 in a raid 5 with no hotspare. With 2.4.3 XFS kernels, it seems that a raid 5 does not come up correctly if the first disk is unavailable. The error message that arises in the syslog from the md driver is:

md: could not lock hda4, zero size? marking faulty
md: could not import hda4.
md: autostart hda4 failed!

Yep. This happens when you use raidstart. It doesn't happen if you set the partition type to LINUX_RAID and use the autodetect functionality. raidstart just takes one drive, gives it to the kernel, and says look in the superblock for the major/minor of the other devices. This has several failure modes. It is partly for this reason that I am writing a replacement md management tool - mdctl. I wasn't going to announce it just yet because it is very incomplete, but you have pushed me into it :-) http://www.cse.unsw.edu.au/~neilb/source/mdctl/mdctl-v0.2.tgz is a current snapshot. It compiles (for me) and mdctl --help works. mdctl --Assemble is nearly there. Comments welcome. In 2.2, there is no other way to start an array than to give one device to the kernel and tell it to look for others. So mdctl will find the device numbers of the devices in the array and re-write the super block if necessary to make the array start. In 2.4, mdctl can use a SET_ARRAY_INFO / ADD_NEW_DISK* / RUN_ARRAY sequence to start a new array. If you don't like the name mdctl (I don't), please suggest another. NeilBrown
Re: mdrecoveryd invalid operand error
On Wednesday June 6, [EMAIL PROTECTED] wrote: In the XFS kernel tree v2.4.3 w/ several patches, we were unable to raidhotremove and subsequently raidhotadd a spare without a reboot. It did not matter if you had a new or the same hard disk. We then tried the patch Ingo Molnar sent regarding the issue. http://www.mail-archive.com/linux-raid@vger.kernel.org/msg00551.html This solved the problem of not doing a reboot and trying to switch a hotspare and faulty drive. In addition, however, we are seeing a kernel panic using raid 5 when mdrecoveryd starts after doing the hotspare and faulty drive swap a second time without a reboot. ... Do you have any suggestions Neil? Yep. Upgrade to the latest kernel! (Don't you just hate it when that turns out to be the answer). Ingo's patch is half right, but not quite there. http://www.cse.unsw.edu.au/~neilb/patches/linux/2.4.5-pre2/patch-A-rdevfix contains a correct version of the patch. It is in 2.4.5. NeilBrown
Re: md= problems with devfs names
On Saturday June 2, [EMAIL PROTECTED] wrote: I've moved from:

    md=4,/dev/sdf5,/dev/sdg5

to:

    md=4,/dev/scsi/host0/bus0/target30/lun0/part5,\
    /dev/scsi/host0/bus0/target32/lun0/part5

And now get:

    md: Unknown device name, /dev/scsi/host0/bus0/target30/lun0/part5,/dev/scsi/host0/bus0/target32/lun0/part5.

:( md_setup() is displaying the error due to failing on name_to_kdev_t(). root_dev_setup() calls name_to_kdev_t() with a long devfs name without a problem, so that's not the issue directly. Yes... this is all very ugly. root_dev_setup also stores the device name in root_device_name. And then when actually mounting root in fs/super.c::mount_root, devfs_find_handle is called to map that name into a devfs object. So maybe md_setup should store names as well, and md_setup_drive should call devfs_find_handle like mount_root does. But probably sticking with non-devfs names is easier. Was there a particular need to change to devfs naming? NeilBrown I think md_setup() is being run before the devfs names are fully registered, but I have no clue how the execution order of __setup() items is determined. Help? Dave md_setup() is run VERY early, much earlier than raid_setup(). dmesg excerpt:

    mapped APIC to e000 (fee0)
    mapped IOAPIC to d000 (fec0)
    Kernel command line: devfs=mount raid=noautodetect root=/dev/scsi/host0/bus0/target2/lun0/part7 md=4,/dev/scsi/host0/bus0/target30/lun0/part5,/dev/scsi/host0/bus0/target32/lun0/part5 mem=393216K
    md: Unknown device name, /dev/scsi/host0/bus0/target30/lun0/part5,/dev/scsi/host0/bus0/target32/lun0/part5.
    Initializing CPU#0
    Detected 875.429 MHz processor.
    Console: colour VGA+ 80x25
[PATCH] raid1 to use sector numbers in b_blocknr
Linus, raid1 allocates a new buffer_head when passing a request down to an underlying device. It currently sets b_blocknr to b_rsector/(b_size>>9) from the original buffer_head to parallel other uses of b_blocknr (i.e. it being the number of the block). However, if raid1 gets a non-aligned request, then the calculation of b_blocknr would lose information, resulting in potential data corruption if the request were resubmitted to a different drive on failure. Non-aligned requests aren't currently possible (I believe) but newer filesystems are likely to want them soon, and if a raid1 array were to be partitioned into partitions that were not page aligned, it could happen. This patch changes the usage of b_blocknr in raid1.c to store the value of b_rsector of the incoming request. Also, I remove the third argument to raid1_map which is never used.

NeilBrown

--- ./drivers/md/raid1.c	2001/05/23 01:18:15	1.1
+++ ./drivers/md/raid1.c	2001/05/23 01:18:19	1.2
@@ -298,7 +298,7 @@
 	md_spin_unlock_irq(&conf->device_lock);
 }
 
-static int raid1_map (mddev_t *mddev, kdev_t *rdev, unsigned long size)
+static int raid1_map (mddev_t *mddev, kdev_t *rdev)
 {
 	raid1_conf_t *conf = mddev_to_conf(mddev);
 	int i, disks = MD_SB_DISKS;
@@ -602,7 +602,7 @@
 		bh_req = &r1_bh->bh_req;
 		memcpy(bh_req, bh, sizeof(*bh));
-		bh_req->b_blocknr = bh->b_rsector / sectors;
+		bh_req->b_blocknr = bh->b_rsector;
 		bh_req->b_dev = mirror->dev;
 		bh_req->b_rdev = mirror->dev;
 	/*	bh_req->b_rsector = bh->n_rsector; */
@@ -646,7 +646,7 @@
 		/*
 		 * prepare mirrored mbh (fields ordered for max mem throughput):
 		 */
-		mbh->b_blocknr    = bh->b_rsector / sectors;
+		mbh->b_blocknr    = bh->b_rsector;
 		mbh->b_dev        = conf->mirrors[i].dev;
 		mbh->b_rdev       = conf->mirrors[i].dev;
 		mbh->b_rsector    = bh->b_rsector;
@@ -1138,7 +1138,6 @@
 	int disks = MD_SB_DISKS;
 	struct buffer_head *bhl, *mbh;
 	raid1_conf_t *conf;
-	int sectors = bh->b_size >> 9;
 
 	conf = mddev_to_conf(mddev);
 	bhl = raid1_alloc_bh(conf, conf->raid_disks); /* don't really need this many */
@@ -1168,7 +1167,7 @@
 		mbh->b_blocknr    = bh->b_blocknr;
 		mbh->b_dev        = conf->mirrors[i].dev;
 		mbh->b_rdev       = conf->mirrors[i].dev;
-		mbh->b_rsector    = bh->b_blocknr * sectors;
+		mbh->b_rsector    = bh->b_blocknr;
 		mbh->b_state      = (1<<BH_Req) | (1<<BH_Dirty) | (1<<BH_Mapped) | (1<<BH_Lock);
 		atomic_set(&mbh->b_count, 1);
@@ -1195,7 +1194,7 @@
 			}
 		} else {
 			dev = bh->b_dev;
-			raid1_map (mddev, &bh->b_dev, bh->b_size >> 9);
+			raid1_map (mddev, &bh->b_dev);
 			if (bh->b_dev == dev) {
 				printk (IO_ERROR, partition_name(bh->b_dev), bh->b_blocknr);
 				md_done_sync(mddev, bh->b_size >> 9, 0);
@@ -1203,6 +1202,7 @@
 				printk (REDIRECT_SECTOR, partition_name(bh->b_dev), bh->b_blocknr);
 				bh->b_rdev = bh->b_dev;
+				bh->b_rsector = bh->b_blocknr;
 				generic_make_request(READ, bh);
 			}
 		}
@@ -1211,8 +1211,7 @@
 	case READ:
 	case READA:
 		dev = bh->b_dev;
-
-		raid1_map (mddev, &bh->b_dev, bh->b_size >> 9);
+		raid1_map (mddev, &bh->b_dev);
 		if (bh->b_dev == dev) {
 			printk (IO_ERROR, partition_name(bh->b_dev), bh->b_blocknr);
 			raid1_end_bh_io(r1_bh, 0);
@@ -1220,6 +1219,7 @@
 			printk (REDIRECT_SECTOR, partition_name(bh->b_dev), bh->b_blocknr);
 			bh->b_rdev = bh->b_dev;
+			bh->b_rsector = bh->b_blocknr;
 			generic_make_request (r1_bh->cmd, bh);
 		}
 		break;
@@ -1313,6 +1313,7 @@
 	struct
[PATCH] raid resync by sectors to allow for 512byte block filesystems
Linus, The current raid1/raid5 resync code requests resync in units of 1k (though the raid personality can round up requests if it likes). This interacts badly with filesystems that do IO in 512 byte blocks, such as XFS (because raid5 needs to use the same blocksize for IO and resync). The attached patch changes the resync code to work in units of sectors, which makes more sense and plays nicely with XFS.

NeilBrown

--- ./drivers/md/md.c	2001/05/17 05:50:51	1.1
+++ ./drivers/md/md.c	2001/05/17 06:11:50	1.2
@@ -2997,7 +2997,7 @@
 	int sz = 0;
 	unsigned long max_blocks, resync, res, dt, db, rt;
 
-	resync = mddev->curr_resync - atomic_read(&mddev->recovery_active);
+	resync = (mddev->curr_resync - atomic_read(&mddev->recovery_active))/2;
 	max_blocks = mddev->sb->size;
 
 	/*
@@ -3042,7 +3042,7 @@
 	 */
 	dt = ((jiffies - mddev->resync_mark) / HZ);
 	if (!dt) dt++;
-	db = resync - mddev->resync_mark_cnt;
+	db = resync - (mddev->resync_mark_cnt/2);
 	rt = (dt * ((max_blocks-resync) / (db/100+1)))/100;
 
 	sz += sprintf(page + sz, " finish=%lu.%lumin", rt / 60, (rt % 60)/6);
@@ -3217,7 +3217,7 @@
 
 void md_done_sync(mddev_t *mddev, int blocks, int ok)
 {
-	/* another "blocks" (1K) blocks have been synced */
+	/* another "blocks" (512byte) blocks have been synced */
 	atomic_sub(blocks, &mddev->recovery_active);
 	wake_up(&mddev->recovery_wait);
 	if (!ok) {
@@ -3230,7 +3230,7 @@
 int md_do_sync(mddev_t *mddev, mdp_disk_t *spare)
 {
 	mddev_t *mddev2;
-	unsigned int max_blocks, currspeed,
+	unsigned int max_sectors, currspeed,
 		j, window, err, serialize;
 	kdev_t read_disk = mddev_to_kdev(mddev);
 	unsigned long mark[SYNC_MARKS];
@@ -3267,7 +3267,7 @@
 
 	mddev->curr_resync = 1;
 
-	max_blocks = mddev->sb->size;
+	max_sectors = mddev->sb->size << 1;
 
 	printk(KERN_INFO "md: syncing RAID array md%d\n", mdidx(mddev));
 	printk(KERN_INFO "md: minimum _guaranteed_ reconstruction speed: %d KB/sec/disc.\n",
@@ -3291,23 +3291,23 @@
 	/*
 	 * Tune reconstruction:
 	 */
-	window = MAX_READAHEAD*(PAGE_SIZE/1024);
-	printk(KERN_INFO "md: using %dk window, over a total of %d blocks.\n",window,max_blocks);
+	window = MAX_READAHEAD*(PAGE_SIZE/512);
+	printk(KERN_INFO "md: using %dk window, over a total of %d blocks.\n",window/2,max_sectors/2);
 
 	atomic_set(&mddev->recovery_active, 0);
 	init_waitqueue_head(&mddev->recovery_wait);
 	last_check = 0;
-	for (j = 0; j < max_blocks;) {
-		int blocks;
+	for (j = 0; j < max_sectors;) {
+		int sectors;
 
-		blocks = mddev->pers->sync_request(mddev, j);
+		sectors = mddev->pers->sync_request(mddev, j);
 
-		if (blocks < 0) {
-			err = blocks;
+		if (sectors < 0) {
+			err = sectors;
 			goto out;
 		}
-		atomic_add(blocks, &mddev->recovery_active);
-		j += blocks;
+		atomic_add(sectors, &mddev->recovery_active);
+		j += sectors;
 		mddev->curr_resync = j;
 
 		if (last_check + window > j)
@@ -3325,7 +3325,7 @@
 			mark_cnt[next] = j - atomic_read(&mddev->recovery_active);
 			last_mark = next;
 		}
-		
+
 		if (md_signal_pending(current)) {
 			/*
@@ -3350,7 +3350,7 @@
 		if (md_need_resched(current))
 			schedule();
 
-		currspeed = (j-mddev->resync_mark_cnt)/((jiffies-mddev->resync_mark)/HZ +1) +1;
+		currspeed = (j-mddev->resync_mark_cnt)/2/((jiffies-mddev->resync_mark)/HZ +1) +1;
 
 		if (currspeed > sysctl_speed_limit_min) {
 			current->nice = 19;
--- ./drivers/md/raid5.c	2001/05/17 05:50:51	1.1
+++ ./drivers/md/raid5.c	2001/05/17 06:11:51	1.2
@@ -886,7 +886,7 @@
 		}
 	}
 	if (syncing) {
-		md_done_sync(conf->mddev, (sh->size>>10) - sh->sync_redone,0);
+		md_done_sync(conf->mddev, (sh->size>>9) - sh->sync_redone,0);
 		clear_bit(STRIPE_SYNCING, &sh->state);
 		syncing = 0;
 	}
@@ -1059,7 +1059,7 @@
 		}
 	}
 	if (syncing && locked == 0 && test_bit(STRIPE_INSYNC, &sh->state)) {
-		md_done_sync(conf->mddev, (sh->size>>10) - sh->sync_redone,1);
+		md_done_sync(conf->mddev, (sh->size>>9) - sh->sync_redone,1);
 		clear_bit(STRIPE_SYNCING, &sh->state);
 	}
@@ -1153,13 +1153,13 @@
 	return correct_size;
 }
 
-static int raid5_sync_request (mddev_t *mddev, unsigned long block_nr)
+static int
[PATCH] RAID5 NULL Checking Bug Fix
On Wednesday May 16, [EMAIL PROTECTED] wrote: (more patches to come. They will go to Linus, Alan, and linux-raid only). This is the next one, which actually addresses the NULL Checking Bug. There are two places in the raid code which allocate memory without (properly) checking for failure and which are fixed in -ac, both in raid5.c. One in grow_buffers and one in __check_consistency. The one in grow_buffers is definitely right and included below in a slightly different form - fewer characters. The one in __check_consistency is best fixed by simply removing __check_consistency. __check_consistency reads stripes at some offset into the array and checks that parity is correct. This is called once, but the result is ignored. Presumably this is a hangover from times gone by when the superblock didn't have proper versioning and there was no auto-rebuild process. It is now irrelevant and can go. There is similar code in raid1.c which should also go. This patch removes said code.

NeilBrown

--- ./drivers/md/raid5.c	2001/05/16 05:14:39	1.2
+++ ./drivers/md/raid5.c	2001/05/16 05:27:20	1.3
@@ -156,9 +156,9 @@
 			return 1;
 		memset(bh, 0, sizeof (struct buffer_head));
 		init_waitqueue_head(&bh->b_wait);
-		page = alloc_page(priority);
-		bh->b_data = page_address(page);
-		if (!bh->b_data) {
+		if ((page = alloc_page(priority)))
+			bh->b_data = page_address(page);
+		else {
 			kfree(bh);
 			return 1;
 		}
@@ -1256,76 +1256,6 @@
 	printk("raid5: resync finished.\n");
 }
 
-static int __check_consistency (mddev_t *mddev, int row)
-{
-	raid5_conf_t *conf = mddev->private;
-	kdev_t dev;
-	struct buffer_head *bh[MD_SB_DISKS], *tmp = NULL;
-	int i, ret = 0, nr = 0, count;
-	struct buffer_head *bh_ptr[MAX_XOR_BLOCKS];
-
-	if (conf->working_disks != conf->raid_disks)
-		goto out;
-	tmp = kmalloc(sizeof(*tmp), GFP_KERNEL);
-	tmp->b_size = 4096;
-	tmp->b_page = alloc_page(GFP_KERNEL);
-	tmp->b_data = page_address(tmp->b_page);
-	if (!tmp->b_data)
-		goto out;
-	md_clear_page(tmp->b_data);
-	memset(bh, 0, MD_SB_DISKS * sizeof(struct buffer_head *));
-	for (i = 0; i < conf->raid_disks; i++) {
-		dev = conf->disks[i].dev;
-		set_blocksize(dev, 4096);
-		bh[i] = bread(dev, row / 4, 4096);
-		if (!bh[i])
-			break;
-		nr++;
-	}
-	if (nr == conf->raid_disks) {
-		bh_ptr[0] = tmp;
-		count = 1;
-		for (i = 1; i < nr; i++) {
-			bh_ptr[count++] = bh[i];
-			if (count == MAX_XOR_BLOCKS) {
-				xor_block(count, &bh_ptr[0]);
-				count = 1;
-			}
-		}
-		if (count != 1) {
-			xor_block(count, &bh_ptr[0]);
-		}
-		if (memcmp(tmp->b_data, bh[0]->b_data, 4096))
-			ret = 1;
-	}
-	for (i = 0; i < conf->raid_disks; i++) {
-		dev = conf->disks[i].dev;
-		if (bh[i]) {
-			bforget(bh[i]);
-			bh[i] = NULL;
-		}
-		fsync_dev(dev);
-		invalidate_buffers(dev);
-	}
-	free_page((unsigned long) tmp->b_data);
-out:
-	if (tmp)
-		kfree(tmp);
-	return ret;
-}
-
-static int check_consistency (mddev_t *mddev)
-{
-	if (__check_consistency(mddev, 0))
-/*
- * We are not checking this currently, as it's legitimate to have
- * an inconsistent array, at creation time.
- */
-		return 0;
-
-	return 0;
-}
-
 static int raid5_run (mddev_t *mddev)
 {
 	raid5_conf_t *conf;
@@ -1483,12 +1413,6 @@
 	if (conf->working_disks != sb->raid_disks) {
 		printk(KERN_ALERT "raid5: md%d, not all disks are operational -- trying to recover array\n", mdidx(mddev));
 		start_recovery = 1;
-	}
-
-	if (!start_recovery && (sb->state & (1 << MD_SB_CLEAN)) &&
-	    check_consistency(mddev)) {
-		printk(KERN_ERR "raid5: detected raid-5 superblock xor inconsistency -- running resync\n");
-		sb->state &= ~(1 << MD_SB_CLEAN);
 	}
 
 	{
--- ./drivers/md/raid1.c	2001/05/16 05:14:39	1.2
+++ ./drivers/md/raid1.c	2001/05/16 05:27:20	1.3
@@ -1448,69 +1448,6 @@
 	}
 }
 
-/*
- * This will catch the scenario in which one of the mirrors was
- * mounted as a normal device rather than as a part of a raid set.
- *
- * check_consistency is very personality-dependent, eg. RAID5 cannot
- * do this check, it uses another method.
- */
-static int __check_consistency
[PATCH] - md_error gets simpler
Linus, This isn't a bug fix, just a tidy up. Currently, md_error - which is called when an underlying device detects an error - takes a kdev_t to identify which md array is affected. It converts this into a mddev_t structure pointer, and in every case, the caller already has the desired structure pointer. This patch changes md_error and the callers to pass an mddev_t instead of a kdev_t.

NeilBrown

--- ./include/linux/raid/md.h	2001/05/16 06:08:41	1.1
+++ ./include/linux/raid/md.h	2001/05/16 06:10:02	1.2
@@ -80,7 +80,7 @@
 extern struct gendisk * find_gendisk (kdev_t dev);
 extern int md_notify_reboot(struct notifier_block *this,
 					unsigned long code, void *x);
-extern int md_error (kdev_t mddev, kdev_t rdev);
+extern int md_error (mddev_t *mddev, kdev_t rdev);
 extern int md_run_setup(void);
 extern void md_print_devices (void);
--- ./drivers/md/raid5.c	2001/05/16 05:27:20	1.3
+++ ./drivers/md/raid5.c	2001/05/16 06:10:02	1.4
@@ -412,7 +412,7 @@
 			spin_lock_irqsave(&conf->device_lock, flags);
 		}
 	} else {
-		md_error(mddev_to_kdev(conf->mddev), bh->b_dev);
+		md_error(conf->mddev, bh->b_dev);
 		clear_bit(BH_Uptodate, &bh->b_state);
 	}
 	clear_bit(BH_Lock, &bh->b_state);
@@ -440,7 +440,7 @@
 	md_spin_lock_irqsave(&conf->device_lock, flags);
 	if (!uptodate)
-		md_error(mddev_to_kdev(conf->mddev), bh->b_dev);
+		md_error(conf->mddev, bh->b_dev);
 	clear_bit(BH_Lock, &bh->b_state);
 	set_bit(STRIPE_HANDLE, &sh->state);
 	__release_stripe(conf, sh);
--- ./drivers/md/md.c	2001/05/16 06:08:41	1.1
+++ ./drivers/md/md.c	2001/05/16 06:10:03	1.2
@@ -2464,7 +2464,7 @@
 	int ret;
 
 	fsync_dev(mddev_to_kdev(mddev));
-	ret = md_error(mddev_to_kdev(mddev), dev);
+	ret = md_error(mddev, dev);
 	return ret;
 }
@@ -2938,13 +2938,11 @@
 }
 
-int md_error (kdev_t dev, kdev_t rdev)
+int md_error (mddev_t *mddev, kdev_t rdev)
 {
-	mddev_t *mddev;
 	mdk_rdev_t * rrdev;
 	int rc;
 
-	mddev = kdev_to_mddev(dev);
 /*	printk("md_error dev:(%d:%d), rdev:(%d:%d), (caller: %p,%p,%p,%p).\n",MAJOR(dev),MINOR(dev),MAJOR(rdev),MINOR(rdev), __builtin_return_address(0),__builtin_return_address(1),__builtin_return_address(2),__builtin_return_address(3));
 */
 	if (!mddev) {
--- ./drivers/md/raid1.c	2001/05/16 05:27:20	1.3
+++ ./drivers/md/raid1.c	2001/05/16 06:10:03	1.4
@@ -388,7 +388,7 @@
 	 * this branch is our 'one mirror IO has finished' event handler:
 	 */
 	if (!uptodate)
-		md_error (mddev_to_kdev(r1_bh->mddev), bh->b_dev);
+		md_error (r1_bh->mddev, bh->b_dev);
 	else
 		/*
 		 * Set R1BH_Uptodate in our master buffer_head, so that
@@ -1426,7 +1426,7 @@
 	 * We don't do much here, just schedule handling by raid1d
 	 */
 	if (!uptodate)
-		md_error (mddev_to_kdev(r1_bh->mddev), bh->b_dev);
+		md_error (r1_bh->mddev, bh->b_dev);
 	else
 		set_bit(R1BH_Uptodate, &r1_bh->state);
 	raid1_reschedule_retry(r1_bh);
@@ -1437,7 +1437,7 @@
 	struct raid1_bh * r1_bh = (struct raid1_bh *)(bh->b_private);
 
 	if (!uptodate)
-		md_error (mddev_to_kdev(r1_bh->mddev), bh->b_dev);
+		md_error (r1_bh->mddev, bh->b_dev);
 	if (atomic_dec_and_test(&r1_bh->remaining)) {
 		mddev_t *mddev = r1_bh->mddev;
 		unsigned long sect = bh->b_blocknr * (bh->b_size >> 9);
Re: spare-disk in a RAID1 set? Conflicting answers...
On Saturday April 28, [EMAIL PROTECTED] wrote: Question: can you have one or more spare-disk entries in /etc/raidtab when running a RAID1 set? First answer: the Linux Software-RAID HOWTO says yes, and gives an example of this in the section on RAID1 config in raidtab. Second answer: the manpage for raidtab says no, stating that spare-disk is only valid for RAID4 and RAID5. Hm.. Which is it?

You can certainly have spare discs in raid1 arrays.

NeilBrown

I'm running Mandrake 8.0, which is a 2.4.3 kernel. I haven't tried to actually use a spare-disk entry yet, because I'm still waiting for the third disk for my RAID1 set to get here, but I thought I'd ask to see if anybody knows for sure. If not, I'll experiment with it once my third disk gets here and report back. Thanks! - Al
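For anyone finding this thread in the archives, a raidtab stanza with a spare for a RAID1 set might look like the sketch below. The device names and chunk size are hypothetical, chosen only for illustration; the HOWTO remains the authoritative reference for the syntax:

```
raiddev /dev/md0
    raid-level              1
    nr-raid-disks           2
    nr-spare-disks          1
    persistent-superblock   1
    chunk-size              4
    device                  /dev/sdb1
    raid-disk               0
    device                  /dev/sdc1
    raid-disk               1
    device                  /dev/sdd1
    spare-disk              0
```

With this layout, /dev/sdd1 sits idle until one of the two mirrors fails, at which point reconstruction onto the spare starts automatically.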
Re: Strange performance results in RAID5
On Thursday March 29, [EMAIL PROTECTED] wrote: Hi, I have been doing some performance checks on my RAID 5 system. Good. The system is 5 Seagate Cheetahs X15 Linux 2.4.2 I am using IOtest 3.0 on /dev/md0 My chunk size is 1M... When I do random reads of 64K blobs using one process, I get 100 reads/sec, which is the same as doing random reads on one disk. So I was quite happy with that. My next test was to do random reads using ten processes, I expected 500 reads/sec, however, I only got 250 reads/sec. This to me doesn't seem right??? Does anyone know why this is the case?

A few possibilities:

1/ you didn't say how fast your SCSI bus is. I guess if it is reasonably new it would be at least 80Mb/sec which should allow 500 * 64K/s but it wouldn't have to be too old to not allow that, and I don't like to assume things that aren't stated.

2/ You could be being slowed down by the stripe cache - it only allows 256 concurrent 4k accesses. Try increasing NR_STRIPES at the top of drivers/md/raid5.c - say to 2048. See if that makes a difference.

3/ Also, try applying http://www.cse.unsw.edu.au/~neilb/patches/linux/2.4.3-pre6/patch-F-raid5readbypass This patch speeds up large sequential reads, at a possible small cost to random read-modify-writes (I haven't measured any problems, but haven't had the time to explore the performance thoroughly). What it does is read directly into the filesystem's buffer instead of into the stripe cache and then memcpy into the filesys buffer.

4/ I'm assuming you are doing direct IO to /dev/md0. Try making and mounting a filesystem on /dev/md0 first. This will switch the device blocksize to 4K (if you have a 4k block size filesystem). The larger block size improves performance substantially. I always do I/O tests to a filesystem, not to the block device, because it makes a difference and it is a filesystem that I want to use (though I realise that you may not).
NeilBrown
Re: Error Injector?
On Wednesday March 21, [EMAIL PROTECTED] wrote: My question is based upon prior experience working for Stratus Computer. At Stratus it was impractical to go beat the disk drives with a hammer to cause them to fail - rather we would simply use a utility to cause the disk driver to begin to get "errors" from the drives. This would then exercise the recovery mechanism - taking a drive off line and bringing another up to take its place. This facility is also present in Veritas Volume Manager test suites to exercise the code.

raidsetfaulty should do what you want. It is part of the latest raidtools-0.90. If you don't have it, get the source from www.kernel.org/pub/linux/daemons/... (it might be devel rather than daemons, I'm not sure).

NeilBrown
Re: disk fails in raid5 but not in raid0
On Monday March 19, [EMAIL PROTECTED] wrote: Hi, I have a RAID setup, 3 Compaq 4Gb drives running off an Adaptec 2940UW. Kernel 2.2.18 with RAID-patches etc. I have been trying out various options, doing some stress-testing etc., and I have now arrived at the following situation that I cannot explain: when running the 3 drives in a RAID5 config, one of the drives (always the same one) will always fail during heavy IO or during a resync phase. It appears to produce one IO error (judging from messages in the log), upon which it is promptly removed from the array. I can then hotremove the failing drive, then hotadd it - and resync starts, and quite often completes. This scenario is consistently repeatable.

During the initial resync phase, the data blocks are read and the parity blocks are written -- across all drives. During a rebuild-after-failure, data and parity are read from the "good" drives and data-or-parity is written to the spare drive. This could lead to different patterns of concurrent access. In particular, during the resync that you say often completes, the questionable drive is only being written to. During the resync that usually fails, the questionable drive is often being read concurrently with other drives.

So, it would seem that this one drive has a hardware problem. So I ran badblocks with write-check on it, couple of times - came out 100% clean. I then built a RAID0 array instead - and started driving lots of IO on it - it's still running - not a problem. Filled up the array, still no probs. So, except when the drive is in a RAID5 config, it seems ok.

Well, raid5 would do about 30% more IO when writing. It certainly sounds odd, but it could be some combinatorial thing..

Any suggestions ? I would like to confirm whether or whether not the drive has a problem.

Try re-arranging the drives on the scsi chain. If the questionable one is currently furthest from the host-adapter, make it closest. See if that has any effect. It could well be cabling, or terminators or something. Or it could be the drive.

NeilBrown

regards, Per Jessen
Re: Problem migrating RAID1 from 2.2.x to 2.4.2
On Monday March 19, [EMAIL PROTECTED] wrote: I'm having trouble running a RAID1 root/boot mirror under 2.4.2. Works fine on 2.2.14 though. I'm running RH 6.2 with stock 2.2.14 kernel. Running RAID1 on a pair of 9.1 UW SCSI Barracudas as root/boot/lilo. md0 is / and md1 is 256M swap, also a 2 drive mirror. I built the RAID1 at install time using the Redhat GUI. This configuration works flawlessly. However, I've recently compiled the 2.4.2 kernel, with no module support; RAID1 static. When 2.4.2 boots, I get a "Kernel panic: VFS: Unable to mount root fs on 09:00". Here's the RAID driver output when booting 2.4.2: autodetecting RAID arrays (read) sda5's sb offset: 8618752 [events: 0022] (read) sdb5's sb offset: 8618752 [events: 0022] autorun ... considering sdb5 ... adding sdb5 ... adding sda5 ... created md0 bind<sda5,1> bind<sdb5,2> running: <sdb5><sda5> now! sdb5's event counter: 0022 sda5's event counter: 0022 do_md_run() returned -22 md0 stopped. Note: This RAID1 mirror works great under 2.2.14. Under 2.4.2 I get the "returned -22" - What does this mean?

-22 == EINVAL It looks very much like raid1 *isn't* compiled into your kernel. Can you show us your .config file? Also /proc/mdstat when booted under 2.2 might help.

NeilBrown
Re: Proposed RAID5 design changes.
(I've taken Alan and Linus off the Cc list. I'm sure they have plenty to read, and may even be on linux-raid anyway). On Thursday March 15, [EMAIL PROTECTED] wrote: I'm not too happy with the linux RAID5 implementation. In my opinion, a number of changes need to be made, but I'm not sure how to make them or get them accepted into the official distribution if I did make the changes.

I've been doing a fair bit of development with RAID5 lately and Linus seems happy to accept patches from me, and I am happy to work with you (or anyone else) to make improvements and then submit them to Linus.

There was a paper in the 2000 USENIX Annual Technical Conference titled "Towards Availability Benchmarks: A Case Study of Software RAID Systems" by Aaron Brown and David A. Patterson of UCB. They built a neat rig for testing fault tolerance and fault handling in raid systems and compared Linux, Solaris, and WinNT. Their particular comment about Linux was that it seemed to evict drives on any excuse, just as you observe. Apparently the other systems tried much harder to keep drives in the working set if possible. It is certainly worth a read if you are interested in this.

My feeling about retrying after failed IO is that it should be done at a lower level. Once the SCSI or IDE level tells us that there is a READ error, or a WRITE error, we should believe them. Now it appears that this isn't true: at least not for all drivers. So while I would not be strongly against putting that sort of re-try logic at the RAID level, I think it would be worth the effort to find out why it isn't being done at a lower level.

As for re-writing after a failed read, that certainly makes sense, and probably wouldn't be too hard. You would introduce into the "struct stripe_head" a way to mark a drive as "read-failed". Then on a read error, you mark that drive as read-failed in that stripe only and schedule a retry.
If the retry succeeds, you then schedule a write, and if that works, you just continue on happily. You would need to make sure that you aren't too generous: once you have had some number of read errors on a given drive you really should fail that drive anyway. 3) Drives should not be kicked out of the array unless they are having really persistent problems. I've an idea on how to define 'really persistent' but it requires a bit of math to explain, so I'll only go into it if someone is interested. I'd certainly be interested in reading your math. Then there are two changes that might improve recovery performance: 4) If the drive being kicked out is not totally inoperable and there is a spare drive to replace it, try to copy the data from the failing drive to the spare rather than reconstructing the data from all the other disks. Fall back to full reconstruction if the error rate gets too high. That would actually be fairly easy to do. Once you get the data structures right so that the concept of a "partially failed" drive can be clearly represented, it should be a cinch. 5) When doing (4) use the SCSI 'copy' command if the drives are on the same bus, and the host adapter and driver supports 'copy'. However, this should be done with caution. 'copy' is not generally used and any number of undetected firmware bugs might make it unreliable. An additional category may need to be added to the device black list to flag devices that can not do 'copy' reliably. I've very against this sort of idea. Currently the raid code is blissfully unaware of the underlying technology: it could be scsi, ide, ramdisc, netdisk or anything else and RAID just doesn't care. This I believe is one of the strengths of software RAID - it is cleanly layered. Firmware (==hardware) raid controllers often try to "know" a lot about the underlying drive - even to the extent of getting the drives to do the XOR themselves I believe. 
This undoubtedly makes the code more complex, and can lead to real problems if you have firmware-mismatches (and we have had a few of those). Stick with "read" and "write" I think. Everybody understands what they mean so it is much more likely to work. And really, our rebuild performance isn't that bad. The other interesting result for Linux in that paper is that rebuild made almost no impact on performance with Linux, while it did for Solaris and NT (but Linux did rebuild much more slowly). So if you want to do this, dive right in and have a go. I am certainly happy to answer any questions, review any code, and forward anything that looks good to Linus.

NeilBrown
Re: RaidHotAdd and reconstruction
On Sunday March 4, [EMAIL PROTECTED] wrote: Hi folks, I have a two-disk RAID 1 test array that I was playing with. I then decided to hot add a third disk using ``raidhotadd''. The disk was added to the array, but as far as I could see, the RAID software did not start a reconstruction of that newly added disk. I skimmed through the driver code a bit, and could not really locate the point where the reconstruction was initiated. Am I missing something?

The third disk that you added became a hot spare. You cannot add an extra active drive to a RAID array without using mkraid. In your case, you could edit /etc/raidtab to list the third drive as a "failed-disk" instead of a "raid-disk", and set the "nr-raid-disks" to 3. Then run mkraid. It shouldn't destroy any data, but the raid system should automatically start building data onto the new drive.

NeilBrown
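A sketch of the raidtab Neil describes, with the new third drive listed as failed so that mkraid leaves the existing mirror's data alone (device names are hypothetical):

```
raiddev /dev/md0
    raid-level              1
    nr-raid-disks           3
    persistent-superblock   1
    device                  /dev/sda1
    raid-disk               0
    device                  /dev/sdb1
    raid-disk               1
    device                  /dev/sdc1
    failed-disk             2
```

If reconstruction onto the third disk does not begin on its own after mkraid, raidhotadd can be used to bring it into the array and trigger the rebuild.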
Re: [lvm-devel] Re: partitions for RAID volumes?
On Monday February 26, [EMAIL PROTECTED] wrote: Actually, the LVM metadata is somewhat poorly laid out in this respect. The metadata is at the start of the device, and on occasion is not even sector aligned, AFAICS. Also, the PE structures, while large powers of 2 in size, are not guaranteed to be other than sector aligned. They are aligned with the END of the partition/device, and not the start.

I think under Linux, partitions are at least 1k multiples in size so the PEs will at least be 1k aligned... MD/RAID volumes are always a multiple of 64k. The amount of a single device will be rounded down using MD_NEW_SIZE_BLOCKS, defined as:

#define MD_RESERVED_BYTES (64 * 1024)
#define MD_RESERVED_SECTORS (MD_RESERVED_BYTES / 512)
#define MD_RESERVED_BLOCKS (MD_RESERVED_BYTES / BLOCK_SIZE)
#define MD_NEW_SIZE_SECTORS(x) ((x & ~(MD_RESERVED_SECTORS - 1)) - MD_RESERVED_SECTORS)
#define MD_NEW_SIZE_BLOCKS(x) ((x & ~(MD_RESERVED_BLOCKS - 1)) - MD_RESERVED_BLOCKS)

And the whole array will be the sum of 1 or more of these sizes. So if each PE is indeed sized "from 8KB to 512MB in powers of 2 and unit KB", then all accesses should be properly aligned, so you shouldn't have any problems (and if you apply the patch and get no errors, then you will be doubly sure).

I thought a bit more about the consequences of unaligned accesses and I think it is most significant when rebuilding parity. RAID5 assumes that two different stripes with different sector addresses will not overlap. If all accesses are properly aligned, then this will be true. Also if all accesses are misaligned by the same amount (e.g. 4K accesses at (4n+1)K offsets) then everything should work well too. However, raid5 resync always aligns accesses so if the current stripe-cache size were 4K, all sync accesses would be at (4n)K offsets. If there were (4n+1)K accesses happening at the same time, they would not synchronise with the resync accesses and you could get corruption. But it sounds like LVM is safe from this problem.
NeilBrown
Re: Going from 'old' (kernel v2.2.x) to 'new' (kernel v2.4.x) raidsystem
On February 26, [EMAIL PROTECTED] wrote: I'm currently running a standard v2.2.17 kernel w/ the 'accompanying' raid system (linear). Having the following /etc/mdtab file: /dev/md0 linear,4k,0,75f3bcd8 /dev/sdb1 /dev/sdc1 /dev/sdd10 /dev/sde1 /dev/sdf1 /dev/sdg1 And converted this to a newer /etc/raidtab: raiddev /dev/md0 raid-level linear nr-raid-disks 6 persistent-superblock 1

"old" style linear arrays don't have a superblock, so this should read: persistent-superblock 0 Given this, you should be able to run mkraid with complete safety as it doesn't actually write anything to any drive. You might be more comfortable running "raid0run" instead of "mkraid". It is the same program, but when called as raid0run, it checks that your configuration matches an old style raid0/linear setup.

device /dev/sdb1 raid-disk 0 device /dev/sdc1 raid-disk 1 device /dev/sdd10 raid-disk 2 device /dev/sde1 raid-disk 3 device /dev/sdf1 raid-disk 4 device /dev/sdg1 raid-disk 5 And this is what cfdisk tells me about the partitions: sdb1 Primary Linux raid autodetect 6310.74 sdc1 Primary Linux raid autodetect 1059.07 sdd10 Logical Linux raid autodetect 2549.84 sde1 Primary Linux raid autodetect 9138.29 sdf1 Primary Linux raid autodetect 18350.60 sdg1 Primary Linux raid autodetect 16697.32

Autodetect cannot work with old-style arrays that don't have superblocks. If you want autodetect, you will need to copy the data somewhere, create a new array, and copy it all back.

NeilBrown

But when I start the new kernel, it won't start the raidsystem... I tried the 'mkraid --upgrade' command, but that says something about no superblock stuff... Can't remember exactly what it says, but... And I'm too afraid to just execute 'mkraid' to create the array. I have over 50Gb of data that I can't backup somewhere... What can I do to keep the old data, but convert the array to the new raid system?

-- Turbo Fredriksson [EMAIL PROTECTED] Debian Certified Linux Developer, Stockholm/Sweden
Re: Urgent Problem: moving a raid
On Sunday February 25, [EMAIL PROTECTED] wrote:

Neil Brown wrote: OK, this time I really want to know how this should be handled.

Well, it "should" be handled by re-writing various bits of raid code to make it all work more easily, but without doing that it "could" be handled by marking the partitions as holding raid components (0xFD I think) and then booting a kernel with AUTODETECT_RAID enabled. This approach ignores the device info in the superblock and finds everything properly.

I do not use partitions; the whole /dev/hdb and /dev/hdd are the RAID drives (mainly because fdisk was unhappy handling the 60GB drives). Is there a way to do the above marking in this situation? How?

No, without partitions, that idea falls through. With 2.4, you could boot with md=1,/dev/whatever,/dev/otherone and it should build the array from the named drives. There are ioctls available to do the same thing from user space, but no user-level code has been released to use them yet. The following patch, when applied to raidtools-0.90, should make raidstart do the right thing, but it is a while since I wrote this code and I only did minimal testing.

NeilBrown

--- ./raidlib.c	2000/05/19 03:42:47	1.1
+++ ./raidlib.c	2000/05/19 06:53:04
@@ -149,6 +149,24 @@
 	return 0;
 }
 
+static int do_newstart (int fd, md_cfg_entry_t *cfg)
+{
+	int i;
+	if (ioctl (fd, SET_ARRAY_INFO, 0UL) == -1)
+		return -1;
+	/* Ok, we have a new enough kernel (2.3.99pre9?) */
+	for (i = 0; i < cfg->array.param.nr_disks; i++) {
+		struct stat s;
+		md_disk_info_t info;
+		stat(cfg->device_name[i], &s);
+		memset(&info, 0, sizeof(info));
+		info.major = major(s.st_rdev);
+		info.minor = minor(s.st_rdev);
+		ioctl(fd, ADD_NEW_DISK, (unsigned long)&info);
+	}
+	return (ioctl(fd, RUN_ARRAY, 0UL) != 0);
+}
+
 int do_raidstart_rw (int fd, char *dev)
 {
 	int func = RESTART_ARRAY_RW;
@@ -380,10 +398,12 @@
 	{
 		struct stat s;
 
-		stat (cfg->device_name[0], &s);
 		fd = open_or_die(cfg->md_name);
-		if (do_mdstart (fd, cfg->md_name, s.st_rdev)) rc++;
+		if (do_newstart (fd, cfg)) {
+			stat (cfg->device_name[0], &s);
+
+			if (do_mdstart (fd, cfg->md_name, s.st_rdev)) rc++;
+		}
 		break;
 	}
Re: MD reverting to old Raid type
On Sunday February 25, [EMAIL PROTECTED] wrote:

Linux 2.4.1/RAIDtools2 0.90. I have 4 IDE disks which have identical partition layouts. RAID is working successfully; it's even booting RAID1. I created a RAID5 set on a set of 4 partitions, which works OK. I then destroyed that set and updated it so it was 2x RAID0 partitions (so I can mirror them into a RAID10). The problem is when I raidstop, then raidstart either of the new RAID0 mds, it reverts back to the RAID5 (I originally noticed it when I rebooted). [snip] Any idea?

In the raidtab file where you describe the raid0 arrays, make sure to say:

        persistent-superblock   1

(or whatever the correct syntax is). The default is 0 (== no) for backwards compatibility, I assume, and so your raid5 superblock doesn't get overwritten.

NeilBrown

Cheers, Suad
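For reference, a raid0 stanza with the superblock enabled might look something like the sketch below; the device names and chunk size here are illustrative, not taken from the original report:

```
raiddev /dev/md1
        raid-level              0
        nr-raid-disks           2
        chunk-size              32
        persistent-superblock   1
        device                  /dev/hda5
        raid-disk               0
        device                  /dev/hdc5
        raid-disk               1
```

With persistent-superblock set to 1, mkraid writes a new raid0 superblock over the stale raid5 one, so the array no longer reverts on the next raidstart.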
Re: Newbie questions
On Wednesday February 21, [EMAIL PROTECTED] wrote:

Hello, this is my first time playing with software raid, so sorry if I sound dumb. What I have is a remote device that only has one hard drive. There is no ability for a second. Can I use the raidtools package to set up a raid-1 mirror on two partitions on the same drive? For example, /dev/hda1 and /dev/hda2 make up the raid-1 set, /dev/hda3 is swap, and /dev/hda4 holds the rest of the OS. I know raid is normally used with multiple drives, but is this possible?

Yes, it is possible, but would it help? If your drive fails, then you lose the data anyway. I guess this would protect against bad sectors in one part of the drive, but my experience is that once you get a bad sector or two, your drive isn't long for this world. Also, write speed would be appalling, as the head would be zipping back and forth between the two partitions. However, the best answer is "try it", and see if it does what you want.

P.S. Could you please respond to [EMAIL PROTECTED]; I am not on the list and could not find any info on how to join.

echo help | mail [EMAIL PROTECTED] should get you started.

NeilBrown
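For anyone who does want to try this, a raidtab along these lines (purely illustrative, matching the partition layout the poster described) would describe a raid1 set built from two partitions of the same drive:

```
raiddev /dev/md0
        raid-level              1
        nr-raid-disks           2
        nr-spare-disks          0
        persistent-superblock   1
        device                  /dev/hda1
        raid-disk               0
        device                  /dev/hda2
        raid-disk               1
```

As Neil notes, this protects only against localised bad sectors, not drive failure, and every write costs an extra seek between the two partitions.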
Re: Status of raid.
On Friday February 9, [EMAIL PROTECTED] wrote:

Greetings, I'm getting ready to put kernel 2.4.1 on my server at home. I have some questions about the status of RAID in 2.4.1. Sorry to be dense, but I couldn't glean the answers to these questions from my search of the mailing list.

1. It appears that as of 2.4.1 RAID is finally part of the standard kernel. Is this correct?

Yes, though you will need to wait for 2.4.2 if you want to compile md as a module.

2. Which raidtools package do I use and where can I get it? Or is it, too, enclosed with the kernel?

The same ones you would use with patched 2.2, i.e. 0.90.

3. Does the RAID in 2.4.1 have the read-balancing patch?

Yes, that patch is in.

NeilBrown

-- C. R. (Charles) Oldham | NCA Commission on Accreditation and School Improvement | Director of Technology | [EMAIL PROTECTED] | V:480-965-8703 F:480-965-9423
Re: mkraid problems
On Thursday January 4, [EMAIL PROTECTED] wrote:

OK, I followed the instructions from linuxraid.org and it seems my kernel is now installed, because now it doesn't boot. The error message states that the root fs cannot be mounted on 08:03, then it halts. What did I miss now? Chris

08:03 is /dev/sda3. Is that what you would expect to be booting off - the first scsi disc? Is the scsi controller driver compiled into the kernel properly? Do the boot messages show that the scsi controller was found?

NeilBrown

--- Alvin Oga [EMAIL PROTECTED] wrote:

hi ya chris... did you follow the instructions on www.linuxraid.org ??? the generic raid patch to generic 2.2.18 fails to patch properly.. try to follow the steps at linuxraid.org that was a very good site... have fun alvin http://www.linux-1U.net ... 1U Raid5

On Wed, 3 Jan 2001, Chris Winczewski wrote:

Here is /etc/raidtab and /proc/mdstat. chris

raidtab:

# Sample raid-5 configuration
raiddev /dev/md0
        raid-level              5
        nr-raid-disks           3
        chunk-size              32
        # Parity placement algorithm
        #parity-algorithm       left-asymmetric
        parity-algorithm        left-symmetric
        #parity-algorithm       right-asymmetric
        #parity-algorithm       right-symmetric
        # Spare disks for hot reconstruction
        nr-spare-disks          0
        persistent-superblock   1
        device                  /dev/sdb1
        raid-disk               0
        device                  /dev/sdc1
        raid-disk               1
        device                  /dev/sdd1
        raid-disk               2

mdstat:

Personalities : [4 raid5]
read_ahead not set
md0 : inactive
md1 : inactive
md2 : inactive
md3 : inactive

--- Neil Brown [EMAIL PROTECTED] wrote:

On Wednesday January 3, [EMAIL PROTECTED] wrote: mkraid aborts with no useful error message on screen or in the syslog. My /etc/raidtab is set up correctly and I am using raidtools2 with kernel 2.2.18 with the raid patch installed. Any ideas? Chris

Please include a copy of /etc/raidtab and /proc/mdstat.

NeilBrown
PATCH - raid5 in 2.4.0-test13 - substantial rewrite with substantial performance increase
Linus, here is a rather large patch for raid5 in 2.4.0-test13-pre3. It is a substantial re-write of the stripe-cache handling code, which is the heart of the raid5 module. I have been sitting on this for a while so that others could test it (a few have) and so that I would have quite a lot of experience with it myself. I am now happy that it is ready for inclusion in 2.4.0-testX. I hope you will be too.

What it does:

- Processing of raid5 requests can require several stages (pre-read, calc parity, write). To accommodate this, request handling is based on a simple state machine. Prior to this patch, state was explicitly recorded - there were different phases "READY", "READOLD", "WRITE" etc. With this patch, the state is made implicit in the b_state of the buffers in the cache. This makes the processing code (handle_stripe) more flexible, and it is easier to add requests to a stripe at any stage of processing.

- With the above change, we no longer need to wait for a stripe to "complete" before adding a new request. We at most need to wait for a spinlock to be released. This allows more parallelism and provides throughput speeds many times the current speed.

- Without this patch, two buffers are allocated on each stripe in the cache for each underlying device in the array. This is wasteful. With the patch, only one buffer is needed per stripe, per device.

- This patch creates a couple of linked lists of stripes, one for stripes that are inactive, and one for stripes that need to be processed by raid5d. This obviates the need to search the hash table for the stripes of interest in raid5d or when looking for a free stripe.

There is more work to be done to bring raid5 performance up to the level of 2.2+mingos-patches, but this is a first, large, step on the way.

NeilBrown

(2000 line patch included in mail to Linus, but removed from mail to linux-raid.
If you want it, try http://www.cse.unsw.edu.au/~neilb/patch/linux/2.4.0-test13-pre3/patch-A-raid5 )
PATCH - raid1 next drive selection.
Linus (et al),

The raid1 code has a concept of finding a "next available drive". It uses this for read balancing. Currently, this is implemented via a simple linked list that links the working drives together. However, there is no locking to make sure that the list does not get modified by two processors at the same time, and hence corrupted (though it is changed so rarely that that is unlikely). Also, when choosing a drive to read from for rebuilding a new spare, the "last_used" drive is used, even though that might be a drive which failed recently; i.e. there is no check that the "last_used" drive is actually still valid. I managed to exploit this to get the kernel into a tight spin.

This patch discards the linked list and just walks the array, ignoring failed drives. It also makes sure that "last_used" is always validated before being used.

patch against 2.4.0-test12-pre8

NeilBrown

--- ./include/linux/raid/raid1.h	2000/12/10 22:38:20	1.1
+++ ./include/linux/raid/raid1.h	2000/12/10 22:38:25	1.2
@@ -7,7 +7,6 @@
 	int number;
 	int raid_disk;
 	kdev_t dev;
-	int next;
 	int sect_limit;
 	int head_position;
--- ./drivers/md/raid1.c	2000/12/10 22:36:34	1.2
+++ ./drivers/md/raid1.c	2000/12/10 22:38:25	1.3
@@ -463,16 +463,12 @@
 	if (conf->resync_mirrors)
 		goto rb_out;
 
-	if (conf->working_disks < 2) {
-		int i = 0;
-
-		while( !conf->mirrors[new_disk].operational &&
-				(i < MD_SB_DISKS) ) {
-			new_disk = conf->mirrors[new_disk].next;
-			i++;
-		}
-
-		if (i >= MD_SB_DISKS) {
+
+	/* make sure that disk is operational */
+	while( !conf->mirrors[new_disk].operational) {
+		if (new_disk <= 0) new_disk = conf->raid_disks;
+		new_disk--;
+		if (new_disk == disk) {
 			/*
 			 * This means no working disk was found
 			 * Nothing much to do, lets not change anything
@@ -480,11 +476,13 @@
 			 */
 			new_disk = conf->last_used;
+
+			goto rb_out;
 		}
-
-		goto rb_out;
 	}
-
+	disk = new_disk;
+	/* now disk == new_disk == starting point for search */
+
 	/*
 	 * Don't touch anything for sequential reads.
 	 */
@@ -501,16 +499,16 @@
 	if (conf->sect_count >= conf->mirrors[new_disk].sect_limit) {
 		conf->sect_count = 0;
-
-		while( new_disk != conf->mirrors[new_disk].next ) {
-			if ((conf->mirrors[new_disk].write_only) ||
-			    (!conf->mirrors[new_disk].operational) )
-				continue;
-
-			new_disk = conf->mirrors[new_disk].next;
-			break;
-		}
-
+
+		do {
+			if (new_disk <= 0)
+				new_disk = conf->raid_disks;
+			new_disk--;
+			if (new_disk == disk)
+				break;
+		} while ((conf->mirrors[new_disk].write_only) ||
+			 (!conf->mirrors[new_disk].operational));
+
 		goto rb_out;
 	}
@@ -519,8 +517,10 @@
 	/* Find the disk which is closest */
-	while( conf->mirrors[disk].next != conf->last_used ) {
-		disk = conf->mirrors[disk].next;
+	do {
+		if (disk <= 0)
+			disk = conf->raid_disks;
+		disk--;
 
 		if ((conf->mirrors[disk].write_only) ||
 		    (!conf->mirrors[disk].operational))
@@ -534,7 +534,7 @@
 			current_distance = new_distance;
 			new_disk = disk;
 		}
-	}
+	} while (disk != conf->last_used);
 
  rb_out:
 	conf->mirrors[new_disk].head_position = this_sector + sectors;
@@ -702,16 +702,6 @@
 	return sz;
 }
 
-static void unlink_disk (raid1_conf_t *conf, int target)
-{
-	int disks = MD_SB_DISKS;
-	int i;
-
-	for (i = 0; i < disks; i++)
-		if (conf->mirrors[i].next == target)
-			conf->mirrors[i].next = conf->mirrors[target].next;
-}
-
 #define LAST_DISK KERN_ALERT \
 "raid1: only one disk left and IO error.\n"
@@ -735,7 +725,6 @@
 	mdp_super_t *sb = mddev->sb;
 
 	mirror->operational = 0;
-	unlink_disk(conf, failed);
 	mark_disk_faulty(sb->disks+mirror->number);
 	mark_disk_nonsync(sb->disks+mirror->number);
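The core of the patch above is the backwards walk with wrap-around that replaces the old 'next' linked list. A stand-alone sketch of that loop (hypothetical names, user-space C, not the kernel code verbatim):

```c
#include <assert.h>

/* Hypothetical stand-in for the mirror state in raid1_conf_t:
 * operational[i] is non-zero if disk i is working. */

/* Walk backwards (with wrap-around) from 'disk' until an operational
 * disk is found, mirroring the patch's replacement for the linked list.
 * Returns the starting disk itself if no other disk qualifies, which
 * is the "no working disk found" case handled under rb_out. */
static int next_operational(const int *operational, int raid_disks, int disk)
{
    int new_disk = disk;
    do {
        if (new_disk <= 0)
            new_disk = raid_disks;   /* wrap past the low end */
        new_disk--;
        if (new_disk == disk)
            break;                   /* walked the whole array */
    } while (!operational[new_disk]);
    return new_disk;
}
```

Because the walk is bounded by "back to where we started", it needs no linked list and therefore no list locking, which is the race the patch removes.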
PATCH - md device reference counting
Linus (et al),

An md device needs to know if it is in use so that it doesn't allow raidstop while still mounted. Previously it did this by looking for a superblock on the device. This is a bit inelegant and doesn't generalise. With this patch, it tracks opens and closes (get and release) and does not allow raidstop while there is any active access. This leaves open the possibility of syncing out the superblocks on the last close, which might happen in a later patch.

One interesting gotcha in this patch is that the START_ARRAY ioctl (used by raidstart) can potentially start a completely different array, as it decides which array to start based on a value in the raid superblock. To get the reference counts right, I needed to tell the code which array I think I am starting. If it actually starts that one, it sets the initial reference count to 1; otherwise it sets it to 0.

patch against 2.4.0-test12-pre8

NeilBrown

--- ./include/linux/raid/md_k.h	2000/12/10 22:54:16	1.1
+++ ./include/linux/raid/md_k.h	2000/12/10 23:21:26	1.2
@@ -206,6 +206,7 @@
 	struct semaphore		reconfig_sem;
 	struct semaphore		recovery_sem;
 	struct semaphore		resync_sem;
+	atomic_t			active;
 
 	atomic_t			recovery_active;	/* blocks scheduled, but not written */
 	md_wait_queue_head_t		recovery_wait;
--- ./drivers/md/md.c	2000/12/10 22:37:18	1.3
+++ ./drivers/md/md.c	2000/12/10 23:21:26	1.4
@@ -203,6 +203,7 @@
 	init_MUTEX(&mddev->resync_sem);
 	MD_INIT_LIST_HEAD(&mddev->disks);
 	MD_INIT_LIST_HEAD(&mddev->all_mddevs);
+	atomic_set(&mddev->active, 0);
 
 	/*
 	 * The 'base' mddev is the one with data NULL.
@@ -1718,12 +1719,20 @@
 #define STILL_MOUNTED KERN_WARNING \
 "md: md%d still mounted.\n"
 
+#define	STILL_IN_USE \
+"md: md%d still in use.\n"
+
 static int do_md_stop (mddev_t * mddev, int ro)
 {
 	int err = 0, resync_interrupted = 0;
 	kdev_t dev = mddev_to_kdev(mddev);
 
+	if (atomic_read(&mddev->active) > 1) {
+		printk(STILL_IN_USE, mdidx(mddev));
+		OUT(-EBUSY);
+	}
+
+	/* this shouldn't be needed as above would have fired */
 	if (!ro && get_super(dev)) {
 		printk (STILL_MOUNTED, mdidx(mddev));
 		OUT(-EBUSY);
@@ -1859,8 +1868,10 @@
  * the 'same_array' list. Then order this list based on superblock
  * update time (freshest comes first), kick out 'old' disks and
  * compare superblocks. If everything's fine then run it.
+ *
+ * If "unit" is allocated, then bump its reference count
 */
-static void autorun_devices (void)
+static void autorun_devices (kdev_t countdev)
 {
 	struct md_list_head candidates;
 	struct md_list_head *tmp;
@@ -1902,6 +1913,12 @@
 			continue;
 		}
 		mddev = alloc_mddev(md_kdev);
+		if (mddev == NULL) {
+			printk("md: cannot allocate memory for md drive.\n");
+			break;
+		}
+		if (md_kdev == countdev)
+			atomic_inc(&mddev->active);
 		printk("created md%d\n", mdidx(mddev));
 		ITERATE_RDEV_GENERIC(candidates,pending,rdev,tmp) {
 			bind_rdev_to_array(rdev, mddev);
@@ -1945,7 +1962,7 @@
 #define AUTORUNNING KERN_INFO \
 "md: auto-running md%d.\n"
 
-static int autostart_array (kdev_t startdev)
+static int autostart_array (kdev_t startdev, kdev_t countdev)
 {
 	int err = -EINVAL, i;
 	mdp_super_t *sb = NULL;
@@ -2002,7 +2019,7 @@
 	/*
 	 * possibly return codes
 	 */
-	autorun_devices();
+	autorun_devices(countdev);
 	return 0;
 
 abort:
@@ -2077,7 +2094,7 @@
 		md_list_add(&rdev->pending, &pending_raid_disks);
 	}
 
-	autorun_devices();
+	autorun_devices(-1);
 }
 
 dev_cnt = -1; /* make sure further calls to md_autodetect_dev are ignored */
@@ -2607,6 +2624,8 @@
 			err = -ENOMEM;
 			goto abort;
 		}
+		atomic_inc(&mddev->active);
+
 		/*
 		 * alloc_mddev() should possibly self-lock.
 		 */
@@ -2640,7 +2659,7 @@
 		/*
 		 * possibly make it lock the array ...
 		 */
-		err = autostart_array((kdev_t)arg);
+		err = autostart_array((kdev_t)arg, dev);
 		if (err) {
 			printk("autostart %s failed!\n",
 				partition_name((kdev_t)arg));
@@ -2820,14 +2839,26 @@
 static int md_open (struct inode *inode, struct file *file)
 {
 	/*
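The reference-counting rule in the patch above (refuse to stop while anyone other than the stopper holds the device open) can be sketched in miniature like this; the names are illustrative, not the kernel's:

```c
#include <assert.h>

/* Toy user-space model of the md 'active' count added by the patch.
 * The count is bumped on every open and dropped on every release. */
struct toy_mddev { int active; };

static void md_open(struct toy_mddev *m)    { m->active++; }
static void md_release(struct toy_mddev *m) { m->active--; }

/* Returns 0 on success, -1 (think EBUSY) if the array is still in use.
 * The process issuing the stop ioctl holds one reference itself,
 * hence the "> 1" test, matching the patch. */
static int do_md_stop(struct toy_mddev *m)
{
    if (m->active > 1)
        return -1;  /* still in use by someone else */
    return 0;
}
```

The point of tracking opens directly is that it catches every kind of user (mounts, raw readers, nested md arrays), whereas the old get_super() probe only noticed mounted filesystems.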
PATCH - md - lock component devices with blkdev_get/blkdev_put
Linus (et al),

The raid code wants to be the sole accessor of any devices that are combined into the array, i.e. it wants to lock those devices against other use. It currently tries to do this by creating an inode that appears to be associated with that device. This no longer has any useful effect (and I don't think it has for a while, though I haven't dug into the history). I have changed the lock_rdev code to simply do a blkdev_get when attaching the device, and a blkdev_put when releasing it. This at least makes sure that if the device is in a module it won't be unloaded. Any further exclusive access control will need to go into the blkdev_{get,put} routines at some stage, I think.

patch against 2.4.0-test12-pre8

NeilBrown

--- ./include/linux/raid/md_k.h	2000/12/10 23:21:26	1.2
+++ ./include/linux/raid/md_k.h	2000/12/10 23:33:07	1.3
@@ -165,8 +165,7 @@
 	mddev_t *mddev;			/* RAID array if running */
 	unsigned long last_events;	/* IO event timestamp */
 
-	struct inode *inode;		/* Lock inode */
-	struct file filp;		/* Lock file */
+	struct block_device *bdev;	/* block device handle */
 
 	mdp_super_t *sb;
 	unsigned long sb_offset;
--- ./drivers/md/md.c	2000/12/10 23:21:26	1.4
+++ ./drivers/md/md.c	2000/12/10 23:33:08	1.5
@@ -657,32 +657,25 @@
 static int lock_rdev (mdk_rdev_t *rdev)
 {
 	int err = 0;
+	struct block_device *bdev;
 
-	/*
-	 * First insert a dummy inode.
-	 */
-	if (rdev->inode)
-		MD_BUG();
-	rdev->inode = get_empty_inode();
-	if (!rdev->inode)
+	bdev = bdget(rdev->dev);
+	if (bdev == NULL)
 		return -ENOMEM;
-	/*
-	 * we dont care about any other fields
-	 */
-	rdev->inode->i_dev = rdev->inode->i_rdev = rdev->dev;
-	insert_inode_hash(rdev->inode);
-
-	memset(&rdev->filp, 0, sizeof(rdev->filp));
-	rdev->filp.f_mode = 3; /* read write */
+	err = blkdev_get(bdev, FMODE_READ|FMODE_WRITE, 0, BDEV_FILE);
+	if (!err) {
+		rdev->bdev = bdev;
+	}
 	return err;
 }
 
 static void unlock_rdev (mdk_rdev_t *rdev)
 {
-	if (!rdev->inode)
+	if (!rdev->bdev)
 		MD_BUG();
-	iput(rdev->inode);
-	rdev->inode = NULL;
+	blkdev_put(rdev->bdev, BDEV_FILE);
+	bdput(rdev->bdev);
+	rdev->bdev = NULL;
 }
 
 static void export_rdev (mdk_rdev_t * rdev)
@@ -1150,7 +1143,7 @@
 
 abort_free:
 	if (rdev->sb) {
-		if (rdev->inode)
+		if (rdev->bdev)
 			unlock_rdev(rdev);
 		free_disk_sb(rdev);
 	}
Re: [OOPS] raidsetfaulty - raidhotremove - raidhotadd
On Wednesday December 6, [EMAIL PROTECTED] wrote:

Neil Brown wrote: Could you try this patch and see how it goes?

Same result!

Ok... must be something else... I tried again to reproduce it, and this time I succeeded.

The problem happens when you try to access the last 128k of a raid1 array that has been reconstructed since the last reboot. The reconstruction creates a sliding 3-part window which is 3*128k wide. The leading pane ("pending") may have some outstanding I/O requests, but no new requests will be added. The middle pane ("ready") has no outstanding I/O requests and gets no new I/O requests, but does get new rebuild requests. The trailing pane ("active") has outstanding rebuild requests, but no new I/O requests will be added. This window slides forward through the address space, keeping I/O and reconstruction quite separate.

However, the reconstruction process finishes with "ready" still covering the tail end of the address space. "active" has fallen off the end, and "pending" has collapsed down to an empty pane, but "ready" is still there. When rebuilding after an unclean shutdown, this gets cleaned up properly, but when rebuilding onto a spare, it doesn't.

The attached patch, which can also be found at: http://www.cse.unsw.edu.au/~neilb/patches/linux/2.4.0-test12-pre6/patch-G-raid1rebuild fixed the problem. It should apply to any recent 2.4.0-test kernel. Please try it and confirm that it works.

Thanks, NeilBrown

--- ./drivers/md/raid1.c	2000/12/06 22:34:27	1.3
+++ ./drivers/md/raid1.c	2000/12/06 22:37:04	1.4
@@ -798,6 +798,32 @@
 	}
 }
 
+static void close_sync(raid1_conf_t *conf)
+{
+	mddev_t *mddev = conf->mddev;
+	/* If reconstruction was interrupted, we need to close the "active" and "pending"
+	 * holes.
+	 * we know that there are no active rebuild requests, so cnt_active == cnt_ready == 0
+	 */
+	/* this is really needed when recovery stops too... */
+	spin_lock_irq(&conf->segment_lock);
+	conf->start_active = conf->start_pending;
+	conf->start_ready = conf->start_pending;
+	wait_event_lock_irq(conf->wait_ready, !conf->cnt_pending, conf->segment_lock);
+	conf->start_active = conf->start_ready = conf->start_pending = conf->start_future;
+	conf->start_future = mddev->sb->size+1;
+	conf->cnt_pending = conf->cnt_future;
+	conf->cnt_future = 0;
+	conf->phase = conf->phase ^1;
+	wait_event_lock_irq(conf->wait_ready, !conf->cnt_pending, conf->segment_lock);
+	conf->start_active = conf->start_ready = conf->start_pending = conf->start_future = 0;
+	conf->phase = 0;
+	conf->cnt_future = conf->cnt_done;
+	conf->cnt_done = 0;
+	spin_unlock_irq(&conf->segment_lock);
+	wake_up(&conf->wait_done);
+}
+
 static int raid1_diskop(mddev_t *mddev, mdp_disk_t **d, int state)
 {
 	int err = 0;
@@ -910,6 +936,7 @@
 	 * Deactivate a spare disk:
 	 */
 	case DISKOP_SPARE_INACTIVE:
+		close_sync(conf);
 		sdisk = conf->mirrors + spare_disk;
 		sdisk->operational = 0;
 		sdisk->write_only = 0;
@@ -922,7 +949,7 @@
 	 * property)
 	 */
 	case DISKOP_SPARE_ACTIVE:
-
+		close_sync(conf);
 		sdisk = conf->mirrors + spare_disk;
 		fdisk = conf->mirrors + failed_disk;
@@ -1213,27 +1240,7 @@
 		conf->resync_mirrors = 0;
 	}
 
-	/* If reconstruction was interrupted, we need to close the "active" and "pending"
-	 * holes.
-	 * we know that there are no active rebuild requests, so cnt_active == cnt_ready == 0
-	 */
-	/* this is really needed when recovery stops too... */
-	spin_lock_irq(&conf->segment_lock);
-	conf->start_active = conf->start_pending;
-	conf->start_ready = conf->start_pending;
-	wait_event_lock_irq(conf->wait_ready, !conf->cnt_pending, conf->segment_lock);
-	conf->start_active = conf->start_ready = conf->start_pending = conf->start_future;
-	conf->start_future = mddev->sb->size+1;
-	conf->cnt_pending = conf->cnt_future;
-	conf->cnt_future = 0;
-	conf->phase = conf->phase ^1;
-	wait_event_lock_irq(conf->wait_ready, !conf->cnt_pending, conf->segment_lock);
-	conf->start_active = conf->start_ready = conf->start_pending = conf->start_future = 0;
-	conf->phase = 0;
-	conf->cnt_future = conf->cnt_done;
-	conf->cnt_done = 0;
-	spin_unlock_irq(&conf->segment_lock);
-	wake_up(&conf->wait_done);
+	close_sync(conf);
 
 	up(mddev->recovery_sem);
 	raid1_shrink_buffers(conf);
Re: Ex2FS unable to read superblock
On Sunday December 3, [EMAIL PROTECTED] wrote:

I'm new to the raid under linux world, and had a question. Sorry if several posts have been made by me previously; I had some trouble subscribing to the list... I successfully installed redhat 6.2 with raid 0 for two drives on a sun ultra 1. However, I'm trying to rebuild the kernel, and thought I'd play with 2.4test11 since it has the raid code built in, but to no avail. While it will auto-detect the raid drives fine from what I can tell, the kernel always panics with "EX2FS Unable To Read Superblock". Any thoughts on what I might be doing wrong that is causing this error? Sorry if this has been brought up before. dave

Details might help, e.g. a copy of the boot logs when booting under 2.2.whatever and it working, and similar logs with it booting under 2.4.0test11 and it not working, though they may not be as easy to get if your root filesystem doesn't come up. There were issues in the 2.2 raid code on Sparc hardware which may have been resolved by redhat, and may have been resolved differently in 2.4. Look for a line like:

md.c: sizeof(mdp_super_t) = 4096

Is the number 4096 in both cases? If not, then that is probably the problem. It is quite possible that raid in 2.4 on sparc is not upwards compatible with raid in redhat6.2 on sparc.

NeilBrown
Re: autodetect question
On Friday December 1, [EMAIL PROTECTED] wrote:

If I have all of MD as a module and autodetect raid enabled, do the MD drives that the machine has get detected and set up 1) at boot, 2) at module load, or 3) not at all?

3. It doesn't. Rationale: by the time you are loading a module, you have enough user space running to do the auto-detect stuff in user space. The simple fact that no-one has written autodetect code for user space yet is not a kernel issue. I will when I need it, unless someone beats me to it.

This is another question. Is it possible to change the code so that autodetect works when the whole disk is part of the raid instead of a partition under it? (i.e. check drives that the kernel couldn't find a partition table on)

Yes you could, but I don't think that you should. If you want to boot off a raid array of whole drives, then enable CONFIG_MD_BOOT and boot with md=0,/dev/hda,/dev/hdb or similar. If you want this for a non-boot drive, configure it from user space.

NeilBrown

coming soon: partitioning for md devices: md=0,/dev/hda,/dev/hdb root=/dev/md0p1 swap=/dev/md0p2
Re: FW: how to upgrade raid disks from redhat 4.1 to redhat 6.2
On Friday December 1, [EMAIL PROTECTED] wrote:

-----Original Message----- From: Carly Yang [mailto:[EMAIL PROTECTED]] Sent: Friday, December 01, 2000 2:42 PM To: [EMAIL PROTECTED] Subject: how to upgrade raid disks from redhat 4.1 to redhat 6.2

Dear Gregory,

I have a system which runs on redhat 4.1 with two scsi hard disks making a RAID0 partition. I added these commands in /etc/rc.d/rc.local:

/sbin/mdadd /dev/md0 /dev/sda1 /dev/sdb1
/sbin/mdrun -p0 /dev/md0
e2fsck -y /dev/md0
mount -t ext2 /dev/md0 /home

So I can access the raid disk. Recently I upgraded the system to redhat 6.2, and made the raidtab in /etc/ as follows:

raiddev /dev/md0
        raid-level              0
        nr-raid-disks           2
        persistent-superblock   0
        chunk-size              8
        device                  /dev/sda1
        raid-disk               0
        device                  /dev/sdb1
        raid-disk               1

I ran "mkraid --upgrade /dev/md0" to upgrade the raid partition to the new system. But it reports an error:

Cannot upgrade magic-less superblock on /dev/sda1 ...

I think you want raid0run. Check the man page and see if it works for you.

NeilBrown

mkraid: aborted, see syslog and /proc/mdstat for potential clues.

Running "cat /proc/mdstat" gives:

personalities :
read_ahead not set
unused devices: none

I ran "mkraid" in mandrake 7.1 and got the same result. I don't know how to make a raid partition upgrade now. Could you tell me how to do that? I read your Linux-RAID-FAQ; I think you can give me some good advice.

Yours Sincerely, Carl
PATCH - md - initialisation cleanup
Linus, here is a patch for test12 which cleans up the initialisation of raid personalities. I didn't include it in the previous raid init cleanup because I hadn't figured out the inner mysteries of link order completely. The linux-kbuild list helped there.

This patch arranges that each personality auto-initialises itself, and makes sure that all personalities are initialised before md.o gets initialised. An earlier form of this (which didn't get the Makefile quite right) went into test11ac??.

NeilBrown

--- ./drivers/md/Makefile	2000/11/29 05:46:23	1.3
+++ ./drivers/md/Makefile	2000/11/29 06:04:25	1.4
@@ -16,10 +16,13 @@
 obj-n	:=
 obj-	:=
 
-# NOTE: xor.o must link *before* md.o so that auto-detect
-# of raid5 arrays works (and doesn't Oops). Fortunately
-# they are both export-objs, so setting the order here
-# works.
+# Note: link order is important.  All raid personalities
+# and xor.o must come before md.o, as they each initialise
+# themselves, and md.o may use the personalities when it
+# auto-initialised.
+# The use of MIX_OBJS allows link order to be maintained even
+# though some are export-objs and some aren't.
+
 obj-$(CONFIG_MD_LINEAR)	+= linear.o
 obj-$(CONFIG_MD_RAID0)	+= raid0.o
 obj-$(CONFIG_MD_RAID1)	+= raid1.o
@@ -28,10 +31,11 @@
 obj-$(CONFIG_BLK_DEV_LVM)	+= lvm-mod.o
 
 # Translate to Rules.make lists.
-O_OBJS	:= $(filter-out $(export-objs), $(obj-y))
-OX_OBJS	:= $(filter $(export-objs), $(obj-y))
-M_OBJS	:= $(sort $(filter-out $(export-objs), $(obj-m)))
-MX_OBJS	:= $(sort $(filter $(export-objs), $(obj-m)))
+active-objs	:= $(sort $(obj-y) $(obj-m))
+
+O_OBJS	:= $(obj-y)
+M_OBJS	:= $(obj-m)
+MIX_OBJS	:= $(filter $(export-objs), $(active-objs))
 
 include $(TOPDIR)/Rules.make
--- ./drivers/md/md.c	2000/11/29 04:55:47	1.4
+++ ./drivers/md/md.c	2000/11/29 06:04:25	1.5
@@ -3576,12 +3576,6 @@
 	create_proc_read_entry("mdstat", 0, NULL, md_status_read_proc, NULL);
 #endif
 }
-void hsm_init (void);
-void translucent_init (void);
-void linear_init (void);
-void raid0_init (void);
-void raid1_init (void);
-void raid5_init (void);
 
 int md__init md_init (void)
 {
@@ -3617,18 +3611,6 @@
 	md_register_reboot_notifier(&md_notifier);
 	raid_table_header = register_sysctl_table(raid_root_table, 1);
 
-#ifdef CONFIG_MD_LINEAR
-	linear_init ();
-#endif
-#ifdef CONFIG_MD_RAID0
-	raid0_init ();
-#endif
-#ifdef CONFIG_MD_RAID1
-	raid1_init ();
-#endif
-#ifdef CONFIG_MD_RAID5
-	raid5_init ();
-#endif
 	md_geninit();
 	return (0);
 }
--- ./drivers/md/raid5.c	2000/11/29 04:16:29	1.2
+++ ./drivers/md/raid5.c	2000/11/29 06:04:25	1.3
@@ -2352,19 +2352,16 @@
 	sync_request:	raid5_sync_request
 };
 
-int raid5_init (void)
+static int md__init raid5_init (void)
 {
 	return register_md_personality (RAID5, &raid5_personality);
 }
 
-#ifdef MODULE
-int init_module (void)
-{
-	return raid5_init();
-}
-
-void cleanup_module (void)
+static void raid5_exit (void)
 {
 	unregister_md_personality (RAID5);
 }
-#endif
+
+module_init(raid5_init);
+module_exit(raid5_exit);
+
--- ./drivers/md/linear.c	2000/11/29 05:45:04	1.1
+++ ./drivers/md/linear.c	2000/11/29 06:04:25	1.2
@@ -190,24 +190,16 @@
 	status:		linear_status,
 };
 
-#ifndef MODULE
-
-void md__init linear_init (void)
-{
-	register_md_personality (LINEAR, &linear_personality);
-}
-
-#else
-
-int init_module (void)
+static int md__init linear_init (void)
 {
-	return (register_md_personality (LINEAR, &linear_personality));
+	return register_md_personality (LINEAR, &linear_personality);
 }
 
-void cleanup_module (void)
+static void linear_exit (void)
 {
 	unregister_md_personality (LINEAR);
 }
-#endif
+
+module_init(linear_init);
+module_exit(linear_exit);
--- ./drivers/md/raid0.c	2000/11/29 05:45:04	1.1
+++ ./drivers/md/raid0.c	2000/11/29 06:04:25	1.2
@@ -333,24 +333,17 @@
 	status:		raid0_status,
 };
 
-#ifndef MODULE
-
-void raid0_init (void)
-{
-	register_md_personality (RAID0, &raid0_personality);
-}
-
-#else
-
-int init_module (void)
+static int md__init raid0_init (void)
 {
-	return (register_md_personality (RAID0, &raid0_personality));
+	return register_md_personality (RAID0, &raid0_personality);
 }
 
-void cleanup_module (void)
+static void raid0_exit (void)
 {
 	unregister_md_personality (RAID0);
 }
-#endif
+
+module_init(raid0_init);
+module_exit(raid0_exit);
+
--- ./drivers/md/raid1.c	2000/11/29 05:45:04	1.1
+++ ./drivers/md/raid1.c	2000/11/29 06:04:25	1.2
@@ -1882,19 +1882,16 @@
 	sync_request:	raid1_sync_request
 };
 
-int raid1_init (void)
+static int md__init raid1_init (void)
 {
 	return register_md_personality (RAID1,
Re: raid1 resync problem ? (fwd)
On Tuesday November 28, [EMAIL PROTECTED] wrote:
 Hi, I'm forwarding the message to you guys because I got no answer from Ingo

Thanks. I would suggest always CCing to [EMAIL PROTECTED]. I have taken the liberty of CCing this reply there.

 -- Forwarded message --
 Date: Sat, 25 Nov 2000 14:21:28 -0200 (BRST)
 From: Marcelo Tosatti [EMAIL PROTECTED]
 To: Ingo Molnar [EMAIL PROTECTED]
 Subject: raid1 resync problem ?

 Hi Ingo,

 While reading raid1 code from the 2.4 kernel, I've found the following part of the raid1_make_request function:

	...
	spin_lock_irq(&conf->segment_lock);
	wait_event_lock_irq(conf->wait_done,
			bh->b_rsector < conf->start_active ||
			bh->b_rsector >= conf->start_future,
			conf->segment_lock);
	if (bh->b_rsector < conf->start_active)
		conf->cnt_done++;
	else {
		conf->cnt_future++;
		if (conf->phase)
			set_bit(R1BH_SyncPhase, &r1_bh->state);
	}
	spin_unlock_irq(&conf->segment_lock);
	...

 If I understood correctly, bh->b_rsector is used to know if the sector number of the request being processed is not inside the resync range. In case it is, it sleeps waiting for the resync daemon. Otherwise, it can send the operation to the lower level block device(s).

 The problem is that the code does not check the request length to know if the last sector of the request is smaller than conf->start_active. For example, if we have conf->start_active = 1000, a write request with 8 sectors and bh->b_rsector = 905 is allowed to be done. 3 blocks (1001, 1002 and 1003) of this request are inside the resync range.

The reason is subtle, but this cannot happen. resync is always done in full pages. So (on intel) start_active will always be a multiple of 8. Also, b_size can be at most one page (i.e. 4096 == 8 sectors) and b_rsector will be aligned to a multiple of b_size. Given this, if rsector < start_active, you can be certain that rsector+(b_size>>9) <= start_active, so there isn't a problem and your change is not necessary. Adding a comment to the code to explain this subtlety might be sensible though...

NeilBrown

 If I haven't missed anything, we can easily fix it using the last sector (bh->b_rsector + (bh->b_size >> 9)) instead of the first sector when comparing with conf->start_active.

 Waiting for your comments.

- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED]
Re: raid1 resync problem ? (fwd)
On Tuesday November 28, [EMAIL PROTECTED] wrote:
 snip
 If I understood correctly, bh->b_rsector is used to know if the sector number of the request being processed is not inside the resync range. In case it is, it sleeps waiting for the resync daemon. Otherwise, it can send the operation to the lower level block device(s).

 The problem is that the code does not check the request length to know if the last sector of the request is smaller than conf->start_active. For example, if we have conf->start_active = 1000, a write request with 8 sectors and bh->b_rsector = 905 is allowed to be done. 3 blocks (1001, 1002 and 1003) of this request are inside the resync range.

 The reason is subtle, but this cannot happen. resync is always done in full pages. So (on intel) start_active will always be a multiple of 8. Also, b_size can be at most one page (i.e. 4096 == 8 sectors) and b_rsector will be aligned to a multiple of b_size. Given this, if rsector < start_active, you can be certain that rsector+(b_size>>9) <= start_active, so there isn't a problem and your change is not necessary. Adding a comment to the code to explain this subtlety might be sensible though...

 This becomes a problem with kiobuf requests (I have a patch to make the raid1 code kiobuf-aware). With kiobufs, it's possible (right now) to have requests up to 64kb, so the current code is problematic.

In that case, your change sounds quite reasonable and should be included in your patch to make raid1 kiobuf-aware. Is there a URL to this patch? Can I look at it?

NeilBrown
Re: we are finding that parity writes are half of all writes when writing 50mb files
On Tuesday November 28, [EMAIL PROTECTED] wrote:
 On Tue, Nov 28, 2000 at 10:50:06AM +1100, Neil Brown wrote:
  However, there is only one "unplug-all-devices"(*) call in the API that a reader or writer can make. It is not possible to unplug a particular device, or better still, to unplug a particular request.

 This isn't totally true. When we run out of requests on a certain device and we must start up the I/O to release some of them, we unplug only the _single_ device and we don't unplug-all-devices anymore. So during writes to disk (without using O_SYNC, which isn't supported by 2.4.0-test11 anyway :) you never unplug-all-devices, but you only unplug fine-grained at the hard disk level.

Thanks for these comments. They helped me think more clearly about what was going on, and as a result I have raid5 working even faster still, though not quite as fast as I hoped...

The raid5 device has a "stripe cache" where the stripes play a similar role to the requests used in the elevator code (__make_request), i.e. they gather together buffer_heads for requests that are most efficiently processed at the same time. When I run out of stripes, I need to wait for one to become free, but first I need to unplug any underlying devices to make sure that something *will* become free soon. When I unplug those devices, I have to call "run_task_queue(&tq_disk)" (because that is the only interface), and this unplugs the raid5 device too. This substantially reduces the effectiveness of the plugging that I had implemented.

To get around this artifact that I unplug whenever I need a stripe, I changed the "get-free-stripe" code so that if it has to wait for a stripe, it waits for 16 stripes. This means that we should be able to get up to 16 stripes all plugged together. This has helped a lot and I am now getting dbench throughputs on 4K and 8K chunk sizes (still waiting for the rest of the results) that are better than 2.2 ever gave me.

It still isn't as good as I hoped: with 4K chunks the 3-drive throughput is significantly better than the 2-drive throughput. With 8K it is now slightly less (instead of much less). But in 2.2, 3 drives with 8K chunks is better than 2 drives with 8K chunks.

What I really want to be able to do when I need a stripe but don't have a free one is:
 1/ unplug any underlying devices
 2/ if 50% of my stripes are plugged, unplug this device.
(or some mild variation of that). However with the current interface, I cannot. Still, I suspect I can squeeze a bit more out with the current interface, and it will be enough for 2.4. It will be fun working to make the interfaces just right for 2.5.

NeilBrown

 That isn't true for reads of course; for reads it's the highlevel FS/VM layer that unplugs the queue and it only knows about run_task_queue(&tq_disk), but Jens has patches to fix it too :).

I'll have to have a look... but not today.

 Andrea
PATCH - md/Makefile - link order
Linus,
A couple of versions of this patch went into Alan's tree, but weren't quite right. This one is minimal, but works.

The problem is that with the tidy-up of xor.o, it now auto-initialises itself instead of being called by raid.o, and so needs to be linked *before* md.o, as the initialiser for md.o may start up a raid5 device that needs xor. This patch simply puts xor before md.

I would like to tidy this up further and have all the raid flavours auto-initialise, but there are issues that I have to clarify with the kbuild people before I do that.

After compiling with this patch, % objdump -t vmlinux | grep initcall.init contains:

c03345dc l O .initcall.init 0004 __initcall_calibrate_xor_block
c03345e0 l O .initcall.init 0004 __initcall_md_init
c03345e4 l O .initcall.init 0004 __initcall_md_run_setup

in that order, which convinces me that it gets the order right.

NeilBrown

--- ./drivers/md/Makefile	2000/11/29 03:46:13	1.1
+++ ./drivers/md/Makefile	2000/11/29 04:05:27	1.2
@@ -16,12 +16,16 @@
 obj-n :=
 obj-  :=

-obj-$(CONFIG_BLK_DEV_MD) += md.o
+# NOTE: xor.o must link *before* md.o so that auto-detect
+# of raid5 arrays works (and doesn't Oops). Fortunately
+# they are both export-objs, so setting the order here
+# works.
 obj-$(CONFIG_MD_LINEAR)	+= linear.o
 obj-$(CONFIG_MD_RAID0)	+= raid0.o
 obj-$(CONFIG_MD_RAID1)	+= raid1.o
 obj-$(CONFIG_MD_RAID5)	+= raid5.o xor.o
+obj-$(CONFIG_BLK_DEV_MD) += md.o
 obj-$(CONFIG_BLK_DEV_LVM) += lvm-mod.o

 # Translate to Rules.make lists.
 O_OBJS := $(filter-out $(export-objs), $(obj-y))
PATCH - raid5.c - bad calculation
Linus,
I sent this patch to Alan a little while ago, but after ac4, so I don't know if it went into his tree.

There is a bit of code at the front of raid5_sync_request which calculates which block is the parity block for a given stripe. However, to convert from a block number (1K units) to a sector number it does "<<2" instead of "<<1" (or "*2"), which leads to the wrong results. This can lead to data corruption, hanging, or an Oops. This patch fixes it (and allows my raid5 testing to run happily to completion).

NeilBrown

--- ./drivers/md/raid5.c	2000/11/29 04:15:54	1.1
+++ ./drivers/md/raid5.c	2000/11/29 04:16:29	1.2
@@ -1516,8 +1516,8 @@
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
 	struct stripe_head *sh;
 	int sectors_per_chunk = conf->chunk_size >> 9;
-	unsigned long stripe = (block_nr<<2)/sectors_per_chunk;
-	int chunk_offset = (block_nr<<2) % sectors_per_chunk;
+	unsigned long stripe = (block_nr<<1)/sectors_per_chunk;
+	int chunk_offset = (block_nr<<1) % sectors_per_chunk;
 	int dd_idx, pd_idx;
 	unsigned long first_sector;
 	int raid_disks = conf->raid_disks;
PATCH - md_boot - ifdef fix
Linus,
There are currently two ways to get md/raid devices configured at boot time. AUTODETECT_RAID finds bits of raid arrays from partition types and automagically connects them together. MD_BOOT allows bits of raid arrays to be explicitly described on the boot line.

Currently, MD_BOOT is not effective unless AUTODETECT_RAID is also enabled, as both are implemented by md_run_setup, and md_run_setup is only called ifdef CONFIG_AUTODETECT_RAID. This patch fixes this irregularity.

NeilBrown
(patch against test12-pre2, as were the previous few, but I forgot to mention)

--- ./drivers/md/md.c	2000/11/29 04:22:13	1.2
+++ ./drivers/md/md.c	2000/11/29 04:49:29	1.3
@@ -3853,7 +3853,7 @@
 #endif
 __initcall(md_init);
-#ifdef CONFIG_AUTODETECT_RAID
+#if defined(CONFIG_AUTODETECT_RAID) || defined(CONFIG_MD_BOOT)
 __initcall(md_run_setup);
 #endif
PATCH - md - MAX_REAL yields to MD_SB_DISKS
Linus,
md currently has two #defines which give a limit to the number of devices that can be in a given raid array:

  MAX_REAL (==12) dates back to the time before we had persistent superblocks, and mostly affects raid0.
  MD_SB_DISKS (==27) is a characteristic of the newer persistent superblocks and says how many devices can be described in a superblock.

Having the two is inconsistent and needlessly limits raid0 arrays. This patch replaces MAX_REAL in the few places that it occurs with MD_SB_DISKS.

Thanks to Gary Murakami [EMAIL PROTECTED] for prodding me to make this patch.

NeilBrown

--- ./include/linux/raid/md_k.h	2000/11/29 04:54:32	1.1
+++ ./include/linux/raid/md_k.h	2000/11/29 04:55:47	1.2
@@ -59,7 +59,6 @@
 #error MD doesnt handle bigger kdev yet
 #endif

-#define MAX_REAL     12			/* Max number of disks per md dev */
 #define MAX_MD_DEVS  (1<<MINORBITS)	/* Max number of md dev */

 /*
--- ./include/linux/raid/raid0.h	2000/11/29 04:54:32	1.1
+++ ./include/linux/raid/raid0.h	2000/11/29 04:55:47	1.2
@@ -9,7 +9,7 @@
 	unsigned long dev_offset;	/* Zone offset in real dev */
 	unsigned long size;		/* Zone size */
 	int nb_dev;			/* # of devices attached to the zone */
-	mdk_rdev_t *dev[MAX_REAL];	/* Devices attached to the zone */
+	mdk_rdev_t *dev[MD_SB_DISKS];	/* Devices attached to the zone */
 };

 struct raid0_hash
--- ./drivers/md/md.c	2000/11/29 04:49:29	1.3
+++ ./drivers/md/md.c	2000/11/29 04:55:47	1.4
@@ -3587,9 +3587,9 @@
 {
 	static char * name = "mdrecoveryd";

-	printk (KERN_INFO "md driver %d.%d.%d MAX_MD_DEVS=%d, MAX_REAL=%d\n",
+	printk (KERN_INFO "md driver %d.%d.%d MAX_MD_DEVS=%d, MD_SB_DISKS=%d\n",
 			MD_MAJOR_VERSION, MD_MINOR_VERSION,
-			MD_PATCHLEVEL_VERSION, MAX_MD_DEVS, MAX_REAL);
+			MD_PATCHLEVEL_VERSION, MAX_MD_DEVS, MD_SB_DISKS);

 	if (devfs_register_blkdev (MAJOR_NR, "md", &md_fops))
 	{
@@ -3639,7 +3639,7 @@
 	unsigned long set;
 	int pers[MAX_MD_BOOT_DEVS];
 	int chunk[MAX_MD_BOOT_DEVS];
-	kdev_t devices[MAX_MD_BOOT_DEVS][MAX_REAL];
+	kdev_t devices[MAX_MD_BOOT_DEVS][MD_SB_DISKS];
 } md_setup_args md__initdata = { 0, };

 /*
@@ -3713,7 +3713,7 @@
 		pername="super-block";
 	}
 	devnames = str;
-	for (; i<MAX_REAL && str; i++) {
+	for (; i<MD_SB_DISKS && str; i++) {
 		if ((device = name_to_kdev_t(str))) {
 			md_setup_args.devices[minor][i] = device;
 		} else {
PATCH - md.c - confusing message corrected
Linus,
This is a resend of a patch that probably got lost a week or so ago. (It is also more grammatically correct.)

If md.c has two raid arrays that need to be resynced, and they share a physical device, then the two resyncs are serialised. However the message printed says something about "overlapping", which confuses and worries people needlessly. This patch improves the message.

NeilBrown

--- ./drivers/md/md.c	2000/11/29 04:21:37	1.1
+++ ./drivers/md/md.c	2000/11/29 04:22:13	1.2
@@ -3279,7 +3279,7 @@
 		if (mddev2 == mddev) continue;
 		if (mddev2->curr_resync && match_mddev_units(mddev,mddev2)) {
-			printk(KERN_INFO "md: serializing resync, md%d has overlapping physical units with md%d!\n", mdidx(mddev), mdidx(mddev2));
+			printk(KERN_INFO "md: serializing resync, md%d shares one or more physical units with md%d!\n", mdidx(mddev), mdidx(mddev2));
 			serialize = 1;
 			break;
 		}
Any experience with LSI SYM 53c1010 scsi controller??
Hi,
I am considering using an ASUS CUR-DLS motherboard in a new NFS/RAID server, and wonder if anyone has any experience to report either with it, or with the Ultra-160 dual-bus SCSI controller that it has - the LSI SYM 53c1010. From what I can find in the kernel source, and from LSI Logic's home page, it is supported, but I would love to hear from someone who has used it.

Thanks,
NeilBrown

http://www.asus.com.tw/products/Motherboard/pentiumpro/cur-dls/index.html
ftp://ftp.lsil.com/HostAdapterDrivers/linux/c8xx-driver/Readme
Re: [BUG] reconstruction doesn't start
On Monday November 27, [EMAIL PROTECTED] wrote:
 When md2 is finished then md1 is resynced. Shouldn't they do resync at the same time? I never saw "md: serializing resync,..." which I suspected to get because md0 and md1 share the same physical disks. My findings: The md driver in 2.4.0-test11-ac4 does ALL raid-1 resyncs serialized!!!

Close. All *reconstructions* are serialised. All *resyncs* are not serialised. Here *reconstruction* means when a failed disk has been replaced and data and parity is reconstructed onto it. *resync* means that after an unclean shutdown the parity is checked and corrected if necessary.

This is an artifact of how the code was written. It is not something that "should be". It is merely something that "is". It is on my todo list to fix this, but not very high.

NeilBrown

 Can someone replicate this with their system?

 MfG / Regards
 Friedrich Lobenstock
Re: compatability between patched 2.2 and 2.4?
On Tuesday November 7, [EMAIL PROTECTED] wrote:
 I have a question regarding the differences between the 2.2+RAID-patch kernels and the 2.4-test kernels - I was wondering if there are any differences between them. For example, if I build systems with 2.2.17+RAID and later install 2.4 kernels on them, will the transition be seamless as far as RAID goes?
 Thx, marc

Transition should be fine - unless you are using a sparc. But then for sparc, the 2.2 patch didn't really work properly unless you hacked the superblock layout, so you probably know what you are doing anyway.

NeilBrown
PATCH: raid1 - assorted bug fixes
Linus,
The following patch addresses a small number of bugs in raid1.c in 2.4.0-test10.

1/ A number of routines that are called from interrupt context used spin_lock_irq / spin_unlock_irq instead of the more appropriate spin_lock_irqsave( ,flags) / spin_unlock_irqrestore( ,flags). This can, and did, lead to deadlocks on an SMP system.

2/ b_rsector and b_rdev are used in a couple of cases *after* generic_make_request has been called. If the underlying device was, for example, RAID0, these fields would no longer have the assumed values. I have changed these cases to use b_blocknr (suitably scaled) and b_dev. This bug could affect correctness if raid1 is used over raid0 or raid-linear or LVM.

3/ In two cases, b_blocknr is calculated by *multiplying* b_rsector by the sectors-per-block count instead of *dividing* it. This bug could affect correctness when restarting a read request after a drive failure.

NeilBrown

--- ./drivers/md/raid1.c	2000/11/07 02:14:25	1.1
+++ ./drivers/md/raid1.c	2000/11/07 02:15:21	1.2
@@ -91,7 +91,8 @@
 static inline void raid1_free_bh(raid1_conf_t *conf, struct buffer_head *bh)
 {
-	md_spin_lock_irq(&conf->device_lock);
+	unsigned long flags;
+	spin_lock_irqsave(&conf->device_lock, flags);
 	while (bh) {
 		struct buffer_head *t = bh;
 		bh=bh->b_next;
@@ -103,7 +104,7 @@
 			conf->freebh_cnt++;
 		}
 	}
-	md_spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irqrestore(&conf->device_lock, flags);
 	wake_up(&conf->wait_buffer);
 }
@@ -182,10 +183,11 @@
 	r1_bh->mirror_bh_list = NULL;

 	if (test_bit(R1BH_PreAlloc, &r1_bh->state)) {
-		md_spin_lock_irq(&conf->device_lock);
+		unsigned long flags;
+		spin_lock_irqsave(&conf->device_lock, flags);
 		r1_bh->next_r1 = conf->freer1;
 		conf->freer1 = r1_bh;
-		md_spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irqrestore(&conf->device_lock, flags);
 	} else {
 		kfree(r1_bh);
 	}
@@ -229,14 +231,15 @@
 static inline void raid1_free_buf(struct raid1_bh *r1_bh)
 {
+	unsigned long flags;
 	struct buffer_head *bh = r1_bh->mirror_bh_list;
 	raid1_conf_t *conf = mddev_to_conf(r1_bh->mddev);
 	r1_bh->mirror_bh_list = NULL;

-	md_spin_lock_irq(&conf->device_lock);
+	spin_lock_irqsave(&conf->device_lock, flags);
 	r1_bh->next_r1 = conf->freebuf;
 	conf->freebuf = r1_bh;
-	md_spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irqrestore(&conf->device_lock, flags);
 	raid1_free_bh(conf, bh);
 }
@@ -371,7 +374,7 @@
 {
 	struct buffer_head *bh = r1_bh->master_bh;

-	io_request_done(bh->b_rsector, mddev_to_conf(r1_bh->mddev),
+	io_request_done(bh->b_blocknr*(bh->b_size>>9), mddev_to_conf(r1_bh->mddev),
 		test_bit(R1BH_SyncPhase, &r1_bh->state));

 	bh->b_end_io(bh, uptodate);
@@ -599,7 +602,7 @@
 	bh_req = &r1_bh->bh_req;
 	memcpy(bh_req, bh, sizeof(*bh));
-	bh_req->b_blocknr = bh->b_rsector * sectors;
+	bh_req->b_blocknr = bh->b_rsector / sectors;
 	bh_req->b_dev = mirror->dev;
 	bh_req->b_rdev = mirror->dev;
 /*	bh_req->b_rsector = bh->n_rsector; */
@@ -643,7 +646,7 @@
 	/*
 	 * prepare mirrored mbh (fields ordered for max mem throughput):
 	 */
-	mbh->b_blocknr    = bh->b_rsector * sectors;
+	mbh->b_blocknr    = bh->b_rsector / sectors;
 	mbh->b_dev        = conf->mirrors[i].dev;
 	mbh->b_rdev       = conf->mirrors[i].dev;
 	mbh->b_rsector    = bh->b_rsector;
@@ -1181,7 +1184,7 @@
 			struct buffer_head *bh1 = mbh;
 			mbh = mbh->b_next;
 			generic_make_request(WRITE, bh1);
-			md_sync_acct(bh1->b_rdev, bh1->b_size/512);
+			md_sync_acct(bh1->b_dev, bh1->b_size/512);
 		}
 	} else {
 		dev = bh->b_dev;
@@ -1406,7 +1409,7 @@
 	init_waitqueue_head(&bh->b_wait);

 	generic_make_request(READ, bh);
-	md_sync_acct(bh->b_rdev, bh->b_size/512);
+	md_sync_acct(bh->b_dev, bh->b_size/512);

 	return (bsize >> 10);
Re: Linux 2.4.0-test8 and swap/journaling fs on raid
On Wednesday September 27, [EMAIL PROTECTED] wrote:
 I was just wondering if the issues with swap on a raid device and with using a journaling fs on a raid device had been fixed in the latest 2.4.0-test kernels?

Yes. md in 2.4 doesn't do interesting things with the buffer cache, so swap and journaling filesystems should have no issues with it.

 I've gone too soft over the last few years to read the raid code myself :-)
 Thanks in advance, Craig
 PS I might be able to make sense of the code, but my wife would kill me if I spent any more time on the computer.

So print it out and read it that way :-) Following cross references is a bit slow though... maybe a source browser on a palm pilot so you can still do it in the family room... (At this point 5 people chime in and tell me about 3 different whizz-bang packages which print C source with line numbers and cross references and )

NeilBrown

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/