Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
On Mon, Feb 04, 2008 at 07:38:40PM +0300, Michael Tokarev wrote: Eric Sandeen wrote: [] http://oss.sgi.com/projects/xfs/faq.html#nulls and note that recent fixes have been made in this area (also noted in the faq) Also - the above all assumes that when a drive says it's written/flushed data, that it truly has. Modern write-caching drives can wreak havoc with any journaling filesystem, so that's one good reason for a UPS. Unfortunately a UPS does not *really* help here. Unless it has a control program which properly shuts the system down on loss of input power, and the battery really has the capacity to power the system while it's shutting down (anyone tested this? With a new UPS? And after a year of use, when the battery is no longer new?), the UPS will cut the power at an unexpected time, while the disk(s) still have dirty caches... If the UPS is supported by nut (http://www.networkupstools.org) you can do this easily. Obviously you should tune the timeout to give your systems enough time to shut down in case of a power outage, periodically check your battery duration (that means real tests), re-tune the nut software accordingly, and when you discover your battery is dead, change it. L. -- Luca Berra -- [EMAIL PROTECTED] Communication Media Services S.r.l. /\ \ / ASCII RIBBON CAMPAIGN X AGAINST HTML MAIL / \ - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
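For reference, the shutdown behaviour Luca describes is configured in nut's upsmon.conf; a minimal sketch ("myups", "monuser" and "secret" are placeholder names, not values from this thread):

```
# /etc/nut/upsmon.conf -- minimal sketch, names are placeholders
MONITOR myups@localhost 1 monuser secret master
MINSUPPLIES 1                       # below this many good supplies, shut down
SHUTDOWNCMD "/sbin/shutdown -h +0"  # run when the UPS reports low battery
POLLFREQ 5                          # seconds between UPS status polls
```

The shutdown fires on the UPS's low-battery signal, which is exactly why the advice to measure real battery runtime (and re-measure as the battery ages) matters: if the battery collapses before that signal arrives, no clean shutdown happens.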
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
Janek Kozicki wrote: writing on raid10 is supposed to be half the speed of reading. That's because it must write to both mirrors. I am not 100% certain about the following rules, but afaik any raid configuration has a theoretical[1] maximum read speed of the combined speed of all disks in the array, and a maximum write speed equal to the combined speed of the disk-length of a stripe. By disk-length I mean how many disks are needed to reconstruct a single stripe - the rest of the writes are redundancy and are essentially non-accountable work. For raid5 it is N-1. For raid6 it is N-2. For linux raid10 it is N/C, where C is the number of chunk copies (every chunk is written C times, so only N/C disks' worth of bandwidth carries unique data). So for -p n3 -n 5 we would get a maximum write speed of about 5/3 x single drive speed. For raid1 the disk-length of a stripe is always 1. So the statement "IMHO raid5 could perform good here, because in *continuous* write operation the blocks from other HDDs have just been written, they stay in cache and can be used to calculate xor. So you could get close to almost raid-0 performance here." is quite incorrect. You will get close to raid-0 if you have many disks, but will never beat raid0, since one disk is always busy writing parity, which is not part of the write request submitted to the mdX device in the first place. [1] Theoretical, since external factors (busy CPU, unsuitable elevator, random disk access, multiple raid levels on one physical device) all take you further away from the maximums.
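The rule of thumb above can be put in numbers; a toy calculation (my own sketch, with an assumed 80 MB/s per-disk streaming speed, and taking raid10's data-disk count as N/C since each chunk is stored C times):

```shell
#!/bin/sh
# Toy model: theoretical max streaming write speed = per-disk speed times
# the number of data disks per stripe. 80 MB/s per disk is an assumption.
disk=80   # MB/s for one member disk (assumed)
n=6       # disks in the array
awk -v d="$disk" -v n="$n" 'BEGIN {
  printf "raid0 : %d MB/s\n", d * n        # every disk carries data
  printf "raid5 : %d MB/s\n", d * (n - 1)  # one disk-equivalent of parity
  printf "raid6 : %d MB/s\n", d * (n - 2)  # two disk-equivalents of parity
  printf "raid10: %d MB/s\n", d * n / 2    # n2 layout: two copies of each chunk
}'
```

With these assumptions it prints 480, 400, 320 and 240 MB/s for raid0/5/6/10 respectively; the read maximum would be the full 480 MB/s in every case.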
Re: draft howto on making raids for surviving a disk crash
On Sat, Feb 02, 2008 at 08:41:31PM +0100, Keld Jørn Simonsen wrote: Make each of the disks bootable by lilo: lilo -b /dev/sda /etc/lilo.conf1 lilo -b /dev/sdb /etc/lilo.conf2 There should be no need for that. To achieve the above effect with lilo you use raid-extra-boot=mbr-only in lilo.conf. Make each of the disks bootable by grub Install grub with the command grub-install /dev/md0. L. -- Luca Berra -- [EMAIL PROTECTED] Communication Media Services S.r.l.
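For completeness, a sketch of where that option sits in lilo.conf (paths and labels are illustrative; see the lilo.conf man page for the other raid-extra-boot values):

```
# /etc/lilo.conf -- illustrative fragment for a raid1 /dev/md0
boot=/dev/md0
raid-extra-boot=mbr-only   # write a boot record to the MBR of every member disk
image=/boot/vmlinuz
    label=linux
    root=/dev/md0
    read-only
```

With mbr-only, a single run of lilo updates every disk in the mirror, so there is no need for per-disk config files.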
Re: Deleting mdadm RAID arrays
Tuesday 05 February 2008 21:12:32 Neil Brown wrote: % mdadm --zero-superblock /dev/sdb1 mdadm: Couldn't open /dev/sdb1 for write - not zeroing That's weird. Why can't it open it? Hell if I know. First time I have seen such a thing. Maybe you aren't running as root (the '%' prompt is suspicious). I am running as root; the % prompt is the obfuscation part (I have configured bash to display an IP as part of the prompt). Maybe the kernel has been told to forget about the partitions of /dev/sdb. But fdisk/cfdisk has no problem whatsoever finding the partitions. mdadm will sometimes tell it to do that, but only if you try to assemble arrays out of whole components. If that is the problem, then blockdev --rereadpt /dev/sdb I deleted the LVM devices that were sitting on top of the RAID and reinstalled mdadm. % blockdev --rereadpt /dev/sdf BLKRRPART: Device or resource busy % mdadm /dev/md2 --fail /dev/sdf1 mdadm: set /dev/sdf1 faulty in /dev/md2 % blockdev --rereadpt /dev/sdf BLKRRPART: Device or resource busy % mdadm /dev/md2 --remove /dev/sdf1 mdadm: hot remove failed for /dev/sdf1: Device or resource busy lsof /dev/sdf1 gives ZERO results. arrrRRRGH Regards, Marcin Krol
Re: Deleting mdadm RAID arrays
Tuesday 05 February 2008 12:43:31 Moshe Yudkowsky wrote: 1. Where does this info on the array reside?! I have deleted /etc/mdadm/mdadm.conf and the /dev/md devices and yet it comes seemingly out of nowhere. /boot has a copy of mdadm.conf so that / and other drives can be started and then mounted. update-initramfs will update /boot's copy of mdadm.conf. Yeah, I found that while deleting the mdadm package... Thanks for the answers everyone anyway. Regards, Marcin Krol
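On a Debian-style system the sequence would be roughly (a sketch; the initrd filename depends on your setup):

```
# refresh the initramfs copy of mdadm.conf after editing the real one
update-initramfs -u
# optionally confirm the embedded copy is there (gzip'd cpio archive):
zcat /boot/initrd.img-$(uname -r) | cpio -t 2>/dev/null | grep mdadm.conf
```

Until update-initramfs is re-run, the boot-time assembly keeps using the stale copy, which is exactly the "comes out of nowhere" effect described above.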
Re: Deleting mdadm RAID arrays
Marcin Krol wrote: Tuesday 05 February 2008 21:12:32 Neil Brown wrote: % mdadm --zero-superblock /dev/sdb1 mdadm: Couldn't open /dev/sdb1 for write - not zeroing That's weird. Why can't it open it? Hell if I know. First time I have seen such a thing. Maybe you aren't running as root (the '%' prompt is suspicious). I am running as root; the % prompt is the obfuscation part (I have configured bash to display an IP as part of the prompt). Maybe the kernel has been told to forget about the partitions of /dev/sdb. But fdisk/cfdisk has no problem whatsoever finding the partitions. mdadm will sometimes tell it to do that, but only if you try to assemble arrays out of whole components. If that is the problem, then blockdev --rereadpt /dev/sdb I deleted the LVM devices that were sitting on top of the RAID and reinstalled mdadm. % blockdev --rereadpt /dev/sdf BLKRRPART: Device or resource busy % mdadm /dev/md2 --fail /dev/sdf1 mdadm: set /dev/sdf1 faulty in /dev/md2 % blockdev --rereadpt /dev/sdf BLKRRPART: Device or resource busy % mdadm /dev/md2 --remove /dev/sdf1 mdadm: hot remove failed for /dev/sdf1: Device or resource busy lsof /dev/sdf1 gives ZERO results. What does this say: dmsetup table
Re: Deleting mdadm RAID arrays
On Wednesday February 6, [EMAIL PROTECTED] wrote: Maybe the kernel has been told to forget about the partitions of /dev/sdb. But fdisk/cfdisk has no problem whatsoever finding the partitions. It is looking at the partition table on disk, not at the kernel's idea of partitions, which is initialised from that table... What does cat /proc/partitions say? mdadm will sometimes tell it to do that, but only if you try to assemble arrays out of whole components. If that is the problem, then blockdev --rereadpt /dev/sdb I deleted the LVM devices that were sitting on top of the RAID and reinstalled mdadm. % blockdev --rereadpt /dev/sdf BLKRRPART: Device or resource busy Implies that some partition is in use. % mdadm /dev/md2 --fail /dev/sdf1 mdadm: set /dev/sdf1 faulty in /dev/md2 % blockdev --rereadpt /dev/sdf BLKRRPART: Device or resource busy % mdadm /dev/md2 --remove /dev/sdf1 mdadm: hot remove failed for /dev/sdf1: Device or resource busy OK, that's weird. If sdf1 is faulty, then you should be able to remove it. What do cat /proc/mdstat and dmesg | tail say at this point? NeilBrown
Re: Deleting mdadm RAID arrays
Wednesday 06 February 2008 11:11:51 Peter Rabbitson wrote: lsof /dev/sdf1 gives ZERO results. What does this say: dmsetup table % dmsetup table vg-home: 0 61440 linear 9:2 384 Regards, Marcin Krol
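Worth noting (my reading, not stated explicitly in the thread): the 9:2 in that table line is the major:minor of the underlying device, and major 9 is md - so the vg-home mapping still sits on top of /dev/md2 and holds it open, which would explain every "Device or resource busy" above. A hypothetical cleanup sketch (assumes the device-mapper name vg-home means VG "vg", LV "home"):

```
dmsetup info vg-home    # show the mapping and its open count
vgchange -an vg         # deactivate the volume group, or:
dmsetup remove vg-home  # drop the stale mapping directly if LVM is already gone
mdadm --stop /dev/md2   # the array should now release its member disks
```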
Re: Deleting mdadm RAID arrays
Marcin Krol wrote: Hello everyone, I have had a problem with the RAID array (udev messed up disk names; I had RAID on whole disks, without raid partitions) Do you mean that you originally used /dev/sdb for the RAID array? And now you are using /dev/sdb1? Given the system seems confused, I wonder if this may be relevant? David
Re: Deleting mdadm RAID arrays
Wednesday 06 February 2008 12:22:00: I have had a problem with the RAID array (udev messed up disk names; I had RAID on whole disks, without raid partitions) Do you mean that you originally used /dev/sdb for the RAID array? And now you are using /dev/sdb1? That's reconfigured now, it doesn't matter (started up the host in single user, created partitions as opposed to running RAID on whole disks as previously). Given the system seems confused I wonder if this may be relevant? I don't think so; I tried most mdadm operations (fail, remove, etc.) on disks (like sdb) and partitions (like sdb1) and get identical messages for either. -- Marcin Krol
Re: Deleting mdadm RAID arrays
Wednesday 06 February 2008 11:43:12: On Wednesday February 6, [EMAIL PROTECTED] wrote: Maybe the kernel has been told to forget about the partitions of /dev/sdb. But fdisk/cfdisk has no problem whatsoever finding the partitions. It is looking at the partition table on disk, not at the kernel's idea of partitions, which is initialised from that table... Aha! Thanks for this bit. I get it now. What does cat /proc/partitions say? Note: I have reconfigured udev now to associate device names with serial numbers (below)

% cat /proc/partitions
major minor  #blocks  name
   8     0  390711384 sda
   8     1  390708801 sda1
   8    16  390711384 sdb
   8    17  390708801 sdb1
   8    32  390711384 sdc
   8    33  390708801 sdc1
   8    48  390710327 sdd
   8    49  390708801 sdd1
   8    64  390711384 sde
   8    65  390708801 sde1
   8    80  390711384 sdf
   8    81  390708801 sdf1
   3    64   78150744 hdb
   3    65    1951866 hdb1
   3    66    7815622 hdb2
   3    67    4883760 hdb3
   3    68          1 hdb4
   3    69     979933 hdb5
   3    70     979933 hdb6
   3    71   61536951 hdb7
   9     1  781417472 md1
   9     0  781417472 md0

/dev/disk/by-id % ls -l
total 0
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-ST380023A_3KB0MV22 -> ../../hdb
lrwxrwxrwx 1 root root 10 2008-02-06 13:34 ata-ST380023A_3KB0MV22-part1 -> ../../hdb1
lrwxrwxrwx 1 root root 10 2008-02-06 13:34 ata-ST380023A_3KB0MV22-part2 -> ../../hdb2
lrwxrwxrwx 1 root root 10 2008-02-06 13:34 ata-ST380023A_3KB0MV22-part3 -> ../../hdb3
lrwxrwxrwx 1 root root 10 2008-02-06 13:34 ata-ST380023A_3KB0MV22-part4 -> ../../hdb4
lrwxrwxrwx 1 root root 10 2008-02-06 13:34 ata-ST380023A_3KB0MV22-part5 -> ../../hdb5
lrwxrwxrwx 1 root root 10 2008-02-06 13:34 ata-ST380023A_3KB0MV22-part6 -> ../../hdb6
lrwxrwxrwx 1 root root 10 2008-02-06 13:34 ata-ST380023A_3KB0MV22-part7 -> ../../hdb7
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1696130 -> ../../d_6
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1696130-part1 -> ../../d_6
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1707974 -> ../../d_5
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1707974-part1 -> ../../d_5
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1795228 -> ../../d_1
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1795228-part1 -> ../../d_1
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1795364 -> ../../d_3
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1795364-part1 -> ../../d_3
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1798692 -> ../../d_2
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1798692-part1 -> ../../d_2
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1800255 -> ../../d_4
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1800255-part1 -> ../../d_4
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1696130 -> ../../d_6
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1696130-part1 -> ../../d_6
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1707974 -> ../../d_5
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1707974-part1 -> ../../d_5
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1795228 -> ../../d_1
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1795228-part1 -> ../../d_1
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1795364 -> ../../d_3
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1795364-part1 -> ../../d_3
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1798692 -> ../../d_2
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1798692-part1 -> ../../d_2
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1800255 -> ../../d_4
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1800255-part1 -> ../../d_4

I have no idea why udev can't allocate /dev/d_1p1 to partition 1 on disk d_1. I have explicitly asked it to do that:

/etc/udev/rules.d % cat z24_disks_domeny.rules
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1795228", NAME="d_1"
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1795228-part1", NAME="d_1p1"
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1798692", NAME="d_2"
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1798692-part1", NAME="d_2p1"
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1795364", NAME="d_3"
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1795364-part1", NAME="d_3p1"
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1800255", NAME="d_4"
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1800255-part1", NAME="d_4p1"
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1707974", NAME="d_5"
KERNEL=="sd*", SUBSYSTEM=="block",
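A possible explanation (my guess, not confirmed in the thread): ID_SERIAL_SHORT is identical for a partition and its parent disk - the "-part1" suffix only appears in /dev/disk/by-id symlink names, never in that property - so the "-part1" rules above can never match anything. Matching the partition's kernel name and using udev's %n (kernel number) substitution might work instead:

```
# hypothetical replacement rule for the first disk's partitions
KERNEL=="sd?[0-9]", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1795228", NAME="d_1p%n"
```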
Disk failure during grow, what is the current state.
Hi All, I was wondering if someone might be willing to confirm what the current state of my RAID array is, given the following sequence of events (sorry it's pretty long). I had a clean, running /dev/md0 using 5 disks in RAID 5 (sda1, sdb1, sdc1, sdd1, hdd1). It had been clean like that for a while. So last night I decided it was safe to grow the array onto a sixth disk.

[EMAIL PROTECTED] ~]# mdadm /dev/md0 --add /dev/hdi1
mdadm: added /dev/hdi1
[EMAIL PROTECTED] ~]# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Wed Jan  9 18:57:53 2008
     Raid Level : raid5
     Array Size : 1953535744 (1863.04 GiB 2000.42 GB)
  Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 5
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Tue Feb  5 23:55:59 2008
          State : clean
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1
         Layout : left-symmetric
     Chunk Size : 64K
           UUID : 382c157a:405e0640:c30f9e9e:888a5e63
         Events : 0.429616

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3      22       65        3      active sync   /dev/hdd1
       4       8       49        4      active sync   /dev/sdd1
       5      56        1        -      spare   /dev/hdi1

[EMAIL PROTECTED] ~]# mdadm --grow /dev/md0 --raid-devices=6
mdadm: Need to backup 1280K of critical section..
mdadm: ... critical section passed.
[EMAIL PROTECTED] ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 hdi1[5] sdd1[4] sdc1[2] sdb1[1] sda1[0] hdd1[3]
      1953535744 blocks super 0.91 level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
      [>....................]  reshape =  0.0% (29184/488383936) finish=2787.4min speed=2918K/sec

unused devices: <none>
[EMAIL PROTECTED] ~]#

OK, so that would take nearly 2 days to complete, so I went to bed happy about 10 hours ago. I come to the machine this morning, and I have the following:

[EMAIL PROTECTED] ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 hdi1[5] sdd1[6](F) sdc1[2] sdb1[1] sda1[0] hdd1[3]
      1953535744 blocks super 0.91 level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U]

unused devices: <none>
You have new mail in /var/spool/mail/root
[EMAIL PROTECTED] ~]# mdadm -D /dev/md0
/dev/md0:
        Version : 00.91.03
  Creation Time : Wed Jan  9 18:57:53 2008
     Raid Level : raid5
     Array Size : 1953535744 (1863.04 GiB 2000.42 GB)
  Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Wed Feb  6 05:28:09 2008
          State : clean, degraded
 Active Devices : 5
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 64K
  Delta Devices : 1, (5->6)
           UUID : 382c157a:405e0640:c30f9e9e:888a5e63
         Events : 0.470964

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3      22       65        3      active sync   /dev/hdd1
       4       0        0        4      removed
       5      56        1        5      active sync   /dev/hdi1
       6       8       49        -      faulty spare

[EMAIL PROTECTED] ~]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      56086828  11219432  41972344  22% /
/dev/hda1               101086     18281     77586  20% /boot
/dev/md0            1922882096 1775670344  69070324  97% /Downloads
tmpfs                   513556         0    513556   0% /dev/shm
[EMAIL PROTECTED] ~]# mdadm /dev/md0 --remove /dev/sdd1
mdadm: cannot find /dev/sdd1: No such file or directory
[EMAIL PROTECTED] ~]#

As you can see, one of the original 5 devices (sdd1) has failed and been automatically removed. The reshape has stopped, but the new disk seems to be in and clean, which is the bit I don't understand. The new disk hasn't been added to the size, so it would seem that md has switched it to being used as a spare instead (possibly as the grow hadn't completed?). How come it seems to have recovered so nicely?
Is there something I can do to check its integrity? Was it just so much quicker than 2 days because it switched to only having to sort out the 1 disk? Would it be safe to run an fsck to check the integrity of the fs? I don't want to inadvertently blat the raid array by 'using' it when it's in a dodgy state. I have unmounted the drive for the time being, so that it doesn't get any writes until I know what state
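On the integrity question: on kernels with the sysfs md interface you can ask md itself to verify the array, and fsck can be run read-only first; a sketch (assumes the array is otherwise idle and the filesystem unmounted):

```
echo check > /sys/block/md0/md/sync_action   # read-only scrub of all stripes
cat /sys/block/md0/md/mismatch_cnt           # non-zero means inconsistent stripes found
fsck -n /dev/md0                             # -n: report problems, change nothing
```

Neither command writes to the array, so both are safe to try before trusting the data again.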
Re: draft howto on making raids for surviving a disk crash
Keld Jørn Simonsen wrote: Make each of the disks bootable by grub (to be described) It would probably be good to show how to use the grub shell's install command. It's the most flexible way and gives the most (or rather total) control. I could write some examples.
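For grub legacy that amounts to mapping each raid1 member as (hd0) in turn, so either disk can boot on its own; a sketch with example device names:

```
# run from the grub shell, once per mirror member
grub> device (hd0) /dev/sda
grub> root (hd0,0)           # the partition holding /boot on this disk
grub> setup (hd0)
grub> device (hd0) /dev/sdb  # now pretend the second disk is the boot disk
grub> root (hd0,0)
grub> setup (hd0)
grub> quit
```

The device remapping is the key trick: each disk's boot record then refers to itself, so booting still works when the other disk is dead.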
Purpose of Document? (was Re: draft howto on making raids for surviving a disk crash)
I read through the document, and I've signed up for a Wiki account so I can edit it. One of the things I wanted to do was correct the title. I see that there are *three* different Wiki pages about how to build a system that boots from RAID. None of them are complete yet. So, what is the purpose of this page? I think the purpose is a complete description of how to use RAID to build a system that not only boots from RAID but is robust against other hazards such as file system corruption. -- Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe "If you pay peanuts, you get monkeys." Edward Yourdon, _The Decline and Fall of the American Programmer_
Re: Disk failure during grow, what is the current state.
- Message from [EMAIL PROTECTED] - Date: Wed, 6 Feb 2008 12:58:55 From: Steve Fairbairn [EMAIL PROTECTED] Reply-To: Steve Fairbairn [EMAIL PROTECTED] Subject: Disk failure during grow, what is the current state. To: linux-raid@vger.kernel.org As you can see, one of the original 5 devices (sdd1) has failed and been automatically removed. The reshape has stopped, but the new disk seems to be in and clean, which is the bit I don't understand. The new disk hasn't been added to the size, so it would seem that md has switched it to being used as a spare instead (possibly as the grow hadn't completed?). How come it seems to have recovered so nicely? Is there something I can do to check its integrity? Was it just so much quicker than 2 days because it switched to only having to sort out the 1 disk? Would it be safe to run an fsck to check the integrity of the fs? I don't want to inadvertently blat the raid array by 'using' it when it's in a dodgy state. I have unmounted the drive for the time being, so that it doesn't get any writes until I know what state it is really in. - End message from [EMAIL PROTECTED] - If a drive fails during reshape, the reshape will just continue. The blocks which were on the failed drive are calculated from the other disks, and writes to the failed disk are simply omitted. The result is a raid5 with a failed drive. You should get a new drive asap to restore the redundancy. Also, it's kinda important that you don't run 2.6.23, because it has a nasty bug which would be triggered in this scenario. The reshape probably increased in speed after the system was no longer actively used and io bandwidth freed up. Kind regards, Alex.
FW: Disk failure during grow, what is the current state.
I'm having a nightmare with emails today. I can't get a single one right first time. Apologies to Alex for sending it directly to him and not to the list on the first attempt. Steve -Original Message- From: Steve Fairbairn [mailto:[EMAIL PROTECTED] Sent: 06 February 2008 15:02 To: 'Nagilum' Subject: RE: Disk failure during grow, what is the current state. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Nagilum Sent: 06 February 2008 14:34 To: Steve Fairbairn Cc: linux-raid@vger.kernel.org Subject: Re: Disk failure during grow, what is the current state. If a drive fails during reshape, the reshape will just continue. The blocks which were on the failed drive are calculated from the other disks, and writes to the failed disk are simply omitted. The result is a raid5 with a failed drive. You should get a new drive asap to restore the redundancy. Also, it's kinda important that you don't run 2.6.23, because it has a nasty bug which would be triggered in this scenario. The reshape probably increased in speed after the system was no longer actively used and io bandwidth freed up. Kind regards, Alex. Thanks for the response Alex, but: Array Size : 1953535744 (1863.04 GiB 2000.42 GB) Used Dev Size : 488383936 (465.76 GiB 500.11 GB) Surely the added disk should by now have been added to the Array Size? 5 * 500GB is 2500GB, not 2000GB. This is why I don't think the reshape has continued. As for speeding up because of no IO bandwidth, this also doesn't actually hold very true, because the system was at a point of not being used anyway before I added the disk, and I didn't unmount the drive until this morning, after it claimed it had finished doing anything. It's because the size doesn't match up to all 5 disks being used that I still wonder at the state of the array. Steve.
Re: Purpose of Document? (was Re: draft howto on making raids for surviving a disk crash)
On Wed, Feb 06, 2008 at 08:24:37AM -0600, Moshe Yudkowsky wrote: I read through the document, and I've signed up for a Wiki account so I can edit it. One of the things I wanted to do was correct the title. I see that there are *three* different Wiki pages about how to build a system that boots from RAID. None of them are complete yet. So, what is the purpose of this page? I think the purpose is a complete description of how to use RAID to build a system that not only boots from RAID but is robust against other hazards such as file system corruption. You are right that there is more than one wiki page addressing very related issues. I also considered whether there was a need for the new page, and discussed it with David. And yes, my idea was to make a howto on building a system that can survive a disk crash. A simple system that can also work for a workstation. In fact the main audience is possibly here. So my focus is: survive a failing disk, and keep it simple. Best regards Keld
Re: draft howto on making raids for surviving a disk crash
On Wed, Feb 06, 2008 at 10:05:58AM +0100, Luca Berra wrote: On Sat, Feb 02, 2008 at 08:41:31PM +0100, Keld Jørn Simonsen wrote: Make each of the disks bootable by lilo: lilo -b /dev/sda /etc/lilo.conf1 lilo -b /dev/sdb /etc/lilo.conf2 There should be no need for that. To achieve the above effect with lilo you use raid-extra-boot=mbr-only in lilo.conf. Make each of the disks bootable by grub Install grub with the command grub-install /dev/md0. I have already changed the text on the wiki. Still I am not convinced that what is described there is the best advice. Best regards Keld
RE: Disk failure during grow, what is the current state.
-Original Message- From: Steve Fairbairn [mailto:[EMAIL PROTECTED] Sent: 06 February 2008 15:02 To: 'Nagilum' Subject: RE: Disk failure during grow, what is the current state. Array Size : 1953535744 (1863.04 GiB 2000.42 GB) Used Dev Size : 488383936 (465.76 GiB 500.11 GB) Surely the added disk should by now have been added to the Array Size? 5 * 500GB is 2500GB, not 2000GB. This is why I don't think the reshape has continued. As for speeding up because of no IO bandwidth, this also doesn't actually hold very true, because the system was at a point of not being used anyway before I added the disk, and I didn't unmount the drive until this morning, after it claimed it had finished doing anything. Thanks again to Alex for his comments. I've just rebooted the box; the reshape has continued on the degraded array and an RMA has been raised for the faulty disk. Thanks, Steve.
Re: raid10 on three discs - few questions.
Neil Brown wrote: On Sunday February 3, [EMAIL PROTECTED] wrote: Hi, Maybe I'll buy three HDDs to put a raid10 on them. And get the total capacity of 1.5 of a disc. 'man 4 md' indicates that this is possible and should work. I'm wondering - how is a single disc failure handled in such a configuration? 1. does the array continue to work in a degraded state? Yes. 2. after the failure I can disconnect the faulty drive, connect a new one, start the computer, add the disc to the array and it will sync automatically? Yes. The question seems a bit obvious, but the configuration is, at least for me, a bit unusual. This is why I'm asking. Anybody here tested such a configuration, has some experience? 3. Another thing - would raid10,far=2 work when three drives are used? Would it increase the read performance? Yes. 4. Would it be possible to later '--grow' the array to use 4 discs in raid10? Even with far=2? No. Well, if by later you mean in five years, then maybe. But the code doesn't currently exist. That's a reason to avoid raid10 for certain applications, then, and go with a more manual 1+0 or similar. Can you create a raid10 with one drive missing and add it later? I know, I should try it when I get a machine free... but I'm being lazy today. -- Bill Davidsen [EMAIL PROTECTED] "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismarck
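For what it's worth, mdadm accepts the literal word "missing" in place of a member device, so the experiment Bill describes would look something like this (device names are examples):

```
# create a 4-device raid10,f2 with the last member absent...
mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=4 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 missing
# ...and attach the fourth disk later; md then syncs it automatically
mdadm /dev/md0 --add /dev/sdd1
```

The array runs degraded until the add completes, so there is no redundancy during that window.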
Re: raid1 or raid10 for /boot
Keld Jørn Simonsen wrote: I understand that lilo and grub can only boot partitions that look like a normal single-drive partition. And then I understand that a plain raid10 has a layout which is equivalent to raid1. Can such a raid10 partition be used with grub or lilo for booting? And would there be any advantages in this, for example better disk utilization in the raid10 driver compared with raid1? I don't know about you, but my /boot sees zero use between boots; efficiency and performance improvements strike me as a distinction without a difference, while adding complexity without benefit is always a bad idea. I suggest that you avoid having a learning experience and stick with raid1. -- Bill Davidsen [EMAIL PROTECTED] "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismarck
Re: raid10 on three discs - few questions.
On Feb 6, 2008 12:43 PM, Bill Davidsen [EMAIL PROTECTED] wrote: Can you create a raid10 with one drive missing and add it later? I know, I should try it when I get a machine free... but I'm being lazy today. Yes you can. With 3 drives, however, performance will be awful (at least with layout far, 2 copies). IMO raid10,f2 is a great balance of speed and redundancy. It's faster than raid5 for reading, about the same for writing. It's even potentially faster than raid0 for reading, actually. With 3 disks one should be able to get 3.0 times the speed of one disk, or slightly more, and each read involves only *one* disk instead of 2 as it does with raid5. -- Jon
Re: recommendations for stripe/chunk size
Keld Jørn Simonsen wrote: Hi, I am looking at revising our howto. I see a number of places where a chunk size of 32 kiB is recommended, and even recommendations on maybe using sizes of 4 kiB. Depending on the raid level, a write smaller than the chunk size causes the chunk to be read, altered, and rewritten, vs. just written if the write is a multiple of chunk size. Many filesystems by default use a 4k page size and writes. I believe this is the reasoning behind the suggestion of small chunk sizes. Sequential vs. random and raid level are important here; there's no one size that works best in all cases. My own take on that is that this really hurts performance. Normal disks have a rotation speed of between 5400 (laptop), 7200 (ide/sata) and 10000 (scsi) rounds per minute, giving an average spinning time for one round of 6 to 12 ms, and an average latency of half this, that is 3 to 6 ms. Then you need to add head movement, which is something like 2 to 20 ms - in total an average seek time of 5 to 26 ms, averaging around 13-17 ms. Having a write that is not some multiple of chunk size would seem to require a read-alter-wait_for_disk_rotation-write. And for large sustained sequential i/o, using multiple drives helps transfer; for small random i/o, small chunks are good - I find little benefit to chunks over 256 or maybe 1024k. In about 15 ms you can read on current SATA-II (300 MB/s) or ATA/133 something like 600 to 1200 kB, at actual transfer rates of 80 MB/s on SATA-II and 40 MB/s on ATA/133. So to get some bang for the buck and transfer some data, you should have something like 256/512 kiB chunks. With a transfer rate of 50 MB/s and a chunk size of 256 kiB, giving a time of about 20 ms per transaction, you should with random reads be able to transfer 12 MB/s - my actual figure is about 30 MB/s, which is possibly because of the elevator effect of the file system driver. With a size of 4 kB per chunk you should have a time of 15 ms per transaction, or 66 transactions per second, or a transfer rate of 250 kB/s. So 256 kB vs 4 kB speeds up the transfer by a factor of 50. If you actually see anything like this, your write caching and readahead aren't doing what they should! I actually think the kernel should operate with block sizes like this and not with 4 kiB blocks. It is the readahead and the elevator algorithms that save us from randomly reading 4 kB at a time. Exactly, and nothing saves a R-A-RW cycle if the write is a partial chunk. I also see that there are some memory constraints on this. Having maybe 1000 processes reading, as for my mirror service, 256 KiB buffers would be acceptable, occupying 256 MB RAM. That is reasonable, and I could even tolerate 512 MB RAM used. But going to 1 MiB buffers would be overdoing it for my configuration. What would be the recommended chunk size for today's equipment? I think usage is more important than hardware. My opinion only. Best regards Keld -- Bill Davidsen [EMAIL PROTECTED]
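Keld's per-transaction arithmetic can be written out directly: each random read pays a fixed seek-plus-rotation cost, then streams one chunk at the platter's sustained rate. A small sketch with his numbers (the 15 ms access time and 50 MB/s media rate are his figures; the function name is mine):

```python
def random_read_mb_s(chunk_kib, access_ms=15.0, media_mb_s=50.0):
    """Effective throughput for random reads: a fixed seek+rotation
    cost (access_ms) per transaction, plus the time to stream one
    chunk at the disk's sustained media rate (media_mb_s)."""
    chunk_mb = chunk_kib / 1024.0
    transfer_ms = chunk_mb / media_mb_s * 1000.0
    return chunk_mb / ((access_ms + transfer_ms) / 1000.0)

# 256 KiB chunks: ~20 ms per transaction -> 12.5 MB/s
# 4 KiB chunks:   ~15 ms per transaction -> ~0.26 MB/s, roughly 50x slower
for chunk in (4, 256):
    print(chunk, round(random_read_mb_s(chunk), 2))
```

As Bill notes, real systems rarely behave this badly because readahead and write caching batch the small requests.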
Re[4]: mdadm 2.6.4 : How i can check out current status of reshaping ?
Hello, Neil. Possible you have bad memory, or a bad CPU, or you are overclocking the CPU, or it is getting hot, or something. It seems to me all my problems started after I updated mdadm. This server worked normally (though not as soft-raid) for more than 2-3 years. For the last 6 months it has worked as soft-raid. Everything was fine; I even successfully added a 4th hdd into the raid5 (when it started there were 3 hdds), and that reshape completed fine. Yesterday I ran memtest86 on this server and 10 passes came back WITHOUT any errors. The temperature of the server is about 25 degrees celsius. No overclocking, everything set to defaults. I really do not know what to do, because we need to grow our storage and we cannot. Unfortunately, at this moment mdadm is not helping us with that, much as we want it to. But you clearly have a hardware error. NeilBrown -- Best regards, Andreas-Sokov
Re: recommendations for stripe/chunk size
In message [EMAIL PROTECTED] you wrote: I actually think the kernel should operate with block sizes like this and not with 4 kiB blocks. It is the readahead and the elevator algorithms that save us from randomly reading 4 kB at a time. Exactly, and nothing saves a R-A-RW cycle if the write is a partial chunk. Indeed kernel page size is an important factor in such optimizations. But you have to keep in mind that this is mostly efficient for (very) large strictly sequential I/O operations only - actual file system traffic may be *very* different. We implemented the option to select kernel page sizes of 4, 16, 64 and 256 kB for some PowerPC systems (440SPe, to be precise). A nice graphic of the effect can be found here: https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf Best regards, Wolfgang Denk -- DENX Software Engineering GmbH, MD: Wolfgang Denk Detlev Zundel HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: [EMAIL PROTECTED] You got to learn three things. What's real, what's not real, and what's the difference. - Terry Pratchett, _Witches Abroad_
Re: raid10 on three discs - few questions.
Jon Nelson wrote: On Feb 6, 2008 12:43 PM, Bill Davidsen [EMAIL PROTECTED] wrote: Can you create a raid10 with one drive missing and add it later? I know, I should try it when I get a machine free... but I'm being lazy today. Yes you can. With 3 drives, however, performance will be awful (at least with layout far, 2 copies). Well, the question didn't include being fast. ;-) But if he really wants to create the array now and be able to add to it later, it might still be useful, particularly if later is a small time like when my other drive ships. Thanks for the input, I thought that was possible, but reading code isn't the same as testing. IMO raid10,f2 is a great balance of speed and redundancy. It's faster than raid5 for reading, about the same for writing. It's even potentially faster than raid0 for reading, actually. With 3 disks one should be able to get 3.0 times the speed of one disk, or slightly more, and each stripe involves only *one* disk instead of 2 as it does with raid5. I have used raid10 swap on 3 or more drives fairly often. Other than the Fedora rescue CD not using the space until I start it manually, I find it really fast, and helpful for huge image work. -- Bill Davidsen [EMAIL PROTECTED]
Re: recommendations for stripe/chunk size
Wolfgang Denk wrote: In message [EMAIL PROTECTED] you wrote: I actually think the kernel should operate with block sizes like this and not with 4 kiB blocks. It is the readahead and the elevator algorithms that save us from randomly reading 4 kB at a time. Exactly, and nothing saves a R-A-RW cycle if the write is a partial chunk. Indeed kernel page size is an important factor in such optimizations. But you have to keep in mind that this is mostly efficient for (very) large strictly sequential I/O operations only - actual file system traffic may be *very* different. That was actually what I meant by page size: that of the file system rather than the memory, i.e. the block size typically used for writes. Or multiples thereof, obviously. We implemented the option to select kernel page sizes of 4, 16, 64 and 256 kB for some PowerPC systems (440SPe, to be precise). A nice graphic of the effect can be found here: https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf I started that online and pulled a download to print, very neat stuff. Thanks for the link. Best regards, Wolfgang Denk -- Bill Davidsen [EMAIL PROTECTED]
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
Bill Davidsen said: (by the date of Wed, 06 Feb 2008 13:16:14 -0500) Janek Kozicki wrote: Justin Piszcz said: (by the date of Tue, 5 Feb 2008 17:28:27 -0500 (EST)) writing on raid10 is supposed to be half the speed of reading. That's because it must write to both mirrors. ??? Are you assuming that writes to mirrored copies are done sequentially rather than in parallel? Unless you have enough writes to saturate something, the effective speed approaches the speed of a single drive. I just checked raid1 and raid5, writing 100MB with an fsync at the end. raid1 leveled off at 85% of a single drive after ~30MB. Hi, In the above context I'm talking about raid10 (not about raid1, raid0, raid0+1, raid1+0, raid5 or raid6). Of course writes are done in parallel. When each chunk has two copies, raid10 reads twice as fast as it writes. If each chunk has three copies, then writes are 1/3 the speed of reading. If each chunk has a number of copies equal to the number of drives, then write speed drops down to that of a single drive - 1/Nth of read speed. But it's all just theory. I'd like to see more benchmarks :-) -- Janek Kozicki |
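Janek's copies-based reasoning reduces to a one-line formula: reads can be spread over every disk, while each logical write must be stored once per copy. A sketch of the theoretical ceilings (function name is mine):

```python
def raid10_rates(n_disks, copies, single_mb_s):
    """Theoretical streaming ceilings for md raid10: reads can hit
    every disk in parallel; each chunk is written 'copies' times, so
    the write ceiling is the read ceiling divided by copies."""
    read = n_disks * single_mb_s
    write = read / copies
    return read, write

# 4 disks, 2 copies at 80 MB/s each: reads up to 320 MB/s, writes up to 160.
# With copies == n_disks the write ceiling collapses to a single drive.
print(raid10_rates(4, 2, 80), raid10_rates(4, 4, 80))
```

As Janek says, these are ceilings only; real benchmarks sit below them.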
Re: mdadm 2.6.4 : How i can check out current status of reshaping ?
Andreas-Sokov said: (by the date of Wed, 6 Feb 2008 22:15:05 +0300) Hello, Neil. . Possible you have bad memory, or a bad CPU, or you are overclocking the CPU, or it is getting hot, or something. As seems to me all my problems has been started after i have started update MDADM. what is the update? - you installed a new version of mdadm? - you installed a new kernel? - something else? - what was the version before, and what version is now? - can you downgrade to the previous version? best regards -- Janek Kozicki |
Re: raid1 or raid10 for /boot
On Wed, Feb 06, 2008 at 01:52:11PM -0500, Bill Davidsen wrote: Keld Jørn Simonsen wrote: I understand that lilo and grub can only boot partitions that look like a normal single-drive partition. And then I understand that a plain raid10 has a layout which is equivalent to raid1. Can such a raid10 partition be used with grub or lilo for booting? And would there be any advantages in this, for example better disk utilization in the raid10 driver compared with raid1? I don't know about you, but my /boot sees zero use between boots; efficiency and performance improvements strike me as a distinction without a difference, while adding complexity without benefit is always a bad idea. I suggest that you avoid having a learning experience and stick with raid1. I agree with you, it was only a theoretical question. Best regards keld
Re: recommendations for stripe/chunk size
On Wed, Feb 06, 2008 at 09:25:36PM +0100, Wolfgang Denk wrote: In message [EMAIL PROTECTED] you wrote: I actually think the kernel should operate with block sizes like this and not with 4 kiB blocks. It is the readahead and the elevator algorithms that save us from randomly reading 4 kB at a time. Exactly, and nothing saves a R-A-RW cycle if the write is a partial chunk. Indeed kernel page size is an important factor in such optimizations. But you have to keep in mind that this is mostly efficient for (very) large strictly sequential I/O operations only - actual file system traffic may be *very* different. We implemented the option to select kernel page sizes of 4, 16, 64 and 256 kB for some PowerPC systems (440SPe, to be precise). A nice graphic of the effect can be found here: https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf Yes, that is also what I would expect for sequential reads. Random writes of small data blocks, kind of what is done in big databases, should show another picture, as others have also described. If you look at a single disk, would you get improved performance with asynchronous IO? I am a bit puzzled about my SATA-II performance: nominally I could get 300 MB/s on SATA-II, but I only get about 80 MB/s. Why is that? I thought it was because of latency with synchronous reads. I.e., when a chunk is read, you need to complete the IO operation and then issue a new one. In the meantime, while the CPU is doing these calculations, the disk has spun a little, and to get the next data chunk we need to wait for the disk to spin around until the head is positioned over the right data place on the disk surface. Is that so? Or does the controller take care of this, reading the rest of the not-yet-requested track into a buffer, which can then be delivered next time? Modern disks often have buffers of about 8 or 16 MB. I wonder why they don't have bigger buffers. Anyway, why does a SATA-II drive not deliver something like 300 MB/s? best regards keld
Re: raid10 on three discs - few questions.
On Wednesday February 6, [EMAIL PROTECTED] wrote: 4. Would it be possible to later '--grow' the array to use 4 discs in raid10 ? Even with far=2 ? No. Well if by later you mean in five years, then maybe. But the code doesn't currently exist. That's a reason to avoid raid10 for certain applications, then, and go with a more manual 1+0 or similar. Not really. You cannot reshape a raid0 either. Can you create a raid10 with one drive missing and add it later? I know, I should try it when I get a machine free... but I'm being lazy today. Yes, but then the array would be degraded and a single failure could destroy your data. NeilBrown
Re: Deleting mdadm RAID arrays
On Wednesday February 6, [EMAIL PROTECTED] wrote:

% cat /proc/partitions
major minor  #blocks    name
   8     0  390711384  sda
   8     1  390708801  sda1
   8    16  390711384  sdb
   8    17  390708801  sdb1
   8    32  390711384  sdc
   8    33  390708801  sdc1
   8    48  390710327  sdd
   8    49  390708801  sdd1
   8    64  390711384  sde
   8    65  390708801  sde1
   8    80  390711384  sdf
   8    81  390708801  sdf1
   3    64   78150744  hdb
   3    65    1951866  hdb1
   3    66    7815622  hdb2
   3    67    4883760  hdb3
   3    68          1  hdb4
   3    69     979933  hdb5
   3    70     979933  hdb6
   3    71   61536951  hdb7
   9     1  781417472  md1
   9     0  781417472  md0

So all the expected partitions are known to the kernel - good.

% cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active(auto-read-only) raid5 sdc1[0] sde1[3](S) sdd1[1]
      781417472 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
md1 : active(auto-read-only) raid5 sdf1[0] sdb1[3](S) sda1[1]
      781417472 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]

md0 consists of sdc1, sde1 and sdd1 even though when creating I asked it to use d_1, d_2 and d_3 (this is probably written on the particular disk/partition itself, but I have no idea how to clean this up - mdadm --zero-superblock /dev/d_1 again produces mdadm: Couldn't open /dev/d_1 for write - not zeroing). I suspect it is related to the (auto-read-only). The array is degraded and has a spare, so it wants to do a recovery to the spare. But it won't start the recovery until the array is not read-only. But the recovery process has partly started (you'll see an md1_resync thread) so it won't let go of any failed devices at the moment. If you mdadm -w /dev/md0 the recovery will start. Then mdadm /dev/md0 -f /dev/d_1 will fail d_1, abort the recovery, and release d_1. Then mdadm --zero-superblock /dev/d_1 should work. It is currently failing with EBUSY - --zero-superblock opens the device with O_EXCL to ensure that it isn't currently in use, and as long as it is part of an md array, O_EXCL will fail. I should make that more explicit in the error message.
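Neil's three-step sequence can be collected into a script. Since these commands modify a live array, the sketch below (the dry-run wrapper is my addition; the device names are the poster's) only prints the commands unless DRYRUN=0 is set:

```shell
#!/bin/sh
# Print each command instead of executing it, unless DRYRUN=0.
run() {
    if [ "${DRYRUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run mdadm -w /dev/md0                 # leave auto-read-only; recovery starts
run mdadm /dev/md0 -f /dev/d_1        # fail d_1, aborting the recovery
run mdadm --zero-superblock /dev/d_1  # superblock can now be wiped
```

Review the printed commands against your own array and devices before re-running with DRYRUN=0.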
NeilBrown
Re: recommendations for stripe/chunk size
On Thu, Feb 07, 2008 at 01:31:16AM +0100, Keld Jørn Simonsen wrote: Anyway, why does a SATA-II drive not deliver something like 300 MB/s? Wait, are you talking about a *single* drive? In that case, it seems you are confusing the interface speed (300MB/s) with the mechanical read speed (80MB/s). If you are asking why a single drive is limited to 80 MB/s, I guess it's a problem of mechanics. Even with NCQ or big readahead settings, ~80-~100 MB/s is the highest I've seen on 7200 RPM drives. And no, the drive does not wait until the CPU processes the current data before reading the next data; drives have a builtin read-ahead mechanism. Honestly, I have 10x as many problems with the low random I/O throughput than with the (high, IMHO) sequential I/O speed. regards, iustin
Re: recommendations for stripe/chunk size
On Wednesday February 6, [EMAIL PROTECTED] wrote: We implemented the option to select kernel page sizes of 4, 16, 64 and 256 kB for some PowerPC systems (440SPe, to be precise). A nice graphic of the effect can be found here: https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf Thanks for the link! quote The second improvement is to remove a memory copy that is internal to the MD driver. The MD driver stages strip data ready to be written next to the I/O controller in a page size pre-allocated buffer. It is possible to bypass this memory copy for sequential writes, thereby saving SDRAM access cycles. /quote I sure hope you've checked that the filesystem never (ever) changes a buffer while it is being written out. Otherwise the data written to disk might be different from the data used in the parity calculation :-) And what are the Second memcpy and First memcpy in the graph? I assume one is the memcpy mentioned above, but what is the other? NeilBrown
Re: recommendations for stripe/chunk size
On Wednesday February 6, [EMAIL PROTECTED] wrote: Keld Jørn Simonsen wrote: Hi I am looking at revising our howto. I see a number of places where a chunk size of 32 kiB is recommended, and even recommendations on maybe using sizes of 4 kiB. Depending on the raid level, a write smaller than the chunk size causes the chunk to be read, altered, and rewritten, vs. just written if the write is a multiple of chunk size. Many filesystems by default use a 4k page size and writes. I believe this is the reasoning behind the suggestion of small chunk sizes. Sequential vs. random and raid level are important here; there's no one size that works best in all cases. Not in md/raid. RAID4/5/6 will do a read-modify-write if you are writing less than one *page*, but then they often do a read-modify-write anyway for parity updates. No level will ever read a whole chunk just because it is a chunk. To answer the original question: The only way to be sure is to test your hardware with your workload with different chunk sizes. But I suspect that around 256K is good on current hardware. NeilBrown
Re: recommendations for stripe/chunk size
On Thursday February 7, [EMAIL PROTECTED] wrote: Anyway, why does a SATA-II drive not deliver something like 300 MB/s? Are you serious? A high-end 15000RPM enterprise grade drive such as the Seagate Cheetah® 15K.6 only delivers 164MB/sec. The SATA bus might be able to deliver 300MB/s, but an individual drive would be around 80MB/s unless it is really expensive. (or was that yesterday? I'm having trouble keeping up with the pace of improvement :-) NeilBrown