Re: mismatch_cnt != 0
On Sun, 24 Feb 2008, Janek Kozicki wrote:
> Justin Piszcz said: (by the date of Sun, 24 Feb 2008 04:26:39 -0500 (EST))
>> Kernel 2.6.24.2. I've seen it on different occasions; this last time, though, it may have been due to a power outage that lasted 2 hours, and obviously the UPS did not hold up that long.
> You should connect the UPS through RS-232 or USB, and if a power-down event is detected, issue a hibernate or shutdown. Currently I issue a hibernate in this case; it works pretty well for 2.6.22 and up.
> -- Janek Kozicki

I have it hooked up, but it was a weird day: the power went on and off many times for upwards of 2-3 hours, and then it died for 2+ hours.

Justin.
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
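The "UPS event -> hibernate" idea above can be sketched as a small event hook. This is a hypothetical handler in the style of what apcupsd or NUT's upsmon would invoke; the event names and the suspend-to-disk method are assumptions (real daemons have their own event vocabularies), and the hibernate line is left commented so the sketch is safe to source.

```shell
# Hypothetical UPS event hook: the monitoring daemon calls this with an
# event name; on low battery we would hibernate (suspend-to-disk).
handle_ups_event() {
  case "$1" in
    onbattery)
      echo "AC lost, running on battery"
      ;;
    lowbattery)
      echo "battery low, hibernating"
      # echo disk > /sys/power/state   # suspend-to-disk; works well on 2.6.22+
      ;;
    online)
      echo "AC restored"
      ;;
    *)
      echo "unhandled event: $1"
      ;;
  esac
}
```

Wiring this into apcupsd's /etc/apcupsd scripts or upsmon's NOTIFYCMD is left to the daemon's own configuration.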
Re: board/controller recommendations?
On Mon, 25 Feb 2008, Dexter Filmore wrote:
> Currently my array consists of four Samsung Spinpoint SATA drives; I'm about to enlarge to 6 drives. As of now they sit on a Sil3114 controller via PCI, hence there's a bottleneck: I can't squeeze out more than 15-30 MB/s write speed (rather 15 today, as the XFS partitions on it are brim full and have started fragmenting). Now I'd like to go for an AMD board with 6 SATA channels connected via PCIe - can someone recommend a board here? Preferably AMD 690 based, so I won't need a video card or similar.
>
> Dex
> --
> -BEGIN GEEK CODE BLOCK-
> Version: 3.12
> GCS d--(+)@ s-:+ a- C UL++ P+++ L+++ E-- W++ N o? K- w--(---) !O M+ V- PS+ PE Y++ PGP t++(---)@ 5 X+(++) R+(++) tv--(+)@ b++(+++) DI+++ D- G++ e* h++ r* y?
> --END GEEK CODE BLOCK--
> http://www.vorratsdatenspeicherung.de

That's always the question: which mobo? I went Intel, as many of their chipsets (965, P35, X38) have 6 SATA ports; I am sure AMD has some as well, though. What I bought a while back was a board with 6 SATA ports, 3 PCIe x1 slots, and 1 PCIe x16. Then you buy the 2-port SATA cards (x1) and plug in your drives. Promise also came out with a 4-port PCIe x1 card, but I have not tried it or seen any reviews for it, and I do not know if it is even supported in Linux.

Also, I'd recommend you run a check/resync on your array before removing it from your current box, then make sure the two new drives do not have any problems, and (to be safe?) expand by adding 1 drive at a time?

Justin.
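The pre-move routine suggested above can be written down as a dry-run plan. This sketch only prints the commands for review rather than executing them; the md name and device names are examples, and the actual commands would need to be run as root. A long SMART self-test plus a read-only badblocks pass is one common way to shake out new drives before trusting them.

```shell
# Emit a pre-migration checklist as shell commands: scrub the existing
# array, then exercise each new drive. Nothing is executed here.
pre_migration_plan() {
  local md=$1; shift
  echo "echo check > /sys/block/$md/md/sync_action   # scrub existing array"
  for dev in "$@"; do
    echo "smartctl -t long /dev/$dev                  # long SMART self-test"
    echo "badblocks -sv /dev/$dev                     # read-only surface scan"
  done
}
```

For example, `pre_migration_plan md0 sde sdf` prints the scrub command plus the two test commands per new drive.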
Re: board/controller recommendations?
On Mon, 25 Feb 2008, Dexter Filmore wrote:
> On Monday 25 February 2008 15:02:31 Justin Piszcz wrote:
>> [original board question snipped]
>> That's always the question: which mobo? I went Intel, as many of their chipsets (965, P35, X38) have 6 SATA ports; I am sure AMD has some as well. What I bought a while back was a board with 6 SATA ports, 3 PCIe x1 slots, and 1 PCIe x16. Then you buy the 2-port SATA cards (x1) and plug in your drives.
>
> Intel means big bucks, since I'd need an Intel CPU, too. The cheapest LGA775 would be around 90 euros, where I get a midrange AMD X2 at 50-60.
>
>> Promise also came out with a 4-port PCIe x1 card, but I have not tried it or seen any reviews for it, and I do not know if it is even supported in Linux.
>
> Now *that's* Promis-ing (huh huh) - happen to know the model name?

http://www.newegg.com/Product/Product.aspx?Item=N82E16816102117
Type: SATA / SAS

>> Also, I'd recommend you run a check/resync on your array before removing it from your current box, and then make sure the two new drives do not have any problems, and (to be safe?)
>> expand by adding 1 drive at a time?
>
> Neil Brown told me to expand 2 drives at once, but I'll back up the array anyway to be safe and simply recreate. I guess selling the 750-GB drive on eBay with 5 bucks off should do :)
Re: board/controller recommendations?
On Mon, 25 Feb 2008, Dexter Filmore wrote:
> On Monday 25 February 2008 19:50:52 Justin Piszcz wrote:
>> [earlier discussion snipped]
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16816102117
>> Type: SATA / SAS
>
> Full-blown RAID 50 controller. A tad overkill-ish for softraid.
> I just came across this one: http://geizhals.at/deutschland/a254413.html
> One would have to have a board featuring a PCIe x4 slot, or an x1 slot mechanically open at the end. Then again, there's this board: http://geizhals.at/deutschland/a244789.html
> If that controller runs in Linux, those two would make a nice combo. I just saw that Adaptec provides open-source drivers for Linux, so chances are it's included or at least scheduled.

Yeah, I heard there are major problems with those (Adaptec boards); that is why I went with the open-source 2-port SATA PCIe cards, which work like a charm.

Justin.
Re: mismatch_cnt != 0
On Sat, 23 Feb 2008, Carlos Carvalho wrote:
> Justin Piszcz ([EMAIL PROTECTED]) wrote on 23 February 2008 10:44:
>> On Sat, 23 Feb 2008, Justin Piszcz wrote:
>>> On Sat, 23 Feb 2008, Michael Tokarev wrote:
>>>> Justin Piszcz wrote:
>>>>> Should I be worried?
>>>>> Fri Feb 22 20:00:05 EST 2008: Executing RAID health check for /dev/md3...
>>>>> Fri Feb 22 21:00:06 EST 2008: cat /sys/block/md3/md/mismatch_cnt
>>>>> Fri Feb 22 21:00:06 EST 2008: 936
>>>>> Fri Feb 22 21:00:09 EST 2008: Executing repair on /dev/md3
>>>>> Fri Feb 22 22:00:10 EST 2008: cat /sys/block/md3/md/mismatch_cnt
>>>>> Fri Feb 22 22:00:10 EST 2008: 936
>>>> Your /dev/md3 is a swap, right? If it's swap, it's quite common to see mismatches here. I don't know why, and I don't think it's correct (there should be a bug somewhere). If it's not swap, there should be no mismatches, UNLESS you initially built your array with --assume-clean. In any case, it's good to understand where those mismatches come from in the first place. As for the difference (or, rather, lack thereof) in the mismatched blocks after check and repair - that's exactly what is expected. Check found 936 mismatches, and repair corrected exactly the same number of them. I.e., if you run check again after repair, you should see 0 mismatches. /mjt
>>> My /dev/md3 is my main RAID 5 partition. Even after repair, it showed 936; I will re-run repair. Also, I did not build my array with --assume-clean, and I run my check on the array once a week.
> The only situation where there could be mismatches on a clean array is if you created it with --assume-clean. After a repair, a check should give zero mismatches, without a reboot. Of course, I'm supposing your hardware is working without glitches...
>> After a reboot and check, it is back to 0 -- interesting...
> Looks like a bug... Which kernel version?

Kernel 2.6.24.2. I've seen it on different occasions; this last time, though, it may have been due to a power outage that lasted 2 hours, and obviously the UPS did not hold up that long.
Will keep an eye on this to see if any additional mismatches show up.

Justin.
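The weekly check-then-repair routine discussed in this thread can be sketched as a small script. The sysfs directory is taken as a parameter so the logic can be exercised against a scratch directory; on a real box you would pass /sys/block/md3/md (and wait for sync_action to return to "idle" between the two steps; that wait is elided here).

```shell
# Minimal scrub sketch: run a check, then issue a repair only when the
# check found mismatched blocks.
md_scrub() {
  local md_dir=$1                      # e.g. /sys/block/md3/md
  echo check > "$md_dir/sync_action"
  # ... wait here for sync_action to go back to "idle" ...
  local n
  n=$(cat "$md_dir/mismatch_cnt")
  if [ "$n" -ne 0 ]; then
    echo "found $n mismatched blocks, issuing repair"
    echo repair > "$md_dir/sync_action"
  else
    echo "array clean"
  fi
}
```

Note that mismatch_cnt reports the count from the last check/repair pass, which is why (as Michael explains above) it still reads 936 right after a repair that fixed 936 blocks.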
Re: How many drives are bad?
How many drives actually failed?

Failed Devices : 1

On Tue, 19 Feb 2008, Norman Elton wrote:
> So I had my first failure today, when I got a report that one drive (/dev/sdam) failed. I've attached the output of mdadm --detail. It appears that two drives are listed as removed, but the array is still functioning. What does this mean? How many drives actually failed? This is all a test system, so I can dink around as much as necessary. Thanks for any advice!
>
> Norman Elton
>
> == OUTPUT OF MDADM ==
>         Version : 00.90.03
>   Creation Time : Fri Jan 18 13:17:33 2008
>      Raid Level : raid5
>      Array Size : 6837319552 (6520.58 GiB 7001.42 GB)
>     Device Size : 976759936 (931.51 GiB 1000.20 GB)
>    Raid Devices : 8
>   Total Devices : 7
> Preferred Minor : 4
>     Persistence : Superblock is persistent
>     Update Time : Mon Feb 18 11:49:13 2008
>           State : clean, degraded
>  Active Devices : 6
> Working Devices : 6
>  Failed Devices : 1
>   Spare Devices : 0
>          Layout : left-symmetric
>      Chunk Size : 64K
>            UUID : b16bdcaf:a20192fb:39c74cb8:e5e60b20
>          Events : 0.110
>
>     Number   Major   Minor   RaidDevice   State
>        0       66       1        0        active sync   /dev/sdag1
>        1       66      17        1        active sync   /dev/sdah1
>        2       66      33        2        active sync   /dev/sdai1
>        3       66      49        3        active sync   /dev/sdaj1
>        4       66      65        4        active sync   /dev/sdak1
>        5        0       0        5        removed
>        6        0       0        6        removed
>        7       66     113        7        active sync   /dev/sdan1
>        8       66      97        -        faulty spare  /dev/sdam1
Re: How many drives are bad?
Neil,

Is this a bug? Also, I have a question for Norman: how come your drives are sda[a-z]1? Typically it is /dev/sda1, /dev/sdb1, etc.

Justin.

On Tue, 19 Feb 2008, Norman Elton wrote:
> But why do two show up as removed? I would expect /dev/sdal1 to show up someplace, either active or failed. Any ideas?
>
> Thanks, Norman
>
> On Feb 19, 2008, at 12:31 PM, Justin Piszcz wrote:
>> How many drives actually failed?
>> Failed Devices : 1
>> [quoted mdadm --detail output snipped; see the earlier message in this thread]
Re: How many drives are bad?
Norman,

I am extremely interested in what distribution you are running on it, and what type of SW RAID you are employing (besides the one you showed here). Are all 48 drives filled, or?

Justin.

On Tue, 19 Feb 2008, Norman Elton wrote:
> Justin,
>
> This is a Sun X4500 (Thumper) box, so it's got 48 drives inside. /dev/sd[a-z] are all there as well, just in other RAID sets. Once you get to /dev/sdz, it starts up at /dev/sdaa, sdab, etc.
>
> I'd be curious if what I'm experiencing is a bug. What should I try to restore the array?
>
> Norman
>
> On 2/19/08, Justin Piszcz [EMAIL PROTECTED] wrote:
>> [earlier messages and quoted mdadm --detail output snipped; see the earlier messages in this thread]
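The sd[a-z] -> sdaa naming Norman describes follows the kernel's base-26-style scheme (a..z, then aa, ab, ...). A small sketch of the mapping from a 0-based disk index to a device name; this mirrors the scheme for illustration, it is not the kernel's own code.

```shell
# Map a 0-based disk index to the sdX naming scheme (sda..sdz, sdaa..).
# Uses the printf octal trick to turn a byte value into a letter.
sd_name() {
  local n=$1 name=""
  while [ "$n" -ge 0 ]; do
    # prepend the letter for (n mod 26), where 97 is ASCII 'a'
    name=$(printf "\\$(printf '%03o' $((97 + n % 26)))")$name
    n=$(( n / 26 - 1 ))
  done
  echo "sd$name"
}
```

Under this scheme the 39th disk (index 38) comes out as sdam and the 48th (index 47) as sdav, which matches a fully populated Thumper.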
Re: HDD errors in dmesg, but don't know why...
Looks like your replacement disk is no good, the SATA port is bad, or there is some other issue. I am not sure what "SDB FIS" means, but as long as you keep getting that error, don't expect the drive to work correctly. I had a drive that did a similar thing (a DOA Raptor), and after I got the replacement it worked fine. However, like I said, I am not sure what that "SDB FIS" error means.

On Mon, 18 Feb 2008, Steve Fairbairn wrote:
> Hi All,
>
> I've got a degraded RAID5 to which I'm trying to add the replacement disk. Trouble is, every time the recovery starts, it flies along at 70MB/s or so. Then, after doing about 1%, it starts dropping rapidly, until eventually a device is marked failed. When I look in dmesg, I get the following...
>
> SCSI device sdd: 976773168 512-byte hdwr sectors (500108 MB)
> sdd: Write Protect is off
> sdd: Mode Sense: 00 3a 00 00
> SCSI device sdd: drive cache: write back
> ata5.00: exception Emask 0x0 SAct 0x7ff SErr 0x0 action 0x0
> ata5.00: (irq_stat 0x00060002, device error via SDB FIS)
> ata5.00: cmd 60/00:10:3f:0e:f9/01:00:00:00:00/40 tag 2 cdb 0x0 data 131072 in
>          res 41/40:00:50:0e:f9/9c:00:00:00:00/40 Emask 0x9 (media error)
> ata5.00: configured for UDMA/100
> ata5: EH complete
> SCSI device sdd: 976773168 512-byte hdwr sectors (500108 MB)
> sdd: Write Protect is off
> sdd: Mode Sense: 00 3a 00 00
> SCSI device sdd: drive cache: write back
> ata5.00: exception Emask 0x0 SAct 0x7ff SErr 0x0 action 0x0
> ata5.00: (irq_stat 0x00060002, device error via SDB FIS)
> ata5.00: cmd 60/00:18:3f:02:f9/01:00:00:00:00/40 tag 3 cdb 0x0 data 131072 in
>          res 41/40:00:c3:02:f9/9c:00:00:00:00/40 Emask 0x9 (media error)
> ata5.00: configured for UDMA/100
> ata5: EH complete
> [the same exception repeats three more times, alternating between the two
>  sectors above, before the device is failed]
>
> I've no idea what to make of these errors. As far as I can work out, the HDs themselves are fine. They are all less than 2 months old. The box is CentOS 5.1:
>
> Linux space.homenet.com 2.6.18-53.1.13.el5 #1 SMP Tue Feb 12 13:02:30 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> Any suggestions on what I can do to stop this issue?
>
> Steve.
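For what it's worth, the SDB FIS is the Set Device Bits FIS, the frame a SATA drive uses to report NCQ command status, so "device error via SDB FIS" just means the error came back on a queued command; the telling part is "Emask 0x9 (media error)", which points at bad sectors on the drive itself rather than the controller. A quick triage sketch that tallies such errors in a saved dmesg capture (the log path is an example; the same two res taskfiles repeating means the same sectors keep failing):

```shell
# Summarize libata media errors in a saved dmesg log: total count, plus the
# number of distinct failing res taskfiles (same res value = same sector).
media_error_summary() {
  local log=$1
  echo "media errors: $(grep -c 'media error' "$log")"
  echo "distinct failing sectors: $(grep 'media error' "$log" \
    | grep -o 'res 41/[0-9a-f:/]*' | sort -u | wc -l | tr -d ' ')"
}
```

Against the log quoted above this would report several media errors but only two distinct failing taskfiles, i.e. a couple of bad spots that the rebuild keeps hitting.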
Re: RAID5 how to change chunk size from 64 to 128, 256? Is it possible?
When you create the array, it's --chunk or -c. I found 256 KiB to 1024 KiB to be optimal.

Justin.

On Sat, 9 Feb 2008, Andreas-Sokov wrote:
> Hi linux-raid.
>
> RAID5: how to change the chunk size from 64 to 128 or 256? Is it possible? Has somebody done this?
>
> --
> Best regards,
> Andreas-Sokov
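To make the point concrete: with the md/mdadm of this era the chunk size is fixed at creation time (mdadm -c/--chunk, in KiB), so the practical path is back up, re-create with the new chunk size, restore. This sketch only prints the creation command for review rather than running it; the md name and device names are examples.

```shell
# Build (but do not run) an mdadm creation command with an explicit chunk
# size. First argument is the chunk size in KiB; the rest are member devices.
mk_raid5_cmd() {
  local chunk_kib=$1; shift
  echo "mdadm --create /dev/md0 --level=5 --chunk=$chunk_kib --raid-devices=$# $*"
}
```

For example, `mk_raid5_cmd 256 /dev/sda1 /dev/sdb1 /dev/sdc1` prints a 3-device RAID 5 creation line with a 256 KiB chunk.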
Re: Any inexpensive hardware recommendations for PCI interface cards?
On Fri, 8 Feb 2008, Iustin Pop wrote:
> On Fri, Feb 08, 2008 at 08:54:55AM -0500, Justin Piszcz wrote:
>> The Promise TX4 PCI works great and supports SATA/300 + NCQ, etc., $60-$70.
>
> Wait, I used the TX4 PCI up until ~2.6.22 and it didn't support NCQ, AFAIK. Are you sure the current driver supports NCQ? I might then revive that card :)
>
> thanks, iustin

Whoa, nice catch. I meant the Promise SATA300 TX4, which now retails for $59.99 with free shipping:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816102062

Justin.
Re: Any inexpensive hardware recommendations for PCI interface cards?
On Fri, 8 Feb 2008, Iustin Pop wrote:
> On Fri, Feb 08, 2008 at 02:24:15PM -0500, Justin Piszcz wrote:
>> Whoa, nice catch. I meant the Promise SATA300 TX4, which now retails for $59.99 with free shipping:
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16816102062
>
> :) Actually, I meant exactly the Promise SATA300 TX4 (the board is in my hand; the chip says PDC40718). The HW supports NCQ, but the Linux sata_promise driver didn't support NCQ when I tested it. Can someone confirm it does NCQ today (2.6.24)?
>
> iustin

I used the board with a Seagate 400 GB NCQ drive, and I recall seeing "Port Up 3.0Gbps/NCQ (31/32)" within the scrolling text upon boot -- but it was a while ago.

Justin.
Re: Any inexpensive hardware recommendations for PCI interface cards?
On Fri, 8 Feb 2008, Bill Davidsen wrote:
> Steve Fairbairn wrote:
>> Can anyone see any issues with what I'm trying to do?
> No.
>> Are there any known issues with IT8212 cards (they worked as straight disks on Linux fine)?
> No idea, I don't have that card.
>> Is anyone using an array with disks on PCI interface cards?
> Works. I've mixed PATA, SATA, onboard, PCI, and FireWire (lack of controllers is the mother of invention). As long as the device under the RAID works, the RAID should work.
>> Is there an issue with mixing motherboard interfaces and PCI card based ones?
> Not that I've found.
>> Does anyone recommend any inexpensive (probably SATA-II) PCI interface cards?
> Not I. Large drives have cured me of FrankenRAID setups recently, other than building little arrays out of USB devices for backup.
>
> --
> Bill Davidsen [EMAIL PROTECTED]
> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark

The Promise TX4 PCI works great and supports SATA/300 + NCQ, etc., $60-$70.

Justin.
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:
> On Thu, Jan 31, 2008 at 02:55:07AM +0100, Keld Jørn Simonsen wrote:
>> On Wed, Jan 30, 2008 at 11:36:39PM +0100, Janek Kozicki wrote:
>>> Keld Jørn Simonsen said: (by the date of Wed, 30 Jan 2008 23:00:07 +0100)
>> All the raid10's will have double time for writing, and raid5 and raid6 will also have double or triple writing times, given that you can do striped writes on the raid0.
>
> For raid5 and raid6 I think this is even worse. My take is that for raid5, when you write something, you first read the chunk data involved, then you read the parity data, then you xor-subtract the data to be changed and xor-add the new data, and then write the new data chunk and the new parity chunk. In total, 2 reads and 2 writes. The reads and writes happen on the same chunks, so latency is minimized, but in essence it is still 4 IO operations, where it is only 2 writes on raid1/raid10 - that is, only half the speed for writing on raid5 compared to raid1/10. On raid6 this amounts to 6 IO operations, resulting in 1/3 of the writing speed of raid1/10.
>
> I note in passing that there is no difference between xor-subtract and xor-add. Also, I assume that you can calculate the parities of both raid5 and raid6 given the old parity chunks and the old and new data chunks. If you have to calculate the new parities by reading all the component data chunks, this is going to be really expensive, both in IO and CPU. For a 10-drive raid5 this would involve reading 9 data chunks, making writes 5 times as expensive as raid1/10.
>
> best regards
> keld

On my benchmarks, RAID 5 gave the best overall speed with 10 Raptors, although I did not play with the various offsets, etc., as much as I have tweaked the RAID 5.

Justin.
Re: recommendations for stripe/chunk size
On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:
> Hi
>
> I am looking at revising our howto. I see a number of places where a chunk size of 32 kiB is recommended, and even recommendations on maybe using sizes of 4 kiB. My own take on that is that this really hurts performance.
>
> Normal disks have a rotation speed of between 5400 (laptop), 7200 (IDE/SATA), and 10000 (SCSI) rounds per minute, giving an average spinning time for one round of 6 to 12 ms, and an average rotational latency of half this, that is, 3 to 6 ms. Then you need to add head movement, which is something like 2 to 20 ms - in total, an average seek time of 5 to 26 ms, averaging around 13-17 ms.
>
> In about 15 ms you can read, on current SATA-II (300 MB/s) or ATA/133, something like 600 to 1200 kB, at actual transfer rates of 80 MB/s on SATA-II and 40 MB/s on ATA/133. So to get some bang for the buck and actually transfer some data, you should have something like 256/512 kiB chunks.
>
> With a transfer rate of 50 MB/s and chunk sizes of 256 kiB, giving a time of about 20 ms per transaction, you should be able, with random reads, to transfer 12 MB/s - my actual figure is about 30 MB/s, which is possibly because of the elevator effect of the file system driver. With a size of 4 kiB per chunk, you have a time of 15 ms per transaction, or 66 transactions per second, or a transfer rate of 250 kB/s. So 256 kiB vs 4 kiB speeds up the transfer by a factor of 50.
>
> I actually think the kernel should operate with block sizes like this, and not with 4 kiB blocks. It is the readahead and the elevator algorithms that save us from randomly reading 4 kB at a time.
>
> I also see that there are some memory constraints on this. Having maybe 1000 processes reading, as for my mirror service, 256 kiB buffers would be acceptable, occupying 256 MB RAM. That is reasonable, and I could even tolerate 512 MB RAM used. But going to 1 MiB buffers would be overdoing it for my configuration.
>
> What would be the recommended chunk size for today's equipment?
> Best regards
> Keld

My benchmarks concluded that 256 KiB to 1024 KiB is optimal; too far below or above that range results in degradation.

Justin.
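Keld's arithmetic can be packaged as a small model: effective random-read throughput is roughly chunk / (seek + chunk/rate). A sketch in integer shell arithmetic, using the rough approximation that 1 MB/s transfers about 1 KiB per ms:

```shell
# Effective random-read throughput for a given chunk size: each transaction
# costs one average seek plus the chunk transfer time.
# Args: chunk (KiB), average seek (ms), media rate (MB/s ~ KiB/ms). Output: KiB/s.
random_read_kibps() {
  local chunk=$1 seek=$2 rate=$3
  local xfer=$(( chunk / rate ))            # chunk transfer time, ms
  echo $(( chunk * 1000 / (seek + xfer) ))
}
```

With a 15 ms average seek and a 50 MB/s media rate this gives about 12800 KiB/s at 256 KiB chunks versus about 266 KiB/s at 4 KiB chunks, matching the post's 12 MB/s vs 250 kB/s figures and the factor-of-roughly-50 claim.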
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:
> On Tue, Feb 05, 2008 at 11:54:27AM -0500, Justin Piszcz wrote:
>> [Keld's raid5/raid6 write-cost analysis snipped; see the earlier message in this thread]
>> On my benchmarks, RAID 5 gave the best overall speed with 10 Raptors, although I did not play with the various offsets, etc., as much as I have tweaked the RAID 5.
>
> Could you give some figures?

I remember testing with bonnie++: raid10 was about half the speed (200-265 MiB/s) of RAID 5 (400-420 MiB/s) for sequential output, but input was closer to RAID 5 speeds and did not seem affected (~550 MiB/s).

Justin.
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote: On Tue, Feb 05, 2008 at 05:28:27PM -0500, Justin Piszcz wrote: Could you give some figures? I remember testing with bonnie++ and raid10 was about half the speed (200-265 MiB/s) as RAID5 (400-420 MiB/s) for sequential output, but input was closer to RAID5 speeds/did not seem affected (~550MiB/s). Impressive. What level of raid10 was involved? And what type of Like I said, it was baseline testing, so pretty much the default raid10 when you create it via mdadm, I did not mess with offsets, etc. equipment, how many disks? Ten 10,000rpm raptors. Maybe the better output for raid5 could be due to some striping - AFAIK raid5 will be striping quite well, and writes almost equal to read times indicate that the writes are striping too. best regards keld
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
On Mon, 4 Feb 2008, Michael Tokarev wrote: Moshe Yudkowsky wrote: [] If I'm reading the man pages, Wikis, READMEs and mailing lists correctly -- not necessarily the case -- the ext3 file system uses the equivalent of data=journal as a default. ext3 defaults to data=ordered, not data=journal. ext2 doesn't have journal at all. The question then becomes what data scheme to use with reiserfs on the I'd say don't use reiserfs in the first place ;) Another way to phrase this: unless you're running data-center grade hardware and have absolute confidence in your UPS, you should use data=journal for reiserfs and perhaps avoid XFS entirely. By the way, even if you do have a good UPS, there should be some control program for it, to properly shut down your system when UPS loses the AC power. So far, I've seen no such programs... /mjt Why avoid XFS entirely? esandeen, any comments here? Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
On Mon, 4 Feb 2008, Michael Tokarev wrote: Eric Sandeen wrote: [] http://oss.sgi.com/projects/xfs/faq.html#nulls and note that recent fixes have been made in this area (also noted in the faq) Also - the above all assumes that when a drive says it's written/flushed data, that it truly has. Modern write-caching drives can wreak havoc with any journaling filesystem, so that's one good reason for a UPS. If Unfortunately an UPS does not *really* help here. Because unless it has control program which properly shuts system down on the loss of input power, and the battery really has the capacity to power the system while it's shutting down (anyone tested this? With new UPS? and after an year of use, when the battery is not new?), -- unless the UPS actually has the capacity to shutdown system, it will cut the power at an unexpected time, while the disk(s) still has dirty caches... You use nut and a large enough UPS to handle the load of the system, it shuts the machine down just fine. the drive claims to have metadata safe on disk but actually does not, and you lose power, the data claimed safe will evaporate, there's not much the fs can do. IO write barriers address this by forcing the drive to flush order-critical data before continuing; xfs has them on by default, although they are tested at mount time and if you have something in between xfs and the disks which does not support barriers (i.e. lvm...) then they are disabled again, with a notice in the logs. Note also that with linux software raid barriers are NOT supported. /mjt - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Fri, 18 Jan 2008, Bill Davidsen wrote: Justin Piszcz wrote: On Thu, 17 Jan 2008, Al Boldi wrote: Justin Piszcz wrote: On Wed, 16 Jan 2008, Al Boldi wrote: Also, can you retest using dd with different block-sizes? I can do this, moment.. I know about oflag=direct but I choose to use dd with sync and measure the total time it takes. /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync' So I was asked on the mailing list to test dd with various chunk sizes, here is the length of time it took to write 10 GiB and sync per each chunk size: 4=chunk.txt:0:25.46 8=chunk.txt:0:25.63 16=chunk.txt:0:25.26 32=chunk.txt:0:25.08 64=chunk.txt:0:25.55 128=chunk.txt:0:25.26 256=chunk.txt:0:24.72 512=chunk.txt:0:24.71 1024=chunk.txt:0:25.40 2048=chunk.txt:0:25.71 4096=chunk.txt:0:27.18 8192=chunk.txt:0:29.00 16384=chunk.txt:0:31.43 32768=chunk.txt:0:50.11 65536=chunk.txt:2:20.80 What do you get with bs=512,1k,2k,4k,8k,16k... Thanks! -- Al - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html root 4621 0.0 0.0 12404 760 pts/2D+ 17:53 0:00 mdadm -S /dev/md3 root 4664 0.0 0.0 4264 728 pts/5S+ 17:54 0:00 grep D Tried to stop it when it was re-syncing, DEADLOCK :( [ 305.464904] md: md3 still in use. [ 314.595281] md: md_do_sync() got signal ... exiting Anyhow, done testing, time to move data back on if I can kill the resync process w/out deadlock. So does that indicate that there is still a deadlock issue, or that you don't have the latest patches installed? -- Bill Davidsen [EMAIL PROTECTED] Woe unto the statesman who makes war without a reason that will still be valid when the war is over... Otto von Bismark I was trying to stop the raid when it was building, vanilla 2.6.23.14. Justin. 
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Fri, 18 Jan 2008, Greg Cormier wrote: Also, don't use ext*, XFS can be up to 2-3x faster (in many of the benchmarks). I'm going to swap file systems and give it a shot right now! :) How is stability of XFS? I heard recovery is easier with ext2/3 due to more people using it, more tools available, etc? Greg Recovery is actually easier with XFS because log recovery is built into the kernel (you don't need to run a utility after a crash); however, there is xfs_repair for the cases the in-kernel recovery cannot fix. I have been using it for 4-5 years now. Also, with CoRaids (ATA over Ethernet) many volumes are above 8TB, and ext3 only works up to 8TB, so it's not even an option any longer. Justin.
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Fri, 18 Jan 2008, Greg Cormier wrote: Justin, thanks for the script. Here are my results. I ran it a few times with different tests, hence the small number of results you see here; I slowly trimmed out the obviously not-ideal sizes. Nice, we all love benchmarks!! :) System --- Athlon64 3500 2GB RAM 4x500GB WD Raid editions, raid 5. SDE is the old 4-platter version (5000YS), the others are the 3-platter version. Faster :-) Ok. /dev/sdb: Timing buffered disk reads: 240 MB in 3.00 seconds = 79.91 MB/sec /dev/sdc: Timing buffered disk reads: 248 MB in 3.01 seconds = 82.36 MB/sec /dev/sdd: Timing buffered disk reads: 248 MB in 3.02 seconds = 82.22 MB/sec /dev/sde: (older model, 4 platters instead of 3) Timing buffered disk reads: 210 MB in 3.01 seconds = 69.87 MB/sec /dev/md3: Timing buffered disk reads: 628 MB in 3.00 seconds = 209.09 MB/sec Testing --- Test was: dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync 64-chunka.txt:2:00.63 128-chunka.txt:2:00.20 256-chunka.txt:2:01.67 512-chunka.txt:2:19.90 1024-chunka.txt:2:59.32 For your configuration, a 64-256k chunk seems optimal for this (somewhat hypothetical) benchmark :) Test was: unraring multipart RARs, 1.2 gigabytes. Source and dest drive were the raid array. 64-chunkc.txt:1:04.20 128-chunkc.txt:0:49.37 256-chunkc.txt:0:48.88 512-chunkc.txt:0:41.20 1024-chunkc.txt:0:40.82 1 meg looks like it's the best, which is what I use today; a 1 MiB chunk offers the best performance by far, at least in all of my testing with big files, such as the tests you performed. So, there's a toss-up between 256 and 512. Yeah, for dd performance, not real life. If I'm interpreting correctly here, raw throughput is better with 256, but 512 seems to work better with real-world stuff? Look above, 1 MiB got you the fastest unrar time.
I'll try to think up another test or two perhaps, and remove 64 as one of the possible options to save time (mke2fs takes a while on 1.5TB). Also, don't use ext*, XFS can be up to 2-3x faster (in many of the benchmarks). Next step will be playing with read-aheads and stripe cache sizes I guess! I'm open to any comments/suggestions you guys have! Greg
Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
For these benchmarks I timed how long it takes to extract a standard 4.4 GiB DVD: Settings: Software RAID 5 with the following settings (until I change those too): Base setup: blockdev --setra 65536 /dev/md3 echo 16384 /sys/block/md3/md/stripe_cache_size echo Disabling NCQ on all disks... for i in $DISKS do echo Disabling NCQ on $i echo 1 /sys/block/$i/device/queue_depth done p34:~# grep : *chunk* |sort -n 4-chunk.txt:0:45.31 8-chunk.txt:0:44.32 16-chunk.txt:0:41.02 32-chunk.txt:0:40.50 64-chunk.txt:0:40.88 128-chunk.txt:0:40.21 256-chunk.txt:0:40.14*** 512-chunk.txt:0:40.35 1024-chunk.txt:0:41.11 2048-chunk.txt:0:43.89 4096-chunk.txt:0:47.34 8192-chunk.txt:0:57.86 16384-chunk.txt:1:09.39 32768-chunk.txt:1:26.61 It would appear a 256 KiB chunk-size is optimal. So what about NCQ? 1=ncq_depth.txt:0:40.86*** 2=ncq_depth.txt:0:40.99 4=ncq_depth.txt:0:42.52 8=ncq_depth.txt:0:43.57 16=ncq_depth.txt:0:42.54 31=ncq_depth.txt:0:42.51 Keeping it off seems best. 1=stripe_and_read_ahead.txt:0:40.86 2=stripe_and_read_ahead.txt:0:40.99 4=stripe_and_read_ahead.txt:0:42.52 8=stripe_and_read_ahead.txt:0:43.57 16=stripe_and_read_ahead.txt:0:42.54 31=stripe_and_read_ahead.txt:0:42.51 256=stripe_and_read_ahead.txt:1:44.16 1024=stripe_and_read_ahead.txt:1:07.01 2048=stripe_and_read_ahead.txt:0:53.59 4096=stripe_and_read_ahead.txt:0:45.66 8192=stripe_and_read_ahead.txt:0:40.73 16384=stripe_and_read_ahead.txt:0:38.99** 16384=stripe_and_65536_read_ahead.txt:0:38.67 16384=stripe_and_65536_read_ahead.txt:0:38.69 (again, this is what I use from earlier benchmarks) 32768=stripe_and_read_ahead.txt:0:38.84 What about logbufs? 2=logbufs.txt:0:39.21 4=logbufs.txt:0:39.24 8=logbufs.txt:0:38.71 (again) 2=logbufs.txt:0:42.16 4=logbufs.txt:0:38.79 8=logbufs.txt:0:38.71** (yes) What about logbsize? 16k=logbsize.txt:1:09.22 32k=logbsize.txt:0:38.70 64k=logbsize.txt:0:39.04 128k=logbsize.txt:0:39.06 256k=logbsize.txt:0:38.59** (best) What about allocsize? 
(default=1024k) 4k=allocsize.txt:0:39.35 8k=allocsize.txt:0:38.95 16k=allocsize.txt:0:38.79 32k=allocsize.txt:0:39.71 64k=allocsize.txt:1:09.67 128k=allocsize.txt:0:39.04 256k=allocsize.txt:0:39.11 512k=allocsize.txt:0:39.01 1024k=allocsize.txt:0:38.75** (default) 2048k=allocsize.txt:0:39.07 4096k=allocsize.txt:0:39.15 8192k=allocsize.txt:0:39.40 16384k=allocsize.txt:0:39.36 What about the agcount? 2=agcount.txt:0:37.53 4=agcount.txt:0:38.56 8=agcount.txt:0:40.86 16=agcount.txt:0:39.05 32=agcount.txt:0:39.07** (default) 64=agcount.txt:0:39.29 128=agcount.txt:0:39.42 256=agcount.txt:0:38.76 512=agcount.txt:0:38.27 1024=agcount.txt:0:38.29 2048=agcount.txt:1:08.55 4096=agcount.txt:0:52.65 8192=agcount.txt:1:06.96 16384=agcount.txt:1:31.21 32768=agcount.txt:1:09.06 65536=agcount.txt:1:54.96 So far I have: p34:~# mkfs.xfs -f -l lazy-count=1,version=2,size=128m -i attr=2 /dev/md3 meta-data=/dev/md3 isize=256agcount=32, agsize=10302272 blks = sectsz=4096 attr=2 data = bsize=4096 blocks=329671296, imaxpct=25 = sunit=64 swidth=576 blks, unwritten=1 naming =version 2 bsize=4096 log =internal log bsize=4096 blocks=32768, version=2 = sectsz=4096 sunit=1 blks, lazy-count=1 realtime =none extsz=2359296 blocks=0, rtextents=0 p34:~# grep /dev/md3 /etc/fstab /dev/md3/r1 xfs noatime,nodiratime,logbufs=8,logbsize=262144 0 1 Notice how mkfs.xfs 'knows' the sunit and swidth, and it is the correct units too because it is software raid, and it pulls this information from that layer, unlike HW raid which will not have a clue of what is underneath and say sunit=0,swidth=0. However, in earlier testing I actually made them both 0 and it actually made performance better: http://home.comcast.net/~jpiszcz/sunit-swidth/results.html In any case, I am re-running bonnie++ once more with a 256 KiB chunk and will compare to those values in a bit. Justin. 
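The sunit/swidth figures mkfs.xfs reports above fall straight out of the raid geometry; a quick sketch of the arithmetic, assuming 4 KiB filesystem blocks and one parity disk (as in this 10-drive raid5 with a 256 KiB chunk):

```shell
# sunit = chunk size expressed in filesystem blocks;
# swidth = sunit * number of data disks (total disks minus parity).
chunk_kib=256; block_kib=4; disks=10; parity_disks=1
sunit=$(( chunk_kib / block_kib ))
swidth=$(( sunit * (disks - parity_disks) ))
echo "sunit=$sunit swidth=$swidth"   # matches the mkfs.xfs output above
```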
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Wed, 16 Jan 2008, Justin Piszcz wrote: For these benchmarks I timed how long it takes to extract a standard 4.4 GiB DVD: Settings: Software RAID 5 with the following settings (until I change those too): http://home.comcast.net/~jpiszcz/sunit-swidth/newresults.html Any idea why an sunit and swidth of 0 (and -d agcount=4) is faster, at least for sequential input/output, than the proper sunit/swidth values? It does not make sense. Justin.
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Wed, 16 Jan 2008, Al Boldi wrote: Justin Piszcz wrote: For these benchmarks I timed how long it takes to extract a standard 4.4 GiB DVD: Settings: Software RAID 5 with the following settings (until I change those too): Base setup: blockdev --setra 65536 /dev/md3 echo 16384 /sys/block/md3/md/stripe_cache_size echo Disabling NCQ on all disks... for i in $DISKS do echo Disabling NCQ on $i echo 1 /sys/block/$i/device/queue_depth done p34:~# grep : *chunk* |sort -n 4-chunk.txt:0:45.31 8-chunk.txt:0:44.32 16-chunk.txt:0:41.02 32-chunk.txt:0:40.50 64-chunk.txt:0:40.88 128-chunk.txt:0:40.21 256-chunk.txt:0:40.14*** 512-chunk.txt:0:40.35 1024-chunk.txt:0:41.11 2048-chunk.txt:0:43.89 4096-chunk.txt:0:47.34 8192-chunk.txt:0:57.86 16384-chunk.txt:1:09.39 32768-chunk.txt:1:26.61 It would appear a 256 KiB chunk-size is optimal. Can you retest with different max_sectors_kb on both md and sd? Remember this is SW RAID, so max_sectors_kb will only affect the individual disks underneath the SW RAID, I have benchmarked in the past, the defaults chosen by the kernel are optimal, changing them did not make any noticable improvements. Also, can you retest using dd with different block-sizes? I can do this, moment.. I know about oflag=direct but I choose to use dd with sync and measure the total time it takes. /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync' So I was asked on the mailing list to test dd with various chunk sizes, here is the length of time it took to write 10 GiB and sync per each chunk size: 4=chunk.txt:0:25.46 8=chunk.txt:0:25.63 16=chunk.txt:0:25.26 32=chunk.txt:0:25.08 64=chunk.txt:0:25.55 128=chunk.txt:0:25.26 256=chunk.txt:0:24.72 512=chunk.txt:0:24.71 1024=chunk.txt:0:25.40 2048=chunk.txt:0:25.71 4096=chunk.txt:0:27.18 8192=chunk.txt:0:29.00 16384=chunk.txt:0:31.43 32768=chunk.txt:0:50.11 65536=chunk.txt:2:20.80 Justin. 
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Wed, 16 Jan 2008, Greg Cormier wrote: What sort of tools are you using to get these benchmarks, and can I use them for ext3? Very interested in running this on my server. Thanks, Greg You can use whatever suits you, such as untarring a kernel source tree, copying files, untarring backups, etc.; you should benchmark specifically what *your* workload is. Here is the skeleton, using bash (don't forget to turn off the cron daemon):

for i in 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536
do
  cd /
  umount /r1
  mdadm -S /dev/md3
  mdadm --create --assume-clean --verbose /dev/md3 --level=5 --raid-devices=10 --chunk=$i --run /dev/sd[c-l]1
  /etc/init.d/oraid.sh # to optimize my raid stuff
  mkfs.xfs -f /dev/md3
  mount /dev/md3 /r1 -o logbufs=8,logbsize=262144
  # then simply add what you do often here
  # everyone's workload is different
  /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync'
done

Then just run: grep : /root/*chunk* | sort -n to get the results in the same format. Justin.
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Thu, 17 Jan 2008, Al Boldi wrote: Justin Piszcz wrote: On Wed, 16 Jan 2008, Al Boldi wrote: Also, can you retest using dd with different block-sizes? I can do this, moment.. I know about oflag=direct but I choose to use dd with sync and measure the total time it takes. /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync' So I was asked on the mailing list to test dd with various chunk sizes, here is the length of time it took to write 10 GiB and sync per each chunk size: 4=chunk.txt:0:25.46 8=chunk.txt:0:25.63 16=chunk.txt:0:25.26 32=chunk.txt:0:25.08 64=chunk.txt:0:25.55 128=chunk.txt:0:25.26 256=chunk.txt:0:24.72 512=chunk.txt:0:24.71 1024=chunk.txt:0:25.40 2048=chunk.txt:0:25.71 4096=chunk.txt:0:27.18 8192=chunk.txt:0:29.00 16384=chunk.txt:0:31.43 32768=chunk.txt:0:50.11 65536=chunk.txt:2:20.80 What do you get with bs=512,1k,2k,4k,8k,16k... Thanks! -- Al - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Done testing for now, but I did test with 256k with a 256k chunk and obviously that got good results, just like 1m with a 1mb chunk, 460-480 MiB/s. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Thu, 17 Jan 2008, Al Boldi wrote: Justin Piszcz wrote: On Wed, 16 Jan 2008, Al Boldi wrote: Also, can you retest using dd with different block-sizes? I can do this, moment.. I know about oflag=direct but I choose to use dd with sync and measure the total time it takes. /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync' So I was asked on the mailing list to test dd with various chunk sizes, here is the length of time it took to write 10 GiB and sync per each chunk size: 4=chunk.txt:0:25.46 8=chunk.txt:0:25.63 16=chunk.txt:0:25.26 32=chunk.txt:0:25.08 64=chunk.txt:0:25.55 128=chunk.txt:0:25.26 256=chunk.txt:0:24.72 512=chunk.txt:0:24.71 1024=chunk.txt:0:25.40 2048=chunk.txt:0:25.71 4096=chunk.txt:0:27.18 8192=chunk.txt:0:29.00 16384=chunk.txt:0:31.43 32768=chunk.txt:0:50.11 65536=chunk.txt:2:20.80 What do you get with bs=512,1k,2k,4k,8k,16k... Thanks! -- Al - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html root 4621 0.0 0.0 12404 760 pts/2D+ 17:53 0:00 mdadm -S /dev/md3 root 4664 0.0 0.0 4264 728 pts/5S+ 17:54 0:00 grep D Tried to stop it when it was re-syncing, DEADLOCK :( [ 305.464904] md: md3 still in use. [ 314.595281] md: md_do_sync() got signal ... exiting Anyhow, done testing, time to move data back on if I can kill the resync process w/out deadlock. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
How do I get rid of old device?
p34:~# mdadm /dev/md3 --zero-superblock p34:~# mdadm --examine --scan ARRAY /dev/md0 level=raid1 num-devices=2 UUID=f463057c:9a696419:3bcb794a:7aaa12b2 ARRAY /dev/md1 level=raid1 num-devices=2 UUID=98e4948c:c6685f82:e082fd95:e7f45529 ARRAY /dev/md2 level=raid1 num-devices=2 UUID=330c9879:73af7d3e:57f4c139:f9191788 ARRAY /dev/md3 level=raid0 num-devices=10 UUID=6dc12c36:b3517ff9:083fb634:68e9eb49 p34:~# I cannot seem to get rid of /dev/md3; it's almost as if there is a piece of it on the root (2) disks or a reference to it? I also dd'd the other 10 disks (non-root) and /dev/md3 persists.
Re: How do I get rid of old device?
On Wed, 16 Jan 2008, Justin Piszcz wrote: p34:~# mdadm /dev/md3 --zero-superblock p34:~# mdadm --examine --scan ARRAY /dev/md0 level=raid1 num-devices=2 UUID=f463057c:9a696419:3bcb794a:7aaa12b2 ARRAY /dev/md1 level=raid1 num-devices=2 UUID=98e4948c:c6685f82:e082fd95:e7f45529 ARRAY /dev/md2 level=raid1 num-devices=2 UUID=330c9879:73af7d3e:57f4c139:f9191788 ARRAY /dev/md3 level=raid0 num-devices=10 UUID=6dc12c36:b3517ff9:083fb634:68e9eb49 p34:~# I cannot seem to get rid of /dev/md3; it's almost as if there is a piece of it on the root (2) disks or a reference to it? I also dd'd the other 10 disks (non-root) and /dev/md3 persists. Hopefully this will clear it out: p34:~# for i in /dev/sd[c-l]; do /usr/bin/time dd if=/dev/zero of=$i bs=1M & done [1] 4625 [2] 4626 [3] 4627 [4] 4628 [5] 4629 [6] 4630 [7] 4631 [8] 4632 [9] 4633 [10] 4634 p34:~# Good aggregate bandwidth at least, writing to all 10 disks.
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa
1 9 0 46472 7201008 7342400 0 658756 2339 2242 0 22 24 54
3 10 0 44132 7204680 7329200 0 660040 2335 2276 0 22 19 59
5 8 0 48196 7201840 7373600 0 652708 2403 1645 0 23 11 66
2 9 0 45728 7205036 7262800 0 659844 2296 1891 0 23 11 66
0 11 0 47672 7202992 7256400 0 672856 2327 1616 0 22 7 71
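One thing worth trying for the stale-array problem above (a hedged sketch; the partition names are placeholders for the array's actual members): point --zero-superblock at each member device rather than at the assembled /dev/md3, since the md superblocks live on the component devices themselves. Printed as a dry run; drop the leading "echo" to actually wipe.

```shell
# Hypothetical member list -- substitute the real components of md3
# (e.g. the ten /dev/sd[c-l]1 partitions). Dry run only.
for part in sdc1 sdd1 sde1; do
    echo mdadm --zero-superblock "/dev/$part"
done
```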
Re: New XFS benchmarks using David Chinner's recommendations for XFS-based optimizations.
On Fri, 4 Jan 2008, Changliang Chen wrote: Hi Justin, From your report, it looks like the p34-default's behavior is better; which item makes you consider that p34-dchinner looks nice? -- Best Regards The re-write and sequential input and output are faster for dchinner. Justin.
Re: Change Stripe size?
On Mon, 31 Dec 2007, Greg Cormier wrote: So I've been slowly expanding my knowledge of mdadm/linux raid. I've got a 1 terabyte array which stores mostly large media files, and from my reading, increasing the stripe size should really help my performance. Is there any way to do this to an existing array, or will I need to back up the array and re-create it with a larger stripe size? Thanks and happy new year! Greg Backing up the array and re-creating it is currently the only way with sw raid, AFAIK. Justin.
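A sketch of that backup-and-recreate path (all device names, mount points, and the example chunk size are placeholders, not Greg's actual setup). It is printed as a dry run because the create step destroys the existing array, so the backup step is not optional.

```shell
# Example only: 4-disk raid5 rebuilt with a 1024 KiB chunk.
chunk_kib=1024
create="mdadm --create /dev/md3 --level=5 --raid-devices=4 --chunk=$chunk_kib --run /dev/sd[b-e]1"
echo "1) back up:  rsync -a /array/ /backup/"
echo "2) stop:     mdadm --stop /dev/md3"
echo "3) recreate: $create"
echo "4) mkfs the new array and restore the data"
```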
AS/CFQ/DEADLINE/NOOP on Linux SW RAID?
When setting the scheduler, is it possible to set it on /dev/mdX, or is it only possible to set it on the underlying devices which compose the sw raid device (/dev/sda, /dev/sdb)? And does setting it on the underlying devices rather than mdX really affect how the data is accessed? Justin.
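For what it's worth, the elevator is a property of each physical disk's request queue, so it is normally set on the underlying sd devices; /dev/mdX does not expose a scheduler of its own. A dry-run sketch (device names are examples; it prints the commands rather than writing to sysfs):

```shell
# Set the deadline elevator on each member disk of the array.
for disk in sda sdb; do
    path="/sys/block/$disk/queue/scheduler"
    echo "echo deadline > $path"   # drop the outer echo to apply
done
```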
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Sat, 29 Dec 2007, dean gaudet wrote: On Tue, 25 Dec 2007, Bill Davidsen wrote: The issue I'm thinking about is hardware sector size, which on modern drives may be larger than 512b and therefore entail a read-alter-rewrite (RAR) cycle when writing a 512b block. i'm not sure any shipping SATA disks have larger than 512B sectors yet... do you know of any? (or is this thread about SCSI which i don't pay attention to...) on a brand new WDC WD7500AAKS-00RBA0 with this partition layout: 255 heads, 63 sectors/track, 91201 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes so sda1 starts at a non-multiple of 4096 into the disk. i ran some random seek+write experiments using http://arctic.org/~dean/randomio/, here are the results using 512 byte and 4096 byte writes (fsync after each write), 8 threads, on sda1:

# ./randomio /dev/sda1 8 1 1 512 10 6
  total |  read:   latency (ms)            |  write:  latency (ms)
   iops |  iops   min    avg   max   sdev  |  iops   min   avg    max   sdev
--------+----------------------------------+-----------------------------------
  148.5 |   0.0   inf    nan   0.0   nan   | 148.5   0.2  53.7   89.3   19.5
  129.2 |   0.0   inf    nan   0.0   nan   | 129.2  37.2  61.9   96.7    9.3
  131.2 |   0.0   inf    nan   0.0   nan   | 131.2  40.3  61.0   90.4    9.3
  132.0 |   0.0   inf    nan   0.0   nan   | 132.0  39.6  60.6   89.3    9.1
  130.7 |   0.0   inf    nan   0.0   nan   | 130.7  39.8  61.3   98.1    8.9
  131.4 |   0.0   inf    nan   0.0   nan   | 131.4  40.0  60.8  101.0    9.6

# ./randomio /dev/sda1 8 1 1 4096 10 6
  total |  read:   latency (ms)            |  write:  latency (ms)
   iops |  iops   min    avg   max   sdev  |  iops   min   avg    max   sdev
--------+----------------------------------+-----------------------------------
  141.7 |   0.0   inf    nan   0.0   nan   | 141.7   0.3  56.3   99.3   21.1
  132.4 |   0.0   inf    nan   0.0   nan   | 132.4  43.3  60.4   91.8    8.5
  131.6 |   0.0   inf    nan   0.0   nan   | 131.6  41.4  60.9  111.0    9.6
  131.8 |   0.0   inf    nan   0.0   nan   | 131.8  41.4  60.7   85.3    8.6
  130.6 |   0.0   inf    nan   0.0   nan   | 130.6  41.7  61.3   95.0    9.4
  131.4 |   0.0   inf    nan   0.0   nan   | 131.4  42.2  60.8   90.5    8.4

i think the anomalous results in the first 10s samples are perhaps the drive coming out of a standby state.
and here are the results aligned using the sda raw device itself:

# ./randomio /dev/sda 8 1 1 512 10 6
  total |  read:   latency (ms)            |  write:  latency (ms)
   iops |  iops   min    avg   max   sdev  |  iops   min   avg    max   sdev
--------+----------------------------------+-----------------------------------
  147.3 |   0.0   inf    nan   0.0   nan   | 147.3   0.3  54.1   93.7   20.1
  132.4 |   0.0   inf    nan   0.0   nan   | 132.4  37.4  60.6   91.8    9.2
  132.5 |   0.0   inf    nan   0.0   nan   | 132.5  37.7  60.3   93.7    9.3
  131.8 |   0.0   inf    nan   0.0   nan   | 131.8  39.4  60.7   92.7    9.0
  133.9 |   0.0   inf    nan   0.0   nan   | 133.9  41.7  59.8   90.7    8.5
  130.2 |   0.0   inf    nan   0.0   nan   | 130.2  40.8  61.5   88.6    8.9

# ./randomio /dev/sda 8 1 1 4096 10 6
  total |  read:   latency (ms)            |  write:  latency (ms)
   iops |  iops   min    avg   max   sdev  |  iops   min   avg    max   sdev
--------+----------------------------------+-----------------------------------
  145.4 |   0.0   inf    nan   0.0   nan   | 145.4   0.3  54.9   94.0   20.1
  130.3 |   0.0   inf    nan   0.0   nan   | 130.3  36.0  61.4   92.7    9.6
  130.6 |   0.0   inf    nan   0.0   nan   | 130.6  38.2  61.2   96.7    9.2
  132.1 |   0.0   inf    nan   0.0   nan   | 132.1  39.0  60.5   93.5    9.2
  131.8 |   0.0   inf    nan   0.0   nan   | 131.8  43.1  60.8   93.8    9.1
  129.0 |   0.0   inf    nan   0.0   nan   | 129.0  40.2  62.0   96.4    8.8

it looks pretty much the same to me... -dean Good to know / have it confirmed by someone else: the alignment does not matter with Linux SW RAID. Justin.
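The alignment arithmetic behind dean's test is easy to check: a partition starting at the classic sector 63 (from the 255-head/63-sector CHS layout) begins at a byte offset that is not a multiple of 4096. A quick shell sketch:

```shell
# Sector 63 * 512 bytes/sector = byte offset of the partition start.
start_sector=63
offset=$(( start_sector * 512 ))
if [ $(( offset % 4096 )) -eq 0 ]; then
    msg="4 KiB aligned"
else
    msg="not 4 KiB aligned"
fi
echo "sector $start_sector starts at byte $offset: $msg"
```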
Re: 2.6.24-rc6 reproducible raid5 hang
On Sat, 29 Dec 2007, dean gaudet wrote: On Sat, 29 Dec 2007, Dan Williams wrote: On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote: hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on the same 64k chunk array and had raised the stripe_cache_size to 1024... and got a hang. this time i grabbed stripe_cache_active before bumping the size again -- it was only 905 active. as i recall the bug we were debugging a year+ ago the active was at the size when it would hang. so this is probably something new. I believe I am seeing the same issue and am trying to track down whether XFS is doing something unexpected, i.e. I have not been able to reproduce the problem with EXT3. MD tries to increase throughput by letting some stripe work build up in batches. It looks like every time your system has hung it has been in the 'inactive_blocked' state i.e. 3/4 of stripes active. This state should automatically clear... cool, glad you can reproduce it :) i have a bit more data... i'm seeing the same problem on debian's 2.6.22-3-amd64 kernel, so it's not new in 2.6.24. i'm doing some more isolation but just grabbing kernels i have precompiled so far -- a 2.6.19.7 kernel doesn't show the problem, and early indications are a 2.6.21.7 kernel also doesn't have the problem but i'm giving it longer to show its head. i'll try a stock 2.6.22 next depending on how the 2.6.21 test goes, just so we get the debian patches out of the way. i was tempted to blame async api because it's newish :) but according to the dmesg output it doesn't appear the 2.6.22-3-amd64 kernel used async API, and it still hung, so async is probably not to blame. anyhow the test case i'm using is the dma_thrasher script i attached... it takes about an hour to give me confidence there's no problems so this will take a while. 
-dean Dean, Curious, btw: what filesystem size, raid type (5, but defaults I assume; nothing special, right? right-symmetric vs. left-symmetric, etc.), cache size, and chunk size(s) are you using/testing with? With the script you sent out earlier, are you able to reproduce it easily with 31 or so kernel tar decompressions? Justin.
Re: 2.6.24-rc6 reproducible raid5 hang
On Thu, 27 Dec 2007, dean gaudet wrote: hey neil -- remember that raid5 hang which me and only one or two others ever experienced and which was hard to reproduce? we were debugging it well over a year ago (that box has 400+ day uptime now so at least that long ago :) the workaround was to increase stripe_cache_size... i seem to have a way to reproduce something which looks much the same. setup:
- 2.6.24-rc6
- system has 8GiB RAM but no swap
- 8x750GB in a raid5 with one spare, chunksize 1024KiB
- mkfs.xfs default options
- mount -o noatime
- dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
that sequence hangs for me within 10 seconds... and i can unhang / rehang it by toggling between stripe_cache_size 256 and 1024. i detect the hang by watching iostat -kx /dev/sd? 5. i've attached the kernel log where i dumped task and timer state while it was hung... note that you'll see at some point i did an xfs mount with external journal but it happens with internal journal as well. looks like it's using the raid456 module and async api. anyhow let me know if you need more info / have any suggestions. -dean With a chunk size that large (1024 KiB), the stripe_cache_size needs to be greater than the default to handle it. Justin.
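A rough sizing note on that: the stripe cache is commonly described as holding stripe_cache_size pages (4 KiB each) per member device, so its memory cost grows with both the setting and the disk count. Sketched for dean's 8-disk setup (the formula is the usual rule of thumb from the md docs, stated here as an assumption):

```shell
# Approximate memory used by the raid5 stripe cache:
# stripe_cache_size * 4096 bytes * number of member disks.
nr_disks=8
stripe_cache_size=1024
bytes=$(( stripe_cache_size * 4096 * nr_disks ))
mib=$(( bytes / 1024 / 1024 ))
echo "stripe cache: $mib MiB"
```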
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Thu, 20 Dec 2007, Bill Davidsen wrote: Justin Piszcz wrote: On Wed, 19 Dec 2007, Bill Davidsen wrote: I'm going to try another approach, I'll describe it when I get results (or not). http://home.comcast.net/~jpiszcz/align_vs_noalign/ Hardly any difference whatsoever; only on the per-char read/write is it any faster. Am I misreading what you are doing here... you have the underlying data on the actual hardware devices 64k aligned by using either the whole device or starting a partition on a 64k boundary? I'm dubious that you will see a difference any other way, after all the translations take place. I'm trying to create a raid array using loop devices created with the offset parameter, but I suspect that I will wind up doing a test after just repartitioning the drives, painful as that will be. Average of 3 runs taken:

$ cat align/*log|grep ,
p63,8G,57683,94,86479,13,55242,8,63495,98,147647,11,434.8,0,16:10:16/64,1334210,10,330,2,120,1,3978,10,312,2
p63,8G,57973,95,76702,11,50830,7,62291,99,136477,10,388.3,0,16:10:16/64,1252548,6,296,1,115,1,7927,20,373,2
p63,8G,57758,95,80847,12,52144,8,63874,98,144747,11,443.4,0,16:10:16/64,1242445,6,303,1,117,1,6767,17,359,2
$ cat noalign/*log|grep ,
p63,8G,57641,94,85494,12,55669,8,63802,98,146925,11,434.8,0,16:10:16/64,1353180,8,314,1,117,1,8684,22,283,2
p63,8G,57705,94,85929,12,56708,8,63855,99,143437,11,436.2,0,16:10:16/64,12211519,29,297,1,113,1,3218,8,325,2
p63,8G,57783,94,78226,11,48580,7,63487,98,137721,10,438.7,0,16:10:16/64,1243229,8,307,1,120,1,4247,11,313,2

-- Bill Davidsen [EMAIL PROTECTED] Woe unto the statesman who makes war without a reason that will still be valid when the war is over... Otto von Bismarck

1. In the first test I made partitions on each drive like I normally do. 2. In the second test I followed the EMC document on how to properly align the partitions, and Microsoft's document on how to calculate the correct offset; I used 512 for a 256k stripe.

Justin.
Linux RAID Partition Offset 63 cylinders / 30% performance hit?
The (up to) 30% figure is mentioned here: http://insights.oetiker.ch/linux/raidoptimization.html On http://forums.storagereview.net/index.php?showtopic=25786 a user writes about the problem: XP, and virtually every O/S and partitioning software of XP's day, by default places the first partition on a disk at sector 63. Being an odd number, and 31.5KB into the drive, it isn't ever going to align with any stripe size. This is an unfortunate industry standard. Vista, on the other hand, aligns the first partition on sector 2048 by default as a by-product of its revisions to support large-sector-sized hard drives. As RAID5 arrays in write mode mimic the performance characteristics of large-sector-size hard drives, this comes as a great if not inadvertent benefit. 2048 is evenly divisible by 2 and 4 (allowing for 3 and 5 drive arrays optimally) and virtually every stripe size in common use. If you are however using a 4-drive RAID5, you're SOOL. Page 9 in this PDF (EMC_BestPractice_R22.pdf) shows the problem graphically: http://bbs.doit.com.cn/attachment.php?aid=6757

-- Now to my setup / question:

# fdisk -l /dev/sdc
Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a
Device Boot Start End Blocks Id System
/dev/sdc1 1 18241 146520801 fd Linux raid autodetect

--- If I use a 10-disk RAID5 with a 1024 KiB stripe, what would be the correct start and end size if I wanted to make sure the RAID5 was stripe aligned? Or is there a better way to do this; does parted handle this situation better? What is the best (and correct) way to calculate stripe alignment on the RAID5 device itself? --- The EMC paper recommends: Disk partition adjustment for Linux systems: In Linux, align the partition table before data is written to the LUN, as the partition map will be rewritten and all data on the LUN destroyed.
In the following example, the LUN is mapped to /dev/emcpowerah, and the LUN stripe element size is 128 blocks. Arguments for the fdisk utility are as follows:

fdisk /dev/emcpowerah
x   # expert mode
b   # adjust starting block number
1   # choose partition 1
128 # set it to 128, our stripe element size
w   # write the new partition

--- Does this also apply to Linux/SW RAID5? Or are there any caveats that are not taken into account since it is based in SW vs. HW? --- What it currently looks like:

Command (m for help): x
Expert command (m for help): p
Disk /dev/sdc: 255 heads, 63 sectors, 18241 cylinders
Nr AF Hd Sec Cyl Hd Sec Cyl Start Size ID
1 00 1 10 254 63 1023 63 293041602 fd
2 00 0 00 0 00 0 0 00
3 00 0 00 0 00 0 0 00
4 00 0 00 0 00 0 0 00

Justin.
Re: Raid over 48 disks
On Wed, 19 Dec 2007, Bill Davidsen wrote: Thiemo Nagel wrote: Performance of the raw device is fair:

# dd if=/dev/md2 of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s

Somewhat less through ext3 (created with -E stride=64):

# dd if=largetestfile of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s

Quite slow? 10 disks (raptors) raid 5 on regular sata controllers:

# dd if=/dev/md3 of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s
# dd if=bigfile of=/dev/zero bs=128k count=64k
3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s

Interesting. Any ideas what could be the reason? How much do you get from a single drive? -- The Samsung HD501LJ that I'm using gives ~84MB/s when reading from the beginning of the disk. With RAID 5 I'm getting slightly better results (though I really wonder why, since naively I would expect identical read performance), but that does only account for a small part of the difference:

              16k read          64k write
chunk size   RAID 5  RAID 6   RAID 5  RAID 6
128k            492     497      268     270
256k            615     530      288     270
512k            625     607      230     174
1024k           650     620      170      75

What is your stripe cache size?

# Set stripe_cache_size for RAID5.
echo Setting stripe_cache_size to 16 MiB for /dev/md3
echo 16384 > /sys/block/md3/md/stripe_cache_size

Justin.
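The echo above targets a single array; a small helper in the same spirit applies it to every md array present. This is only a sketch: the optional root argument is my addition (so the function can be exercised against a scratch directory), and real use would rely on the /sys/block layout shown in the post.

```shell
# set_stripe_cache SIZE [ROOT]
# Write SIZE into every array's stripe_cache_size file under ROOT
# (default /sys/block). Arrays without the file are skipped.
set_stripe_cache() {
  size=$1
  root=${2:-/sys/block}
  for f in "$root"/md*/md/stripe_cache_size; do
    [ -e "$f" ] && echo "$size" > "$f"
  done
}

# e.g. on a live system (as root): set_stripe_cache 16384
```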
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Wed, 19 Dec 2007, Mattias Wadenstein wrote: On Wed, 19 Dec 2007, Justin Piszcz wrote: -- Now to my setup / question:

# fdisk -l /dev/sdc
Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a
Device Boot Start End Blocks Id System
/dev/sdc1 1 18241 146520801 fd Linux raid autodetect

--- If I use a 10-disk RAID5 with a 1024 KiB stripe, what would be the correct start and end size if I wanted to make sure the RAID5 was stripe aligned? Or is there a better way to do this; does parted handle this situation better? From that setup it seems simple: scrap the partition table and use the disk device for raid. This is what we do for all data storage disks (hw raid) and sw raid members. /Mattias Wadenstein

Is there any downside to doing that? I remember when I had to take my machine apart for a BIOS downgrade; when I plugged in the sata devices again I did not plug them back in the same order. Everything worked, of course, but when I ran LILO it said the disk was not part of the RAID set (because /dev/sda had become /dev/sdg) and it overwrote the MBR on the disk. If I had not used partitions here, I'd have lost one (or more) of the drives due to a bad LILO run?

Justin.
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Wed, 19 Dec 2007, Jon Nelson wrote: On 12/19/07, Justin Piszcz [EMAIL PROTECTED] wrote: On Wed, 19 Dec 2007, Mattias Wadenstein wrote: From that setup it seems simple, scrap the partition table and use the disk device for raid. This is what we do for all data storage disks (hw raid) and sw raid members. /Mattias Wadenstein Is there any downside to doing that? I remember when I had to take my There is one (just pointed out to me yesterday): having the partition and having it labeled as raid makes identification quite a bit easier for humans and software, too. -- Jon Some nice graphs found here: http://sqlblog.com/blogs/linchi_shea/archive/2007/02/01/performance-impact-of-disk-misalignment.aspx - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Wed, 19 Dec 2007, Bill Davidsen wrote: Justin Piszcz wrote: On Wed, 19 Dec 2007, Mattias Wadenstein wrote: [snip] From that setup it seems simple: scrap the partition table and use the disk device for raid. This is what we do for all data storage disks (hw raid) and sw raid members. /Mattias Wadenstein Is there any downside to doing that? I remember when I had to take my machine apart for a BIOS downgrade; when I plugged in the sata devices again I did not plug them back in the same order. Everything worked, of course, but when I ran LILO it said the disk was not part of the RAID set (because /dev/sda had become /dev/sdg) and it overwrote the MBR on the disk. If I had not used partitions here, I'd have lost one (or more) of the drives due to a bad LILO run? As other posts have detailed, putting the partition on a 64k aligned boundary can address the performance problems. However, a poor choice of chunk size, cache_buffer size, or just random i/o in small sizes can eat up a lot of the benefit. I don't think you need to give up your partitions to get the benefit of alignment. -- Bill Davidsen [EMAIL PROTECTED] Woe unto the statesman who makes war without a reason that will still be valid when the war is over... Otto von Bismarck

Hrmm.. I am doing a benchmark now with: 6 x 400GB (SATA) / 256 KiB stripe, with unaligned vs. aligned raid setup. unaligned: just fdisk /dev/sdc, make a partition, type fd raid.
aligned: fdisk, expert mode, start at 512 as the offset. Per a Microsoft KB, example alignment calculations in kilobytes for a 256-KB stripe unit size:

(63 * .5) / 256 = 0.123046875
(64 * .5) / 256 = 0.125
(128 * .5) / 256 = 0.25
(256 * .5) / 256 = 0.5
(512 * .5) / 256 = 1

These examples show that the partition is not aligned correctly for a 256-KB stripe unit size until the partition is created by using an offset of 512 sectors (512 bytes per sector). So I should start at 512 for a 256k chunk size. I ran bonnie++ three consecutive times and took the average for the unaligned case; I am rebuilding the RAID5 now, and then I will re-execute the test 3 additional times and take the average of that.

Justin.
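The KB arithmetic above boils down to checking whether the partition's byte offset is a whole multiple of the chunk size. A minimal sketch of that check (is_aligned is a made-up helper for illustration, not an fdisk feature):

```shell
# is_aligned START_SECTORS CHUNK_KIB
# A partition starting at START_SECTORS (512-byte sectors) is
# chunk-aligned when start*512 bytes is a whole multiple of the
# chunk size in bytes.
is_aligned() {
  if [ $(( $1 * 512 % ($2 * 1024) )) -eq 0 ]; then
    echo aligned
  else
    echo misaligned
  fi
}

is_aligned 63 256    # the old DOS default start: 31.5 KiB in, misaligned
is_aligned 512 256   # start at sector 512: exactly one 256 KiB chunk
is_aligned 2048 256  # the Vista-style default also passes
```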
Re: help diagnosing bad disk
On Wed, 19 Dec 2007, Jon Sabo wrote: So I was trying to copy over some Indiana Jones wav files and it wasn't going my way. I noticed that my software raid device showed: /dev/md1 on / type ext3 (rw,errors=remount-ro) Is this saying that it was remounted read-only because it found a problem with the md1 meta device? That's what it looks like it's saying, but I can still write to /. mdadm --detail showed:

[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:14 2007
     Raid Level : raid1
     Array Size : 1951744 (1906.32 MiB 1998.59 MB)
    Device Size : 1951744 (1906.32 MiB 1998.59 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Wed Dec 19 12:59:56 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
           UUID : 157f716c:0e7aebca:c20741f6:bb6099c9
         Events : 0.28

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        1      removed

[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:47 2007
     Raid Level : raid1
     Array Size : 974808064 (929.65 GiB 998.20 GB)
    Device Size : 974808064 (929.65 GiB 998.20 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent
    Update Time : Wed Dec 19 13:14:53 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
           UUID : 156a030e:9a6f8eb3:9b0c439e:d718e744
         Events : 0.1990

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       0        0        1      removed

I have two 1 terabyte sata drives in this box. From what I was reading, wouldn't it show an F for the failed drive? I thought I would see that /dev/sdb1 and /dev/sdb2 were failed and it would show an F. What is this saying, and how do you know that it's /dev/sdb and not some other drive? It shows removed and that the state is clean, degraded.
Is that something you can recover from without returning this disk and putting in a new one to add to the raid1 array?

mdadm /dev/md1 -a /dev/sdb2

to re-add it back into the array. What does cat /proc/mdstat show? I would also like to see: smartctl -a /dev/sdb

Justin.
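For scripting or monitoring, the degraded state shown above can be pulled straight out of `mdadm --detail` output. A sketch that parses a saved copy of the output (the file argument is only so the helper can be tried without a real array):

```shell
# array_state FILE
# Print the "State :" field from saved `mdadm --detail` output,
# e.g. "clean" or "clean, degraded".
array_state() {
  sed -n 's/^[[:space:]]*State :[[:space:]]*//p' "$1"
}

# e.g.: mdadm --detail /dev/md1 > /tmp/md1.txt && array_state /tmp/md1.txt
```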
Re: help diagnosing bad disk
On Wed, 19 Dec 2007, Jon Sabo wrote: I found the problem. The power was unplugged from the drive. The sata power connectors aren't very good at securing the connector. I reattached the power connector to the sata drive and booted up. This is what it looks like now:

[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:14 2007
     Raid Level : raid1
     Array Size : 1951744 (1906.32 MiB 1998.59 MB)
    Device Size : 1951744 (1906.32 MiB 1998.59 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Wed Dec 19 13:48:12 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
           UUID : 157f716c:0e7aebca:c20741f6:bb6099c9
         Events : 0.44

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        1      removed

[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:47 2007
     Raid Level : raid1
     Array Size : 974808064 (929.65 GiB 998.20 GB)
    Device Size : 974808064 (929.65 GiB 998.20 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent
    Update Time : Wed Dec 19 13:50:02 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
           UUID : 156a030e:9a6f8eb3:9b0c439e:d718e744
         Events : 0.1498340

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       18        1      active sync   /dev/sdb2

How do I put it back into the correct state? Thanks!

mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sda2

Weird that they got out of sync on different drives.

Justin.
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Wed, 19 Dec 2007, Bill Davidsen wrote: Justin Piszcz wrote: On Wed, 19 Dec 2007, Bill Davidsen wrote: [snip] As other posts have detailed, putting the partition on a 64k aligned boundary can address the performance problems. However, a poor choice of chunk size, cache_buffer size, or just random i/o in small sizes can eat up a lot of the benefit. I don't think you need to give up your partitions to get the benefit of alignment. -- Bill Davidsen [EMAIL PROTECTED] Woe unto the statesman who makes war without a reason that will still be valid when the war is over... Otto von Bismarck

Hrmm.. I am doing a benchmark now with: 6 x 400GB (SATA) / 256 KiB stripe, with unaligned vs. aligned raid setup.
unaligned: just fdisk /dev/sdc, make a partition, type fd raid. aligned: fdisk, expert mode, start at 512 as the offset. Per a Microsoft KB, example alignment calculations in kilobytes for a 256-KB stripe unit size:

(63 * .5) / 256 = 0.123046875
(64 * .5) / 256 = 0.125
(128 * .5) / 256 = 0.25
(256 * .5) / 256 = 0.5
(512 * .5) / 256 = 1

These examples show that the partition is not aligned correctly for a 256-KB stripe unit size until the partition is created by using an offset of 512 sectors (512 bytes per sector). So I should start at 512 for a 256k chunk size. I ran bonnie++ three consecutive times and took the average for the unaligned case; I am rebuilding the RAID5 now, and then I will re-execute the test 3 additional times and take the average of that. I'm going to try another approach, I'll describe it when I get results (or not). Waiting for the raid to rebuild, then I will re-run thereafter:

[=...] recovery = 86.7% (339104640/390708480) finish=30.8min speed=27835K/sec
...
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Wed, 19 Dec 2007, Bill Davidsen wrote: I'm going to try another approach, I'll describe it when I get results (or not). http://home.comcast.net/~jpiszcz/align_vs_noalign/ Hardly any difference whatsoever; only on the per-char read/write is it any faster. Average of 3 runs taken:

$ cat align/*log|grep ,
p63,8G,57683,94,86479,13,55242,8,63495,98,147647,11,434.8,0,16:10:16/64,1334210,10,330,2,120,1,3978,10,312,2
p63,8G,57973,95,76702,11,50830,7,62291,99,136477,10,388.3,0,16:10:16/64,1252548,6,296,1,115,1,7927,20,373,2
p63,8G,57758,95,80847,12,52144,8,63874,98,144747,11,443.4,0,16:10:16/64,1242445,6,303,1,117,1,6767,17,359,2
$ cat noalign/*log|grep ,
p63,8G,57641,94,85494,12,55669,8,63802,98,146925,11,434.8,0,16:10:16/64,1353180,8,314,1,117,1,8684,22,283,2
p63,8G,57705,94,85929,12,56708,8,63855,99,143437,11,436.2,0,16:10:16/64,12211519,29,297,1,113,1,3218,8,325,2
p63,8G,57783,94,78226,11,48580,7,63487,98,137721,10,438.7,0,16:10:16/64,1243229,8,307,1,120,1,4247,11,313,2
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Wed, 19 Dec 2007, Robin Hill wrote: On Wed Dec 19, 2007 at 09:50:16AM -0500, Justin Piszcz wrote: The (up to) 30% figure is mentioned here: http://insights.oetiker.ch/linux/raidoptimization.html That looks to be referring to partitioning a RAID device - this'll only apply to hardware RAID or partitionable software RAID, not to the normal use case. When you're creating an array out of standard partitions then you know the array stripe size will align with the disks (there's no way it cannot), and you can set the filesystem stripe size to align as well (XFS will do this automatically). I've actually done tests on this with hardware RAID to try to find the correct partition offset, but wasn't able to see any difference (using bonnie++ and moving the partition start by one sector at a time).

# fdisk -l /dev/sdc
Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a
Device Boot Start End Blocks Id System
/dev/sdc1 1 18241 146520801 fd Linux raid autodetect

This looks to be a normal disk - the partition offsets shouldn't be relevant here (barring any knowledge of the actual physical disk layout anyway, and block remapping may well make that rather irrelevant). That's my take on this one anyway. Cheers, Robin -- Robin Hill [EMAIL PROTECTED]

Interesting; yes, I am using XFS as well. Thanks for the response.
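As Robin notes, XFS picks up the stripe geometry automatically on md devices. When creating the filesystem by hand, the equivalent is the su/sw options to mkfs.xfs: su is the chunk size, sw the number of data-bearing disks. A small sketch of deriving them for a RAID5 (the helper name is made up for illustration; the one-parity-disk-per-stripe assumption is what makes sw = n - 1):

```shell
# xfs_stripe_opts CHUNK_KIB NDISKS
# Build the mkfs.xfs -d geometry string for an n-disk raid5,
# where one disk's worth of each stripe holds parity.
xfs_stripe_opts() {
  echo "su=${1}k,sw=$(( $2 - 1 ))"
}

xfs_stripe_opts 256 10   # 10-disk raid5, 256 KiB chunk
# e.g.: mkfs.xfs -d "$(xfs_stripe_opts 256 10)" /dev/md3
```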
Re: Raid over 48 disks
On Tue, 18 Dec 2007, Norman Elton wrote: We're investigating the possibility of running Linux (RHEL) on top of Sun's X4500 Thumper box: http://www.sun.com/servers/x64/x4500/ Basically, it's a server with 48 SATA hard drives. No hardware RAID. It's designed for Sun's ZFS filesystem. So... we're curious how Linux will handle such a beast. Has anyone run MD software RAID over so many disks? Then piled LVM/ext3 on top of that? Any suggestions? Are we crazy to think this is even possible? Thanks! Norman Elton

It sounds VERY fun and exciting if you ask me! The most disks I've used when testing SW RAID was 10, with various raid settings. With that many drives you'd want RAID6 or RAID10 for sure, in case more than one fails at the same time, and definitely XFS/JFS/EXT4(?), as EXT3 is capped to 8TB. I'd be curious what kind of aggregate bandwidth you can get off of it with that many drives.

Justin.
Re: Raid over 48 disks
On Tue, 18 Dec 2007, Thiemo Nagel wrote: Dear Norman, So... we're curious how Linux will handle such a beast. Has anyone run MD software RAID over so many disks? Then piled LVM/ext3 on top of that? Any suggestions? Are we crazy to think this is even possible? I'm running 22x 500GB disks attached to RocketRaid2340 and NFORCE-MCP55 onboard controllers on an Athlon DC 5000+ with 1GB RAM: 9746150400 blocks super 1.2 level 6, 256k chunk, algorithm 2 [22/22] Performance of the raw device is fair: # dd if=/dev/md2 of=/dev/zero bs=128k count=64k 65536+0 records in 65536+0 records out 8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s Somewhat less through ext3 (created with -E stride=64): # dd if=largetestfile of=/dev/zero bs=128k count=64k 65536+0 records in 65536+0 records out 8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s There were no problems up to now. (mkfs.ext3 wants -F to create a filesystem larger than 8TB. The hard maximum is 16TB, so you will need to create partitions, if your drives are larger than 350GB...) Kind regards, Thiemo Nagel - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Quite slow? 10 disks (raptors) raid 5 on regular sata controllers: # dd if=/dev/md3 of=/dev/zero bs=128k count=64k 65536+0 records in 65536+0 records out 8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s # dd if=bigfile of=/dev/zero bs=128k count=64k 27773+1 records in 27773+1 records out 3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Raid over 48 disks
On Tue, 18 Dec 2007, Thiemo Nagel wrote: Performance of the raw device is fair:

# dd if=/dev/md2 of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s

Somewhat less through ext3 (created with -E stride=64):

# dd if=largetestfile of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s

Quite slow? 10 disks (raptors) raid 5 on regular sata controllers:

# dd if=/dev/md3 of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s
# dd if=bigfile of=/dev/zero bs=128k count=64k
3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s

Interesting. Any ideas what could be the reason? How much do you get from a single drive? -- The Samsung HD501LJ that I'm using gives ~84MB/s when reading from the beginning of the disk. With RAID 5 I'm getting slightly better results (though I really wonder why, since naively I would expect identical read performance), but that does only account for a small part of the difference:

              16k read          64k write
chunk size   RAID 5  RAID 6   RAID 5  RAID 6
128k            492     497      268     270
256k            615     530      288     270
512k            625     607      230     174
1024k           650     620      170      75

Kind regards, Thiemo

# dd if=/dev/sdc of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 13.8108 seconds, 77.7 MB/s

With more than 2x the drives I'd think you'd have faster speed; perhaps the controller is the problem? I am using ICH8R (but the raid within linux) and 2-port SATA cards, each of which has its own dedicated bandwidth via the PCI-e bus. I have also tried 10 disks with sw RAID5 on 3ware controllers exporting the disks as JBOD; I saw similar performance for reads but not writes.

Justin.
Re: Raid over 48 disks
On Tue, 18 Dec 2007, Jon Nelson wrote: On 12/18/07, Thiemo Nagel [EMAIL PROTECTED] wrote: [snip] It strikes me that these numbers are meaningless without knowing if that is actual data-to-disk or data-to-memcache-and-some-to-disk-too. Later versions of 'dd' offer 'conv=fdatasync', which is really handy (it calls fdatasync on the output file, syncing JUST the one file, right before close). Otherwise, oflag=direct will (try to) bypass the page/block cache. I can get really impressive numbers, too (over 200MB/s on a single disk capable of 70MB/s) when I (mis)use dd without fdatasync, et al. The variation in reported performance can be really huge without understanding that you aren't actually testing the DISK I/O but *some* disk I/O and *some* memory caching.
Ok-- How's this for caching, a dd over the entire RAID device:

$ /usr/bin/time dd if=/dev/zero of=file bs=1M
dd: writing `file': No space left on device
1070704+0 records in
1070703+0 records out
1122713473024 bytes (1.1 TB) copied, 2565.89 seconds, 438 MB/s
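The effect Jon describes is easy to demonstrate without touching a raid array. A sketch that writes a small file with conv=fdatasync so dd's reported rate includes flushing to disk (GNU dd assumed; without the conv option the rate largely measures the page cache):

```shell
# Write 64 MiB through the page cache, then force it to stable
# storage before dd computes its transfer rate.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1M count=64 conv=fdatasync 2>/dev/null
wc -c < "$f"   # bytes actually written
rm -f "$f"
```

Filling the device, as in the run above, is the other honest approach: once the file dwarfs RAM, the cache can no longer hide the disk.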
RE: Raid over 48 disks
On Tue, 18 Dec 2007, Guy Watkins wrote: } -Original Message- } From: [EMAIL PROTECTED] [mailto:linux-raid- } [EMAIL PROTECTED] On Behalf Of Brendan Conoboy } Sent: Tuesday, December 18, 2007 3:36 PM } To: Norman Elton } Cc: linux-raid@vger.kernel.org } Subject: Re: Raid over 48 disks } } Norman Elton wrote: } We're investigating the possibility of running Linux (RHEL) on top of } Sun's X4500 Thumper box: } } http://www.sun.com/servers/x64/x4500/ } } Neat- 6 8-port SATA controllers! It'll be worth checking to be sure } each controller has equal bandwidth. If some controllers are on slower } buses than others you may want to consider that and balance the md } device layout.

Assuming the 6 controllers are equal, I would make 3 16-disk RAID6 arrays using 2 disks from each controller. That way any 1 controller can fail and your system will still be running. 6 disks will be used for redundancy. Or 6 8-disk RAID6 arrays using 1 disk from each controller. That way any 2 controllers can fail and your system will still be running. 12 disks will be used for redundancy. Might be too excessive! Combine them into a RAID0 array. Guy

I'd be curious what the maximum aggregate bandwidth would be with RAID 0 of 48 disks on that controller..
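Guy's two layouts trade capacity for failure tolerance, and the parity overhead he quotes follows from RAID6 spending two disks per array. A throwaway check of his numbers (the helper is mine, for illustration only):

```shell
# raid6_parity_disks ARRAYS
# Disks spent on parity when the 48 drives are split into ARRAYS
# raid6 sets: two parity disks per raid6 array.
raid6_parity_disks() {
  echo $(( $1 * 2 ))
}

raid6_parity_disks 3   # 3 x 16-disk raid6: 6 parity disks
raid6_parity_disks 6   # 6 x 8-disk raid6: 12 parity disks
```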
RE: Raid over 48 disks
On Tue, 18 Dec 2007, Justin Piszcz wrote: On Tue, 18 Dec 2007, Guy Watkins wrote: [snip] I'd be curious what the maximum aggregate bandwidth would be with RAID 0 of 48 disks on that controller..

A RAID 0 over all of the controllers, rather, if possible..
Re: optimal IO scheduler choice?
On Thu, 13 Dec 2007, Louis-David Mitterrand wrote: Hi, after reading some interesting suggestions on kernel tuning at: http://hep.kbfi.ee/index.php/IT/KernelTuning I am wondering whether 'deadline' is indeed the best IO scheduler (vs. anticipatory and cfq) for a soft raid5/6 partition on a server? What is the common wisdom on the subject among linux-raid users and developers? Thanks, I have found anticipatory to be the fastest. http://home.comcast.net/~jpiszcz/sched/cfq_vs_as_vs_deadline_vs_noop.html Sequential output with CFQ (horrid): 311,683 KiB/s. Output with AS: 443,103 KiB/s. For input (reads), CFQ is a little faster. It depends on your workload, I suppose. Justin.
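For reference, the scheduler can be inspected and switched per device at runtime through sysfs, so benchmarking each one needs no reboot. A minimal sketch; `SYSFS_ROOT` is parameterized here only so it can be exercised against a scratch directory, on a real box it is simply `/sys`:

```shell
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# The scheduler file lists every available elevator with the active one
# in brackets, e.g. "noop anticipatory [deadline] cfq".
current_scheduler() {
    sed 's/.*\[\(.*\)\].*/\1/' "$SYSFS_ROOT/block/$1/queue/scheduler"
}

# Switching is a plain write, e.g. set_scheduler sdc anticipatory
set_scheduler() {
    echo "$2" > "$SYSFS_ROOT/block/$1/queue/scheduler"
}
```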
Re: Reading takes 100% precedence over writes for mdadm+raid5?
On Thu, 6 Dec 2007, David Rees wrote: On Dec 6, 2007 1:06 AM, Justin Piszcz [EMAIL PROTECTED] wrote: On Wed, 5 Dec 2007, Jon Nelson wrote: I saw something really similar while moving some very large (300MB to 4GB) files. I was really surprised to see actual disk I/O (as measured by dstat) be really horrible. Any work-arounds, or just don't perform heavy reads the same time as writes? What kernel are you using? (Did I miss it in your OP?) The per-device write throttling in 2.6.24 should help significantly, have you tried the latest -rc and compared to your current kernel? -Dave 2.6.23.9-- thanks will try out the latest -rc or wait for 2.6.24! Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
On Thu, 6 Dec 2007, Andrew Morton wrote: On Sat, 1 Dec 2007 06:26:08 -0500 (EST) Justin Piszcz [EMAIL PROTECTED] wrote: I am putting a new machine together and I have dual raptor raid 1 for the root, which works just fine under all stress tests. Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on sale now adays): I ran the following: dd if=/dev/zero of=/dev/sdc dd if=/dev/zero of=/dev/sdd dd if=/dev/zero of=/dev/sde (as it is always a very good idea to do this with any new disk) And sometime along the way(?) (i had gone to sleep and let it run), this occurred: [42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x401 action 0x2 frozen Gee we're seeing a lot of these lately. [42880.680231] ata3.00: irq_stat 0x00400040, connection status changed [42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in [42880.680292] res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10 (ATA bus error) [42881.841899] ata3: soft resetting port [42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [42915.919042] ata3.00: qc timeout (cmd 0xec) [42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5) [42915.919149] ata3.00: revalidation failed (errno=-5) [42915.919206] ata3: failed to recover some devices, retrying in 5 secs [42920.912458] ata3: hard resetting port [42926.411363] ata3: port is slow to respond, please be patient (Status 0x80) [42930.943080] ata3: COMRESET failed (errno=-16) [42930.943130] ata3: hard resetting port [42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [42931.413523] ata3.00: configured for UDMA/133 [42931.413586] ata3: EH pending after completion, repeating EH (cnt=4) [42931.413655] ata3: EH complete [42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors (750156 MB) [42931.413809] sd 2:0:0:0: [sdc] Write Protect is off [42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00 [42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: 
enabled, doesn't support DPO or FUA Usually when I see this sort of thing with another box I have full of raptors, it was due to a bad raptor and I never saw it again after I replaced the disk that it happened on, but that was using the Intel P965 chipset. For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge). I am going to do some further testing but does this indicate a bad drive? Bad cable? Bad connector? As you can see above, /dev/sdc stopped responding for a little bit and then the kernel reset the port. Why is this though? What is the likely root cause? Should I replace the drive? Obviously this is not normal and cannot be good at all, the idea is to put these drives in a RAID5 and if one is going to timeout that is going to cause the array to go degraded and thus be worthless in a raid5 configuration. Can anyone offer any insight here? It would be interesting to try 2.6.21 or 2.6.22. This was due to NCQ issues (disabling it fixed the problem). Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
On Sat, 1 Dec 2007, Justin Piszcz wrote: On Sat, 1 Dec 2007, Janek Kozicki wrote: Justin Piszcz said: (by the date of Sat, 1 Dec 2007 07:23:41 -0500 (EST)) dd if=/dev/zero of=/dev/sdc The purpose is with any new disk its good to write to all the blocks and let the drive to all of the re-mapping before you put 'real' data on it. Let it crap out or fail before I put my data on it. better use badblocks. It writes data, then reads it afterwards: In this example the data is semi random (quicker than /dev/urandom ;) badblocks -c 10240 -s -w -t random -v /dev/sdc -- Janek Kozicki | - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Will give this a shot and see if I can reproduce the error, thanks. The badblocks did not do anything; however, when I built a software raid 5 and the performed a dd: /usr/bin/time dd if=/dev/zero of=fill_disk bs=1M I saw this somewhere along the way: [30189.967531] RAID5 conf printout: [30189.967576] --- rd:3 wd:3 [30189.967617] disk 0, o:1, dev:sdc1 [30189.967660] disk 1, o:1, dev:sdd1 [30189.967716] disk 2, o:1, dev:sde1 [42332.936615] ata5.00: exception Emask 0x2 SAct 0x7000 SErr 0x0 action 0x2 frozen [42332.936706] ata5.00: spurious completions during NCQ issue=0x0 SAct=0x7000 FIS=004040a1:0800 [42332.936804] ata5.00: cmd 61/08:60:6f:4d:2a/00:00:27:00:00/40 tag 12 cdb 0x0 data 4096 out [42332.936805] res 40/00:74:0f:49:2a/00:00:27:00:00/40 Emask 0x2 (HSM violation) [42332.936977] ata5.00: cmd 61/08:68:77:4d:2a/00:00:27:00:00/40 tag 13 cdb 0x0 data 4096 out [42332.936981] res 40/00:74:0f:49:2a/00:00:27:00:00/40 Emask 0x2 (HSM violation) [42332.937162] ata5.00: cmd 61/00:70:0f:49:2a/04:00:27:00:00/40 tag 14 cdb 0x0 data 524288 out [42332.937163] res 40/00:74:0f:49:2a/00:00:27:00:00/40 Emask 0x2 (HSM violation) [42333.240054] ata5: soft resetting port [42333.494462] ata5: SATA link up 3.0 Gbps (SStatus 123 
SControl 300) [42333.506592] ata5.00: configured for UDMA/133 [42333.506652] ata5: EH complete [42333.506741] sd 4:0:0:0: [sde] 1465149168 512-byte hardware sectors (750156 MB) [42333.506834] sd 4:0:0:0: [sde] Write Protect is off [42333.506887] sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00 [42333.506905] sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Next test, I will turn off NCQ and try to make the problem re-occur. If anyone else has any thoughts here..? I ran long smart tests on all 3 disks, they all ran successfully. Perhaps these drives need to be NCQ BLACKLISTED with the P35 chipset? Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Spontaneous rebuild
On Sun, 2 Dec 2007, Oliver Martin wrote: [Please CC me on replies as I'm not subscribed] Hello! I've been experimenting with software RAID a bit lately, using two external 500GB drives. One is connected via USB, one via Firewire. It is set up as a RAID5 with LVM on top so that I can easily add more drives when I run out of space. About a day after the initial setup, things went belly up. First, EXT3 reported strange errors: EXT3-fs error (device dm-0): ext3_new_block: Allocating block in system zone - blocks from 106561536, length 1 EXT3-fs error (device dm-0): ext3_new_block: Allocating block in system zone - blocks from 106561537, length 1 ... There were literally hundreds of these, and they came back immediately when I reformatted the array. So I tried ReiserFS, which worked fine for about a day. Then I got errors like these: ReiserFS: warning: is_tree_node: node level 0 does not match to the expected one 2 ReiserFS: dm-0: warning: vs-5150: search_by_key: invalid format found in block 69839092. Fsck? ReiserFS: dm-0: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [6 10 0x0 SD] Again, hundreds. So I ran badblocks on the LVM volume, and it reported some bad blocks near the end. Running badblocks on the md array worked, so I recreated the LVM stuff and attributed the failures to undervolting experiments I had been doing (this is my old laptop running as a server). Anyway, the problems are back: To test my theory that everything is alright with the CPU running within its specs, I removed one of the drives while copying some large files yesterday. Initially, everything seemed to work out nicely, and by the morning, the rebuild had finished. Again, I unmounted the filesystem and ran badblocks -svn on the LVM. 
It ran without gripes for some hours, but just now I saw md had started to rebuild the array again out of the blue: Dec 1 20:04:49 quassel kernel: usb 4-5.2: reset high speed USB device using ehci_hcd and address 4 Dec 2 01:06:02 quassel kernel: md: data-check of RAID array md0 Dec 2 01:06:02 quassel kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk. Dec 2 01:06:02 quassel kernel: md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for data-check. Dec 2 01:06:02 quassel kernel: md: using 128k window, over a total of 488383936 blocks. Dec 2 03:57:24 quassel kernel: usb 4-5.2: reset high speed USB device using ehci_hcd and address 4 I'm not sure the USB resets are related to the problem - device 4-5.2 is part of the array, but I get these sometimes at random intervals and they don't seem to hurt normally. Besides, the first one was long before the rebuild started, and the second one long afterwards. Any ideas why md is rebuilding the array? And could this be related to the bad blocks problem I had first? badblocks is still running, I'll post an update when it is finished. In the meantime, mdadm --detail /dev/md0 and mdadm --examine /dev/sd[bc]1 don't give me any clues as to what went wrong, both disks are marked as active sync, and the whole array is active, recovering. Before I forget, I'm running 2.6.23.1 with this config: http://stud4.tuwien.ac.at/~e0626486/config-2.6.23.1-hrt3-fw Thanks, Oliver - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html It rebuilds the array because 'something' is causing device resets/timeouts on your USB device: Dec 1 20:04:49 quassel kernel: usb 4-5.2: reset high speed USB device using ehci_hcd and address 4 Naturally, when it is reset, the device is disconnected and then re-appears, when MD see's this it rebuilds the array. 
Why it is timing out/resetting the device is what you need to find out. Justin.
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
On Sun, 2 Dec 2007, Janek Kozicki wrote: Justin Piszcz said: (by the date of Sun, 2 Dec 2007 04:11:59 -0500 (EST)) The badblocks did not do anything; however, when I built a software raid 5 and the performed a dd: /usr/bin/time dd if=/dev/zero of=fill_disk bs=1M I saw this somewhere along the way: [42332.936706] ata5.00: spurious completions during NCQ issue=0x0 SAct=0x7000 FIS=004040a1:0800 [42333.240054] ata5: soft resetting port I know nothing about NCQ ;) But I find it interesting that *slower* access worked fine while *fast* access didn't. If I understand you correctly: - badblocks is slower, and you said that it worked flawlessly, right? - getting from /dev/zero is the fastest thing you can do, and it fails... I'd check jumpers on HDD and if there is any, set it to 1.5 Gb speed instead of default 3.0 Gb. Or sth. along that way. I remember seeing such jumper on one of my HDDs (I don't remember the exact speed numbers though). Also on one forum I remember about problems occurring when HDD was working at maximum speed, which was faster than the IO controller could handle. I dunno. It's just what came to my mind... -- Janek Kozicki | - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Thanks for the suggestions, but BTW NCQ OFF on (raptors anyway) is 30 to 50 megabytes per second faster in a RAID 5 configuration. NCQ slows things down for those disks. There are no jumpers (by default) on the 750GB WD Caviar's btw.. So far with NCQ off I've been pounding the disks and have not been able to reproduce the error but with NCQ on and some DD's or some raid creations, it is reproducible (or appears to be)-- did it twice. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
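For anyone wanting to repeat the NCQ-off test: on libata, NCQ can be effectively disabled per drive by dropping its queue depth to 1, with no kernel rebuild. A sketch, with `SYSFS_ROOT` parameterized only so the function can be tried against a scratch tree instead of a live `/sys`:

```shell
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# A queue_depth of 1 allows only one outstanding command, i.e. NCQ is
# effectively off; restore the old value (commonly 31) to re-enable it.
set_queue_depth() {
    echo "$2" > "$SYSFS_ROOT/block/$1/device/queue_depth"
}

# e.g.  set_queue_depth sdc 1    # NCQ off
#       set_queue_depth sdc 31   # NCQ back on
```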
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
On Mon, 3 Dec 2007, Michael Tokarev wrote: Justin Piszcz said: (by the date of Sun, 2 Dec 2007 04:11:59 -0500 (EST)) The badblocks did not do anything; however, when I built a software raid 5 and then performed a dd: /usr/bin/time dd if=/dev/zero of=fill_disk bs=1M I saw this somewhere along the way: [42332.936706] ata5.00: spurious completions during NCQ issue=0x0 SAct=0x7000 FIS=004040a1:0800 [42333.240054] ata5: soft resetting port There's some (probably timing-related) bug with spurious completions during NCQ. A lot of people are seeing this same effect with different drives and controllers. Tejun is working on it. It's difficult to reproduce. Search for spurious completion - there are many hits... /mjt Thanks, will check it out.
Reading takes 100% precedence over writes for mdadm+raid5?
root 2206 1 4 Dec02 ?00:10:37 dd if /dev/zero of 1.out bs 1M root 2207 1 4 Dec02 ?00:10:38 dd if /dev/zero of 2.out bs 1M root 2208 1 4 Dec02 ?00:10:35 dd if /dev/zero of 3.out bs 1M root 2209 1 4 Dec02 ?00:10:45 dd if /dev/zero of 4.out bs 1M root 2210 1 4 Dec02 ?00:10:35 dd if /dev/zero of 5.out bs 1M root 2211 1 4 Dec02 ?00:10:35 dd if /dev/zero of 6.out bs 1M root 2212 1 4 Dec02 ?00:10:30 dd if /dev/zero of 7.out bs 1M root 2213 1 4 Dec02 ?00:10:42 dd if /dev/zero of 8.out bs 1M root 2214 1 4 Dec02 ?00:10:35 dd if /dev/zero of 9.out bs 1M root 2215 1 4 Dec02 ?00:10:37 dd if /dev/zero of 10.out bs 1M root 3080 24.6 0.0 10356 1672 ?D01:22 5:51 dd if /dev/md3 of /dev/null bs 1M Was curious if when running 10 DD's (which are writing to the RAID 5) fine, no issues, suddenly all go into D-state and let the read/give it 100% priority? Is this normal? # du -sb . ; sleep 300; du -sb . 1115590287487 . 1115590287487 . Here my my raid5 config: # mdadm -D /dev/md3 /dev/md3: Version : 00.90.03 Creation Time : Sun Dec 2 12:15:20 2007 Raid Level : raid5 Array Size : 1465143296 (1397.27 GiB 1500.31 GB) Used Dev Size : 732571648 (698.63 GiB 750.15 GB) Raid Devices : 3 Total Devices : 3 Preferred Minor : 3 Persistence : Superblock is persistent Update Time : Sun Dec 2 22:00:54 2007 State : active Active Devices : 3 Working Devices : 3 Failed Devices : 0 Spare Devices : 0 Layout : left-symmetric Chunk Size : 1024K UUID : fea48e85:ddd2c33f:d19da839:74e9c858 (local to host box1) Events : 0.15 Number Major Minor RaidDevice State 0 8 330 active sync /dev/sdc1 1 8 491 active sync /dev/sdd1 2 8 652 active sync /dev/sde1 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
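As a sanity check, the reported Array Size is self-consistent with RAID5 geometry: a 3-disk RAID5 keeps (n-1) data disks' worth of space, with one disk's worth spread across the array as parity:

```shell
# Figures from the mdadm -D output above, in KiB.
raid_devices=3
used_dev_kib=732571648

# RAID5 usable space = (n - 1) * per-device size.
array_kib=$(( (raid_devices - 1) * used_dev_kib ))
echo "$array_kib"   # matches the reported Array Size of 1465143296
```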
Re: Spontaneous rebuild
On Mon, 3 Dec 2007, Neil Brown wrote: On Sunday December 2, [EMAIL PROTECTED] wrote: Anyway, the problems are back: To test my theory that everything is alright with the CPU running within its specs, I removed one of the drives while copying some large files yesterday. Initially, everything seemed to work out nicely, and by the morning, the rebuild had finished. Again, I unmounted the filesystem and ran badblocks -svn on the LVM. It ran without gripes for some hours, but just now I saw md had started to rebuild the array again out of the blue: Dec 1 20:04:49 quassel kernel: usb 4-5.2: reset high speed USB device using ehci_hcd and address 4 Dec 2 01:06:02 quassel kernel: md: data-check of RAID array md0 ^^ Dec 2 01:06:02 quassel kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk. Dec 2 01:06:02 quassel kernel: md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for data-check. ^^ Dec 2 01:06:02 quassel kernel: md: using 128k window, over a total of 488383936 blocks. Dec 2 03:57:24 quassel kernel: usb 4-5.2: reset high speed USB device using ehci_hcd and address 4 This isn't a resync, it is a data check. Dec 2 is the first Sunday of the month. You probably have a crontab entries that does echo check /sys/block/mdX/md/sync_action early on the first Sunday of the month. I know that Debian does this. It is good to do this occasionally to catch sleeping bad blocks. While we are on the subject of bad blocks, is it possible to do what 3ware raid controllers do without an external card? They know when a block is bad and they remap it to another part of the array etc, where as with software raid you never know this is happening until the disk is dead. For example with 3dm2 it notifies you if you have e-mail alerts set to 2 (warn) it will e-mail you every time there is a sector re-allocation, is this possible with software raid or does it *require* HW raid/external controller? Justin. 
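Neil's explanation is easy to verify or trigger by hand: the monthly data-check is nothing more than a write to each array's sync_action file, which is essentially what Debian's cron job does on the first Sunday of the month. A minimal sketch (`SYSFS_ROOT` is parameterized only so it can be exercised against a scratch tree):

```shell
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Kick off a read-only consistency check of one md array; progress then
# shows up in /proc/mdstat, and any inconsistencies are counted in
# .../md/mismatch_cnt when the check finishes.
start_check() {
    echo check > "$SYSFS_ROOT/block/$1/md/sync_action"
}
```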
Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?
Quick question, Set up a new machine last night with two raptor 150 disks. Set up RAID1 as I do everywhere else, 0.90.03 superblocks (in order to be compatible with LILO; if you use 1.x superblocks with LILO you can't boot), and then: /dev/sda1+sdb1 - /dev/md0 - swap /dev/sda2+sdb2 - /dev/md1 - /boot (ext3) /dev/sda3+sdb3 - /dev/md2 - / (xfs) All works fine, no issues... Quick question though: I turned off the machine, disconnected /dev/sda from the machine, booted from /dev/sdb, no problems, shows as degraded RAID1. Turn the machine off. Re-attach the first drive. When I boot, my first partition either re-synced by itself or it was never degraded; why is this? So two questions: 1) If it rebuilt by itself, how come it only rebuilt /dev/md0? 2) If it did not rebuild, is it because the kernel knows it does not need to re-calculate parity etc. for swap? I had to: mdadm /dev/md1 -a /dev/sda2 and mdadm /dev/md2 -a /dev/sda3 to rebuild /boot and /, which worked fine. I am just curious why it works like this; I figured it would be all or nothing. More info: Not using ANY initramfs/initrd images, everything is compiled into 1 kernel image (makes things MUCH simpler and the expected device layout etc. is always the same, unlike initrd/etc). Justin.
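The manual re-adds above are easy to script: which arrays are degraded can be read straight out of /proc/mdstat, where a missing member shows up as an underscore in the status field. A hedged sketch that parses an mdstat-format file (the awk heuristic is an assumption about the usual 2.6-era layout, not an official interface):

```shell
# Print the names of degraded md arrays from an mdstat-format file.
# A healthy 2-disk mirror shows "[2/2] [UU]"; a degraded one "[2/1] [U_]".
degraded_arrays() {
    awk '/^md/ { name = $1 }
         /\[[0-9]+\/[0-9]+\]/ { if ($0 ~ /_/) print name }' "$1"
}

# On a live system:  degraded_arrays /proc/mdstat
# Each reported array then needs its missing partition re-added, e.g.:
#   mdadm /dev/md1 -a /dev/sda2
```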
Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
I am putting a new machine together and I have dual raptor raid 1 for the root, which works just fine under all stress tests. Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on sale now adays): I ran the following: dd if=/dev/zero of=/dev/sdc dd if=/dev/zero of=/dev/sdd dd if=/dev/zero of=/dev/sde (as it is always a very good idea to do this with any new disk) And sometime along the way(?) (i had gone to sleep and let it run), this occurred: [42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x401 action 0x2 frozen [42880.680231] ata3.00: irq_stat 0x00400040, connection status changed [42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in [42880.680292] res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10 (ATA bus error) [42881.841899] ata3: soft resetting port [42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [42915.919042] ata3.00: qc timeout (cmd 0xec) [42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5) [42915.919149] ata3.00: revalidation failed (errno=-5) [42915.919206] ata3: failed to recover some devices, retrying in 5 secs [42920.912458] ata3: hard resetting port [42926.411363] ata3: port is slow to respond, please be patient (Status 0x80) [42930.943080] ata3: COMRESET failed (errno=-16) [42930.943130] ata3: hard resetting port [42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [42931.413523] ata3.00: configured for UDMA/133 [42931.413586] ata3: EH pending after completion, repeating EH (cnt=4) [42931.413655] ata3: EH complete [42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors (750156 MB) [42931.413809] sd 2:0:0:0: [sdc] Write Protect is off [42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00 [42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Usually when I see this sort of thing with another box I have full of raptors, it was due to a bad raptor and I never saw it 
again after I replaced the disk that it happened on, but that was using the Intel P965 chipset. For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge). I am going to do some further testing but does this indicate a bad drive? Bad cable? Bad connector? As you can see above, /dev/sdc stopped responding for a little bit and then the kernel reset the port. Why is this though? What is the likely root cause? Should I replace the drive? Obviously this is not normal and cannot be good at all, the idea is to put these drives in a RAID5 and if one is going to timeout that is going to cause the array to go degraded and thus be worthless in a raid5 configuration. Can anyone offer any insight here? Thank you, Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
On Sat, 1 Dec 2007, Jan Engelhardt wrote: On Dec 1 2007 06:26, Justin Piszcz wrote: I ran the following: dd if=/dev/zero of=/dev/sdc dd if=/dev/zero of=/dev/sdd dd if=/dev/zero of=/dev/sde (as it is always a very good idea to do this with any new disk) Why would you care about what's on the disk? fdisk, mkfs and the day-to-day operation will overwrite it _anyway_. (If you think the disk is not empty, you should look at it and copy off all usable warez beforehand :-) The purpose is with any new disk its good to write to all the blocks and let the drive to all of the re-mapping before you put 'real' data on it. Let it crap out or fail before I put my data on it. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
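The burn-in Justin describes is just dd across the whole device. Demonstrated here on a scratch file instead of a real disk (the tiny count is purely for illustration; on real hardware the target would be e.g. /dev/sdc, dd would run until the device is full, and you would then check the SMART reallocated-sector count):

```shell
# Stand-in for a disk device so the sketch is safe to run anywhere.
target=$(mktemp)

# Write zeros over every block of the target.
dd if=/dev/zero of="$target" bs=1M count=4 2>/dev/null

# Every block has now been written; on a real disk the drive firmware
# would have remapped any weak sectors during this pass.
wc -c < "$target"
```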
Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?
On Sat, 1 Dec 2007, Jan Engelhardt wrote: On Dec 1 2007 07:12, Justin Piszcz wrote: On Sat, 1 Dec 2007, Jan Engelhardt wrote: On Dec 1 2007 06:19, Justin Piszcz wrote: RAID1, 0.90.03 superblocks (in order to be compatible with LILO, if you use 1.x superblocks with LILO you can't boot) Says who? (Don't use LILO ;-) I like LILO :) LILO cares much less about disk layout / filesystems than GRUB does, so I would have expected LILO to cope with all sorts of superblocks. OTOH I would suspect GRUB to only handle 0.90 and 1.0, where the MDSB is at the end of the disk = the filesystem SB is at the very beginning. So two questions: 1) If it rebuilt by itself, how come it only rebuilt /dev/md0? So md1/md2 was NOT rebuilt? Correct. Well it should, after they are readded using -a. If they still don't, then perhaps another resync is in progress. There was nothing in progress, md0 was synced up and md1,md2 = degraded. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?
On Sat, 1 Dec 2007, Jan Engelhardt wrote: On Dec 1 2007 06:19, Justin Piszcz wrote: RAID1, 0.90.03 superblocks (in order to be compatible with LILO, if you use 1.x superblocks with LILO you can't boot) Says who? (Don't use LILO ;-) I like LILO :) , and then: /dev/sda1+sdb1 - /dev/md0 - swap /dev/sda2+sdb2 - /dev/md1 - /boot (ext3) /dev/sda3+sdb3 - /dev/md2 - / (xfs) All works fine, no issues... Quick question though, I turned off the machine, disconnected /dev/sda from the machine, boot from /dev/sdb, no problems, shows as degraded RAID1. Turn the machine off. Re-attach the first drive. When I boot my first partition either re-synced by itself or it was not degraded, was is this? If md0 was not touched (written to) after you disconnected sda, it also should not be in a degraded state. So two questions: 1) If it rebuilt by itself, how come it only rebuilt /dev/md0? So md1/md2 was NOT rebuilt? Correct. 2) If it did not rebuild, is it because the kernel knows it does not need to re-calculate parity etc for swap? Kernel does not know what's inside an md usually. And it should not try to be smart. Ok. I had to: mdadm /dev/md1 -a /dev/sda2 and mdadm /dev/md2 -a /dev/sda3 To rebuild the /boot and /, which worked fine, I am just curious though why it works like this, I figured it would be all or nothing. Devices are not automatically readded. Who knows, maybe you inserted a different disk into sda which you don't want to be overwritten. Makes sense, I just wanted to confirm that it was normal.. More info: Not using ANY initramfs/initrd images, everything is compiled into 1 kernel image (makes things MUCH simpler and the expected device layout etc is always the same, unlike initrd/etc). My expected device layout is also always the same, _with_ initrd. Why? Simply because mdadm.conf is copied to the initrd, and mdadm will use your defined order. That is another way as well, people seem to be divided. 
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
On Sat, 1 Dec 2007, Janek Kozicki wrote: Justin Piszcz said: (by the date of Sat, 1 Dec 2007 07:23:41 -0500 (EST)) dd if=/dev/zero of=/dev/sdc The purpose is with any new disk it's good to write to all the blocks and let the drive do all of the re-mapping before you put 'real' data on it. Let it crap out or fail before I put my data on it. Better use badblocks. It writes data, then reads it afterwards. In this example the data is semi-random (quicker than /dev/urandom ;) badblocks -c 10240 -s -w -t random -v /dev/sdc -- Janek Kozicki Will give this a shot and see if I can reproduce the error, thanks.
Re: PROBLEM: raid5 hangs
On Wed, 14 Nov 2007, Peter Magnusson wrote: On Wed, 14 Nov 2007, Justin Piszcz wrote: This is a known bug in 2.6.23 and should be fixed in 2.6.23.2 if the RAID5 bio* patches are applied. Ok, good to know. Do you know when it first appeared? It existed in linux-2.6.22.3 also... I am unsure; I and others started noticing it mainly in 2.6.23. Again, not sure, I will let others answer this one. Justin.
Re: PROBLEM: raid5 hangs
On Wed, 14 Nov 2007, Bill Davidsen wrote: Justin Piszcz wrote: This is a known bug in 2.6.23 and should be fixed in 2.6.23.2 if the RAID5 bio* patches are applied. Note below he's running 2.6.22.3 which doesn't have the bug unless -STABLE added it. So should not really be in 2.6.22.anything. I assume you're talking the endless write or bio issue? The bio issue is the root cause of the bug yes? -- I am uncertain but I remember this happening in the past but I thought it was something I was doing (possibly 2.6.23) so it may have been happenign earlier than that but I am not positive. Justin. On Wed, 14 Nov 2007, Peter Magnusson wrote: Hey. [1.] One line summary of the problem: raid5 hangs and use 100% cpu [2.] Full description of the problem/report: I have used 2.6.18 for 284 days or something until my powersupply died, no problem what so ever duing that time. After that forced reboot I did these changes; Put in 2 GB more memory so I have 3 GB instead of 1 GB, two disks in the raid5 got badblocks so I didnt trust them anymore so I bought new disks (I managed to save the raid5). I have 6x300 GB in a raid5. Two of them are now 320 GB so created a small raid1 also. That raid5 is encrypted with aes-cbc-plain. The raid1 is encrypted with aes-cbc-essiv:sha256. I compiled linux-2.6.22.3 and started to use that. I used the same .config as in default FC5, I think i just selected P4 cpu and preemptive kernel type. After 11 or 12 days the computer froze, I wasnt home when it happend and couldnt fix it for like 3 days. It was just to reboot it as it wasnt possible to login remotely or on console. It did respond to ping however. After reboot it rebuilded the raid5. Then it happend again after approx the same time, 11 or 12 days. I noticed that the process md1_raid5 used 100% cpu all the time. After reboot it rebuilded the raid5. I compiled linux-2.6.23. And then... it happend again... After about the same time as before. md1_raid5 used 100% cpu. 
I also noticed that I wasnt able to save anything in my homedir, it froze during save. I could read from it however. My homedir isnt on raid5 but its encrypted. Its not on any disk that has to do with raid. This problem didnt happend when I used 2.6.18. Currently I use 2.6.18 as I kinda need the computer stable. After reboot it rebuilded the raid5. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Thu, 8 Nov 2007, Carlos Carvalho wrote: Jeff Lessem ([EMAIL PROTECTED]) wrote on 6 November 2007 22:00: Dan Williams wrote: The following patch, also attached, cleans up cases where the code looks at sh->ops.pending when it should be looking at the consistent stack-based snapshot of the operations flags. I tried this patch (against a stock 2.6.23), and it did not work for me. Not only did I/O to the affected RAID5 XFS partition stop, but also I/O to all other disks. I was not able to capture any debugging information, but I should be able to do that tomorrow when I can hook a serial console to the machine. I'm not sure if my problem is identical to these others, as mine only seems to manifest with RAID5+XFS. The RAID rebuilds with no problem, and I've not had any problems with RAID5+ext3. Us too! We're stuck trying to build a disk server with several disks in a raid5 array, and the rsync from the old machine stops writing to the new filesystem. It only happens under heavy IO. We can make it lock without rsync, using 8 simultaneous dd's to the array. All IO stops, including the resync after a newly created raid or after an unclean reboot. We could not trigger the problem with ext3 or reiser3; it only happens with xfs. I am including the XFS mailing list as well; can you provide more information to them?
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Thu, 8 Nov 2007, BERTRAND Joël wrote: BERTRAND Joël wrote: Chuck Ebbert wrote: On 11/05/2007 03:36 AM, BERTRAND Joël wrote: Neil Brown wrote: On Sunday November 4, [EMAIL PROTECTED] wrote: # ps auxww | grep D USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 273 0.0 0.0 0 0 ? D Oct21 14:40 [pdflush] root 274 0.0 0.0 0 0 ? D Oct21 13:00 [pdflush] After several days/weeks, this is the second time this has happened: while doing regular file I/O (decompressing a file), everything on the device went into D-state. At a guess (I haven't looked closely) I'd say it is the bug that was meant to be fixed by commit 4ae3f847e49e3787eca91bced31f8fd328d50496 except that patch applied badly and needed to be fixed with the following patch (not in git yet). These have been sent to stable@ and should be in the queue for 2.6.23.2 My linux-2.6.23/drivers/md/raid5.c has contained your patch for a long time: ... spin_lock(&sh->lock); clear_bit(STRIPE_HANDLE, &sh->state); clear_bit(STRIPE_DELAYED, &sh->state); s.syncing = test_bit(STRIPE_SYNCING, &sh->state); s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state); s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state); /* Now to look around and see what can be done */ /* clean-up completed biofill operations */ if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) { clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending); clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack); clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete); } rcu_read_lock(); for (i=disks; i--; ) { mdk_rdev_t *rdev; struct r5dev *dev = &sh->dev[i]; ... but it doesn't fix this bug. Did that chunk starting with /* clean-up completed biofill operations */ end up where it belongs? The patch with the big context moves it to a different place from where the original one puts it when applied to 2.6.23... Lately I've seen several problems where the context isn't enough to make a patch apply properly when some offsets have changed.
In some cases a patch won't apply at all because two nearly-identical areas are being changed and the first chunk gets applied where the second one should, leaving nowhere for the second chunk to apply. I always apply this kind of patch by hand, not with the patch command. The last patch sent here seems to fix this bug: gershwin:[/usr/scripts] cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] md7 : active raid1 sdi1[2] md_d0p1[0] 1464725632 blocks [2/1] [U_] [=====>...............] recovery = 27.1% (396992504/1464725632) finish=1040.3min speed=17104K/sec Resync done. The patch fixes this bug. Regards, JKB Excellent! I cannot easily reproduce the bug on my system, so I will wait for the next stable patch set to include it and let everyone know if it happens again, thanks.
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Tue, 6 Nov 2007, BERTRAND Joël wrote: Done. Here is the obtained output: [ 1265.899068] check 4: state 0x6 toread read write f800fdd4e360 written [ 1265.941328] check 3: state 0x1 toread read write written [ 1265.972129] check 2: state 0x1 toread read write written For information, after the crash, I have: Root poulenc:[/sys/block] cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] md_d0 : active raid5 sdc1[0] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] 1464725760 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU] Regards, JKB After the crash, it is not 'resyncing'? Justin.
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Tue, 6 Nov 2007, BERTRAND Joël wrote: Justin Piszcz wrote: On Tue, 6 Nov 2007, BERTRAND Joël wrote: Done. Here is obtained ouput : [ 1265.899068] check 4: state 0x6 toread read write f800fdd4e360 written [ 1265.941328] check 3: state 0x1 toread read write written [ 1265.972129] check 2: state 0x1 toread read write written For information, after crash, I have : Root poulenc:[/sys/block] cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] md_d0 : active raid5 sdc1[0] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] 1464725760 blocks level 5, 64k chunk, algorithm 2 [6/6] [UU] Regards, JKB After the crash it is not 'resyncing' ? No, it isn't... JKB After any crash/unclean shutdown the RAID should resync, if it doesn't, that's not good, I'd suggest running a raid check. The 'repair' is supposed to clean it, in some cases (md0=swap) it gets dirty again. Tue May 8 09:19:54 EDT 2007: Executing RAID health check for /dev/md0... Tue May 8 09:19:55 EDT 2007: Executing RAID health check for /dev/md1... Tue May 8 09:19:56 EDT 2007: Executing RAID health check for /dev/md2... Tue May 8 09:19:57 EDT 2007: Executing RAID health check for /dev/md3... Tue May 8 10:09:58 EDT 2007: cat /sys/block/md0/md/mismatch_cnt Tue May 8 10:09:58 EDT 2007: 2176 Tue May 8 10:09:58 EDT 2007: cat /sys/block/md1/md/mismatch_cnt Tue May 8 10:09:58 EDT 2007: 0 Tue May 8 10:09:58 EDT 2007: cat /sys/block/md2/md/mismatch_cnt Tue May 8 10:09:58 EDT 2007: 0 Tue May 8 10:09:58 EDT 2007: cat /sys/block/md3/md/mismatch_cnt Tue May 8 10:09:58 EDT 2007: 0 Tue May 8 10:09:58 EDT 2007: The meta-device /dev/md0 has 2176 mismatched sectors. Tue May 8 10:09:58 EDT 2007: Executing repair on /dev/md0 Tue May 8 10:09:59 EDT 2007: The meta-device /dev/md1 has no mismatched sectors. Tue May 8 10:10:00 EDT 2007: The meta-device /dev/md2 has no mismatched sectors. Tue May 8 10:10:01 EDT 2007: The meta-device /dev/md3 has no mismatched sectors. Tue May 8 10:20:02 EDT 2007: All devices are clean... 
Tue May 8 10:20:02 EDT 2007: cat /sys/block/md0/md/mismatch_cnt Tue May 8 10:20:02 EDT 2007: 2176 Tue May 8 10:20:02 EDT 2007: cat /sys/block/md1/md/mismatch_cnt Tue May 8 10:20:02 EDT 2007: 0 Tue May 8 10:20:02 EDT 2007: cat /sys/block/md2/md/mismatch_cnt Tue May 8 10:20:02 EDT 2007: 0 Tue May 8 10:20:02 EDT 2007: cat /sys/block/md3/md/mismatch_cnt Tue May 8 10:20:02 EDT 2007: 0
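The check/repair sequence logged above is easy to script. Below is a minimal sketch of such a scrub helper, not taken from the thread: the SYSFS_ROOT variable, the scrub_array name, and the message wording are my own, and a production version would also poll sync_action until the check returns to "idle" before reading mismatch_cnt.

```shell
#!/bin/sh
# Minimal sketch of an md scrub helper (hypothetical; adjust for your setup).
# SYSFS_ROOT is parameterized so the logic can be exercised against a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/block}"

scrub_array() {
    md="$1"
    dir="$SYSFS_ROOT/$md/md"
    # Trigger a read-only consistency check; the kernel compares
    # mirrors/parity and counts discrepancies in mismatch_cnt.
    echo check > "$dir/sync_action"
    # (A real script would wait here until sync_action reads "idle".)
    count=$(cat "$dir/mismatch_cnt")
    if [ "$count" -ne 0 ]; then
        echo "$md: $count mismatched sectors, consider 'echo repair > sync_action'"
    else
        echo "$md: clean"
    fi
}
```

Run from cron (e.g. monthly) this gives exactly the kind of log shown above.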
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Mon, 5 Nov 2007, Dan Williams wrote: On 11/4/07, Justin Piszcz [EMAIL PROTECTED] wrote: On Mon, 5 Nov 2007, Neil Brown wrote: On Sunday November 4, [EMAIL PROTECTED] wrote: # ps auxww | grep D USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND root 273 0.0 0.0 0 0 ?DOct21 14:40 [pdflush] root 274 0.0 0.0 0 0 ?DOct21 13:00 [pdflush] After several days/weeks, this is the second time this has happened, while doing regular file I/O (decompressing a file), everything on the device went into D-state. At a guess (I haven't looked closely) I'd say it is the bug that was meant to be fixed by commit 4ae3f847e49e3787eca91bced31f8fd328d50496 except that patch applied badly and needed to be fixed with the following patch (not in git yet). These have been sent to stable@ and should be in the queue for 2.6.23.2 Ah, thanks Neil, will be updating as soon as it is released, thanks. Are you seeing the same md thread takes 100% of the CPU that Joël is reporting? Yes, in another e-mail I posted the top output with md3_raid5 at 100%. Justin.
2.6.23.1: mdadm/raid5 hung/d-state
# ps auxww | grep D
USER     PID %CPU %MEM  VSZ  RSS TTY STAT START TIME  COMMAND
root     273  0.0  0.0    0    0  ?   D   Oct21 14:40 [pdflush]
root     274  0.0  0.0    0    0  ?   D   Oct21 13:00 [pdflush]

After several days/weeks, this is the second time this has happened: while doing regular file I/O (decompressing a file), everything on the device went into D-state.

# mdadm -D /dev/md3
/dev/md3:
        Version : 00.90.03
  Creation Time : Wed Aug 22 10:38:53 2007
     Raid Level : raid5
     Array Size : 1318680576 (1257.59 GiB 1350.33 GB)
  Used Dev Size : 146520064 (139.73 GiB 150.04 GB)
   Raid Devices : 10
  Total Devices : 10
Preferred Minor : 3
    Persistence : Superblock is persistent
    Update Time : Sun Nov 4 06:38:29 2007
          State : active
 Active Devices : 10
Working Devices : 10
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 1024K
           UUID : e37a12d1:1b0b989a:083fb634:68e9eb49
         Events : 0.4309

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1
       4       8       97        4      active sync   /dev/sdg1
       5       8      113        5      active sync   /dev/sdh1
       6       8      129        6      active sync   /dev/sdi1
       7       8      145        7      active sync   /dev/sdj1
       8       8      161        8      active sync   /dev/sdk1
       9       8      177        9      active sync   /dev/sdl1

If I wanted to find out what is causing this, what type of debugging would I have to enable to track it down? Any attempt to read/write files on the devices fails (also going into d-state). Is there any useful information I can get currently before rebooting the machine?
# pwd
/sys/block/md3/md
# ls
array_state      dev-sdj1/          rd2@              stripe_cache_active
bitmap_set_bits  dev-sdk1/          rd3@              stripe_cache_size
chunk_size       dev-sdl1/          rd4@              suspend_hi
component_size   layout             rd5@              suspend_lo
dev-sdc1/        level              rd6@              sync_action
dev-sdd1/        metadata_version   rd7@              sync_completed
dev-sde1/        mismatch_cnt       rd8@              sync_speed
dev-sdf1/        new_dev            rd9@              sync_speed_max
dev-sdg1/        raid_disks         reshape_position  sync_speed_min
dev-sdh1/        rd0@               resync_start
dev-sdi1/        rd1@               safe_mode_delay
# cat array_state
active-idle
# cat mismatch_cnt
0
# cat stripe_cache_active
1
# cat stripe_cache_size
16384
# cat sync_action
idle
# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdb2[1] sda2[0]
      136448 blocks [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
      129596288 blocks [2/2] [UU]
md3 : active raid5 sdl1[9] sdk1[8] sdj1[7] sdi1[6] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
      1318680576 blocks level 5, 1024k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
md0 : active raid1 sdb1[1] sda1[0]
      16787776 blocks [2/2] [UU]
unused devices: <none>
#

Justin.
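For the "what debugging can I enable" question above, one low-impact starting point is to list which tasks are stuck in uninterruptible sleep and on what kernel wait channel. A sketch; the helper name and awk filter are mine, not from the thread, and the filter takes stdin so it can be exercised against canned ps output:

```shell
# Sketch: list tasks stuck in uninterruptible sleep (D state) with their
# kernel wait channel. The filter is split out so it can be tested
# against canned `ps` output.
d_state_tasks() {
    # Expects `ps -eo pid,stat,wchan:32,comm`-style lines on stdin;
    # prints pid, command, and wait channel for tasks whose STAT contains D.
    awk 'NR > 1 && $2 ~ /D/ { print $1, $4, $3 }'
}

# Typical use (also: `echo w > /proc/sysrq-trigger` dumps blocked tasks
# with stack traces to the kernel log, if SysRq is enabled):
# ps -eo pid,stat,wchan:32,comm | d_state_tasks
```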
Re: 2.6.23.1: mdadm/raid5 hung/d-state (md3_raid5 stuck in endless loop?)
: 68 high: 186 batch: 31 cpu: 1 pcp: 1 count: 9 high: 62 batch: 15 vm stats threshold: 42 cpu: 2 pcp: 0 count: 79 high: 186 batch: 31 cpu: 2 pcp: 1 count: 10 high: 62 batch: 15 vm stats threshold: 42 cpu: 3 pcp: 0 count: 47 high: 186 batch: 31 cpu: 3 pcp: 1 count: 60 high: 62 batch: 15 vm stats threshold: 42 all_unreclaimable: 0 prev_priority: 12 start_pfn: 1048576 On Sun, 4 Nov 2007, Justin Piszcz wrote: # ps auxww | grep D USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND root 273 0.0 0.0 0 0 ?DOct21 14:40 [pdflush] root 274 0.0 0.0 0 0 ?DOct21 13:00 [pdflush] After several days/weeks, this is the second time this has happened, while doing regular file I/O (decompressing a file), everything on the device went into D-state. # mdadm -D /dev/md3 /dev/md3: Version : 00.90.03 Creation Time : Wed Aug 22 10:38:53 2007 Raid Level : raid5 Array Size : 1318680576 (1257.59 GiB 1350.33 GB) Used Dev Size : 146520064 (139.73 GiB 150.04 GB) Raid Devices : 10 Total Devices : 10 Preferred Minor : 3 Persistence : Superblock is persistent Update Time : Sun Nov 4 06:38:29 2007 State : active Active Devices : 10 Working Devices : 10 Failed Devices : 0 Spare Devices : 0 Layout : left-symmetric Chunk Size : 1024K UUID : e37a12d1:1b0b989a:083fb634:68e9eb49 Events : 0.4309 Number Major Minor RaidDevice State 0 8 330 active sync /dev/sdc1 1 8 491 active sync /dev/sdd1 2 8 652 active sync /dev/sde1 3 8 813 active sync /dev/sdf1 4 8 974 active sync /dev/sdg1 5 8 1135 active sync /dev/sdh1 6 8 1296 active sync /dev/sdi1 7 8 1457 active sync /dev/sdj1 8 8 1618 active sync /dev/sdk1 9 8 1779 active sync /dev/sdl1 If I wanted to find out what is causing this, what type of debugging would I have to enable to track it down? Any attempt to read/write files on the devices fails (also going into d-state). Is there any useful information I can get currently before rebooting the machine? 
# pwd /sys/block/md3/md # ls array_state dev-sdj1/ rd2@ stripe_cache_active bitmap_set_bits dev-sdk1/ rd3@ stripe_cache_size chunk_size dev-sdl1/ rd4@ suspend_hi component_size layoutrd5@ suspend_lo dev-sdc1/level rd6@ sync_action dev-sdd1/metadata_version rd7@ sync_completed dev-sde1/mismatch_cnt rd8@ sync_speed dev-sdf1/new_dev rd9@ sync_speed_max dev-sdg1/raid_disksreshape_position sync_speed_min dev-sdh1/rd0@ resync_start dev-sdi1/rd1@ safe_mode_delay # cat array_state active-idle # cat mismatch_cnt 0 # cat stripe_cache_active 1 # cat stripe_cache_size 16384 # cat sync_action idle # cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] md1 : active raid1 sdb2[1] sda2[0] 136448 blocks [2/2] [UU] md2 : active raid1 sdb3[1] sda3[0] 129596288 blocks [2/2] [UU] md3 : active raid5 sdl1[9] sdk1[8] sdj1[7] sdi1[6] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0] 1318680576 blocks level 5, 1024k chunk, algorithm 2 [10/10] [UU] md0 : active raid1 sdb1[1] sda1[0] 16787776 blocks [2/2] [UU] unused devices: none # Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Sun, 4 Nov 2007, BERTRAND Joël wrote: Justin Piszcz wrote: # ps auxww | grep D USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 273 0.0 0.0 0 0 ? D Oct21 14:40 [pdflush] root 274 0.0 0.0 0 0 ? D Oct21 13:00 [pdflush] After several days/weeks, this is the second time this has happened: while doing regular file I/O (decompressing a file), everything on the device went into D-state. Same observation here (kernel 2.6.23). I can see this bug when I try to synchronize a raid1 volume over iSCSI (each element is a raid5 volume), or sometimes only with a 1.5 TB raid5 volume. When this bug occurs, the md subsystem eats 100% of one CPU and pdflush remains in D state too. What is your architecture? I use two 32-thread T1000s (sparc64), and I'm trying to determine if this bug is arch specific. Regards, JKB Using x86_64 here (Q6600/Intel DG965WH). Justin.
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Mon, 5 Nov 2007, Neil Brown wrote: On Sunday November 4, [EMAIL PROTECTED] wrote: # ps auxww | grep D USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND root 273 0.0 0.0 0 0 ?DOct21 14:40 [pdflush] root 274 0.0 0.0 0 0 ?DOct21 13:00 [pdflush] After several days/weeks, this is the second time this has happened, while doing regular file I/O (decompressing a file), everything on the device went into D-state. At a guess (I haven't looked closely) I'd say it is the bug that was meant to be fixed by commit 4ae3f847e49e3787eca91bced31f8fd328d50496 except that patch applied badly and needed to be fixed with the following patch (not in git yet). These have been sent to stable@ and should be in the queue for 2.6.23.2 Ah, thanks Neil, will be updating as soon as it is released, thanks. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID1 resync and read errors (loop)
On Fri, 26 Oct 2007, Filippo Carletti wrote: Is there a way to control an array resync process? In particular, is it possible to skip read errors? My setup: an LVM2 Physical Volume over a two-disk MD RAID1 array; the Logical Volumes didn't span the whole PV, leaving some PEs free at the end of the disks. What happened: disk1 broke. I installed a new disk1 and started a sync from disk2 to disk1, but at 99.9% disk2 gave some read errors and the sync process started again, over and over. I didn't notice the errors on disk2 because they were in unallocated PEs at the end of the disk: the MD device spans the whole disk, while the LVs don't. I'd like to complete the sync ignoring read errors, then replace disk2. I think this is a not-so-uncommon situation; leaving some PEs free for future expansion is a good idea, and errors go undetected until you use those free areas. Thanks. -- Ciao, Filippo This is why you should scrub your RAID. Justin.
Re: Software RAID when it works and when it doesn't
On Fri, 26 Oct 2007, Goswin von Brederlow wrote: Justin Piszcz [EMAIL PROTECTED] writes: On Fri, 19 Oct 2007, Alberto Alonso wrote: On Thu, 2007-10-18 at 17:26 +0200, Goswin von Brederlow wrote: Mike Accetta [EMAIL PROTECTED] writes: What I would like to see is a timeout-driven fallback mechanism. If one mirror does not return the requested data within a certain time (say 1 second), then the request should be duplicated on the other mirror. If the first mirror later unchokes, it remains in the raid; if it fails, it gets removed. But (at least reads) should not have to wait for that process. Even better would be if some write delay could also be used. The still-working mirror would get an increase in its serial (so on reboot you know one disk is newer). If the choking mirror unchokes, it can write back all the delayed data and also increase its serial to match. Otherwise it gets really failed. But you might have to use bitmaps for this, or the cache size would limit its usefulness. MfG Goswin I think a timeout on both reads and writes is a must. Basically I believe that all the problems I've encountered using software raid would have been resolved by a timeout within the md code. This would keep a server from crashing/hanging when the underlying driver doesn't properly handle hard drive problems. MD can be smarter than the dumb drivers. Just my thoughts though, as I've never gotten an answer as to whether or not md can implement its own timeouts. Alberto I have a question about remapping sectors: can software raid be as efficient or as good at remapping bad sectors as an external raid controller, for, e.g., raid 10 or raid5? Justin. Software raid makes no remapping of bad sectors at all. It assumes the disks will do sufficient remapping.
MfG Goswin Thanks, this is what I was looking for. Justin.
Re: Test
Success. On Thu, 25 Oct 2007, Daniel L. Miller wrote: Sorry for consuming bandwidth - but all of a sudden I'm not seeing messages. Is this going through? -- Daniel
Re: Test 2
Success 2. On Thu, 25 Oct 2007, Daniel L. Miller wrote: Thanks for the test responses - I have re-subscribed...if I see this myself...I'm back! -- Daniel
Re: flaky controller or disk error?
On Mon, 22 Oct 2007, Louis-David Mitterrand wrote: Hi, [using kernel 2.6.23 and mdadm 2.6.3+20070929] I have a rather flaky sata controller with which I am trying to resync a raid5 array. It usually starts failing after 40% of the resync is done. Short of changing the controller (which I will do later this week), is there a way to have mdadm resume the resync where it left off at reboot time? Here is the error I am seeing in the syslog. Can this actually be a disk error? Oct 18 11:54:34 sylla kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x1 action 0x2 frozen Oct 18 11:54:34 sylla kernel: ata1.00: irq_stat 0x0040, PHY RDY changed Oct 18 11:54:34 sylla kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 Oct 18 11:54:34 sylla kernel: res 40/00:00:19:26:33/00:00:3a:00:00/40 Emask 0x10 (ATA bus error) Oct 18 11:54:35 sylla kernel: ata1: soft resetting port Oct 18 11:54:40 sylla kernel: ata1: failed to reset engine (errno=-95) Oct 18 11:54:40 sylla kernel: ata1: port is slow to respond, please be patient (Status 0xd0) Oct 18 11:54:45 sylla kernel: ata1: softreset failed (device not ready) Oct 18 11:54:45 sylla kernel: ata1: hard resetting port Oct 18 11:54:46 sylla kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Oct 18 11:54:46 sylla kernel: ata1.00: configured for UDMA/133 Oct 18 11:54:46 sylla kernel: ata1: EH complete Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB) Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] Write Protect is off Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Thanks, I've seen something similar; it turned out to be a bad disk.
I've also seen it when the cable was loose. Justin.
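When a bad disk or loose cable is suspected, the drive's SMART counters are worth checking before swapping hardware. Below is a sketch that pulls the reallocated-sector count out of `smartctl -A` output; the helper name is mine, and the field positions assume smartmontools' usual attribute table layout:

```shell
# Sketch: extract the raw Reallocated_Sector_Ct value from `smartctl -A`
# output. In the attribute table, field 2 is the attribute name and the
# last field is the raw value.
realloc_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $NF }'
}

# Typical use:
# smartctl -A /dev/sda | realloc_count
```

A rising count (or a growing Current_Pending_Sector) points at the disk; bus errors with clean SMART counters point more toward the cable or controller.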
Re: slow raid5 performance
On Tue, 23 Oct 2007, Richard Scobie wrote: Peter wrote: Thanks Justin, good to hear about some real world experience. Hi Peter, I recently built a 3 drive RAID5 using the onboard SATA controllers on an MCP55 based board and get around 115MB/s write and 141MB/s read. A fourth drive was added some time later and after growing the array and filesystem (XFS), saw 160MB/s write and 178MB/s read, with the array 60% full. Regards, Richard - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Yes, your chipset must be PCI-e based and not PCI. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
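A quick way to see whether the PCI bus (rather than the disks) is the bottleneck is a streaming dd test against the array. A rough sketch; the function name is mine, conv=fdatasync assumes GNU dd, and the path/size arguments are placeholders:

```shell
# Sketch: crude sequential write/read benchmark. fdatasync on the write
# makes the reported rate reflect the disks rather than the page cache.
dd_bench() {
    file="$1"; mb="$2"
    dd if=/dev/zero of="$file" bs=1M count="$mb" conv=fdatasync 2>&1 | tail -n 1
    # For an honest read number, drop caches first (needs root):
    # echo 3 > /proc/sys/vm/drop_caches
    dd if="$file" of=/dev/null bs=1M 2>&1 | tail -n 1
    rm -f "$file"
}

# Typical use against the array mount point:
# dd_bench /mnt/raid/ddtest.bin 1024
```

On a 32-bit/33 MHz PCI bus the write rate will plateau well under the ~133 MB/s bus limit; on PCIe-attached controllers the numbers above (115-178 MB/s) are achievable.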
Re: Software RAID when it works and when it doesn't
On Sat, 20 Oct 2007, Michael Tokarev wrote: There was an idea some years ago about having an additional layer on between a block device and whatever else is above it (filesystem or something else), that will just do bad block remapping. Maybe it was even implemented in LVM or IBM-proposed EVMS (the version that included in-kernel stuff too, not only the userspace management), but I don't remember details anymore. In any case, - but again, if memory serves me right, -- there was low interest in that because of exactly this -- drives are now more intelligent, there's hardly a notion of bad block anymore, at least persistent bad block, -- at least visible to the upper layers. /mjt When I run 3dm2 (3ware 3dm2/tools/daemon) I often see LBA remapped sector, success, etc.. My question is, how come I do not see this with mdadm/software raid? Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, Doug Ledford wrote: On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote: I'm sure an internal bitmap would. On RAID1 arrays, reads/writes are never split up by a chunk size for stripes. A 2mb read is a single read, whereas on a raid4/5/6 array, a 2mb read will end up hitting a series of stripes across all disks. That means that on raid1 arrays, total disk seeks roughly equal total reads/writes, whereas on a raid4/5/6, total disk seeks usually exceed total reads/writes. That in turn implies that in a raid1 setup, disk seek time is important to performance, but not necessarily paramount. For raid456, disk seek time is paramount because of how many more seeks that format uses. When you then use an internal bitmap, you are adding writes to every member of the raid456 array, which adds more seeks. The same is true for raid1, but since raid1 doesn't have the same level of dependency on seek rates that raid456 has, it doesn't show the same performance hit that raid456 does. Got it, so for RAID1 it would make sense if LILO supported it (the later versions of the md superblock) Lilo doesn't know anything about the superblock format; however, lilo expects the raid1 device to start at the beginning of the physical partition. In other words, format 1.0 would work with lilo. Did not work when I tried 1.x with LILO; I switched back to 00.90.03 and it worked fine. (for those who use LILO) but for RAID4/5/6, keep the bitmaps away :) I still use an internal bitmap regardless ;-) To help mitigate the cost of seeks on raid456, you can specify a huge chunk size (like 256k to 2MB or somewhere in that range). As long as you can get 90%+ of your reads/writes to fall into the space of a single chunk, then you start performing more like a raid1 device without the extra seek overhead. Of course, this comes at the expense of peak throughput on the device. Let's say you were building a mondo movie server, where you were streaming out digital movie files.
In that case, you very well may care more about throughput than seek performance since I suspect you wouldn't have many small, random reads. Then I would use a small chunk size, sacrifice the seek performance, and get the throughput bonus of parallel reads from the same stripe on multiple disks. On the other hand, if I was setting up a mail server then I would go with a large chunk size because the filesystem activities themselves are going to produce lots of random seeks, and you don't want your raid setup to make that problem worse. Plus, most mail doesn't come in or go out at any sort of massive streaming speed, so you don't need the paralllel reads from multiple disks to perform well. It all depends on your particular use scenario. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
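Doug's rule of thumb above, getting 90%+ of requests to fall inside a single chunk, can be checked with simple arithmetic: a request touches only one data disk if its first and last byte land in the same chunk. A small sketch (my own helper, not from the thread):

```shell
# Sketch: does a request of LEN KiB starting at OFFSET KiB stay inside
# one chunk of CHUNK KiB? If the first and last byte fall in the same
# chunk, only one data disk is seeked (plus parity disks on writes).
fits_one_chunk() {
    chunk_kb=$1; offset_kb=$2; len_kb=$3
    first=$(( offset_kb / chunk_kb ))
    last=$(( (offset_kb + len_kb - 1) / chunk_kb ))
    [ "$first" -eq "$last" ]
}

# e.g. with a 1024 KiB chunk, a 64 KiB read at offset 512 KiB stays in
# chunk 0, while a 2048 KiB read at any offset must cross chunks.
```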
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, Doug Ledford wrote: On Fri, 2007-10-19 at 12:45 -0400, Justin Piszcz wrote: On Fri, 19 Oct 2007, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin Is a bitmap created by default with 1.x? I remember seeing Justin reports of 15-30% performance degradation using a bitmap on a Justin RAID5 with 1.x. Not according to the mdadm man page. I'd probably give up that performance if it meant that re-syncing an array went much faster after a crash. I certainly use it on my RAID1 setup on my home machine. John The performance AFTER a crash yes, but in general usage I remember seeing someone here doing benchmarks it had a negative affect on performance. I'm sure an internal bitmap would. On RAID1 arrays, reads/writes are never split up by a chunk size for stripes. A 2mb read is a single read, where as on a raid4/5/6 array, a 2mb read will end up hitting a series of stripes across all disks. That means that on raid1 arrays, total disk seeks total reads/writes, where as on a raid4/5/6, total disk seeks is usually total reads/writes. That in turn implies that in a raid1 setup, disk seek time is important to performance, but not necessarily paramount. For raid456, disk seek time is paramount because of how many more seeks that format uses. When you then use an internal bitmap, you are adding writes to every member of the raid456 array, which adds more seeks. The same is true for raid1, but since raid1 doesn't have the same level of dependency on seek rates that raid456 has, it doesn't show the same performance hit that raid456 does. Justin. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband Got it, so for RAID1 it would make sense if LILO supported it (the later versions of the md superblock) (for those who use LILO) but for RAID4/5/6, keep the bitmaps away :) Justin. 
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, John Stoffel wrote: Doug == Doug Ledford [EMAIL PROTECTED] writes: Doug On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? Doug 1.0, 1.1, and 1.2 are the same format, just in different positions on Doug the disk. Of the three, the 1.1 format is the safest to use since it Doug won't allow you to accidentally have some sort of metadata between the Doug beginning of the disk and the raid superblock (such as an lvm2 Doug superblock), and hence whenever the raid array isn't up, you won't be Doug able to accidentally mount the lvm2 volumes, filesystem, etc. (In worse Doug case situations, I've seen lvm2 find a superblock on one RAID1 array Doug member when the RAID1 array was down, the system came up, you used the Doug system, the two copies of the raid array were made drastically Doug inconsistent, then at the next reboot, the situation that prevented the Doug RAID1 from starting was resolved, and it never know it failed to start Doug last time, and the two inconsistent members we put back into a clean Doug array). So, deprecating any of these is not really helpful. And you Doug need to keep the old 0.90 format around for back compatibility with Doug thousands of existing raid arrays. This is a great case for making the 1.1 format be the default. So what are the advantages of the 1.0 and 1.2 formats then? Or should be we thinking about making two copies of the data on each RAID member, one at the beginning and one at the end, for resiliency? I just hate seeing this in the mag page: Declare the style of superblock (raid metadata) to be used. The default is 0.90 for --create, and to guess for other operations. 
The default can be overridden by setting the metadata value for the CREATE keyword in mdadm.conf. Options are: 0, 0.90, default Use the original 0.90 format superblock. This format limits arrays to 28 component devices and limits component devices of levels 1 and greater to 2 terabytes. 1, 1.0, 1.1, 1.2 Use the new version-1 format superblock. This has few restrictions. The different sub-versions store the superblock at different locations on the device, either at the end (for 1.0), at the start (for 1.1) or 4K from the start (for 1.2). It looks to me like the 1.1 format, combined with the 1.0, should be what we use, with the 1.2 format nuked. Maybe call it 1.3? *grin* So at this point I'm not arguing to get rid of the 0.9 format, though I think it should NOT be the default any more; we should be using the 1.1 combined with 1.0 format. Is a bitmap created by default with 1.x? I remember seeing reports of 15-30% performance degradation using a bitmap on a RAID5 with 1.x. John - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
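The superblock placement options quoted above can be selected explicitly at creation time rather than relying on the default. A minimal sketch (the device names /dev/md0 and /dev/sd[bcd]1 are hypothetical):

```shell
# Create a RAID5 array with a version-1.1 superblock (stored at the
# start of each member) instead of the 0.90 default discussed above.
mdadm --create /dev/md0 --metadata=1.1 --level=5 --raid-devices=3 \
      /dev/sdb1 /dev/sdc1 /dev/sdd1

# Check which superblock version an existing member carries.
mdadm --examine /dev/sdb1 | grep -i version
```

The same choice can be made permanent with a `CREATE metadata=1.1` line in mdadm.conf, as the man page excerpt notes.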
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, Doug Ledford wrote: On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? 1.0, 1.1, and 1.2 are the same format, just in different positions on the disk. Of the three, the 1.1 format is the safest to use since it won't allow you to accidentally have some sort of metadata between the beginning of the disk and the raid superblock (such as an lvm2 superblock), and hence whenever the raid array isn't up, you won't be able to accidentally mount the lvm2 volumes, filesystem, etc. (In worst-case situations, I've seen lvm2 find a superblock on one RAID1 array member when the RAID1 array was down, the system came up, you used the system, the two copies of the raid array were made drastically inconsistent, then at the next reboot, the situation that prevented the RAID1 from starting was resolved, and it never knew it failed to start last time, and the two inconsistent members were put back into a clean array). So, deprecating any of these is not really helpful. And you need to keep the old 0.90 format around for backward compatibility with thousands of existing raid arrays. Agreed; what is the benefit of deprecating them? Is there that much old code relying on them? It's certainly easy enough to change mdadm to default to the 1.2 format and to require a --force switch to allow use of the older formats. I keep seeing that we support these old formats, and it's never been clear to me why we have four different ones available? Why can't we start defining the canonical format for Linux RAID metadata?
Thanks, John [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Justin I hope 00.90.03 is not deprecated, LILO cannot boot off of Justin anything else! Are you sure? I find that GRUB is much easier to use and set up than LILO these days. But hey, just dropping down to support 00.90.03 and 1.2 formats would be fine too. Let's just lessen the confusion if at all possible. John - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? It's certainly easy enough to change mdadm to default to the 1.2 format and to require a --force switch to allow use of the older formats. I keep seeing that we support these old formats, and it's never been clear to me why we have four different ones available? Why can't we start defining the canonical format for Linux RAID metadata? Thanks, John [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Justin I hope 00.90.03 is not deprecated, LILO cannot boot off of Justin anything else! Are you sure? I find that GRUB is much easier to use and set up than LILO these days. But hey, just dropping down to support 00.90.03 and 1.2 formats would be fine too. Let's just lessen the confusion if at all possible. John I am sure, I submitted a bug report to the LILO developer; he acknowledged the bug but I don't know if it was fixed. I have not tried GRUB with a RAID1 setup yet. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? It's certainly easy enough to change mdadm to default to the 1.2 format and to require a --force switch to allow use of the older formats. I keep seeing that we support these old formats, and it's never been clear to me why we have four different ones available? Why can't we start defining the canonical format for Linux RAID metadata? Thanks, John [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html I hope 00.90.03 is not deprecated, LILO cannot boot off of anything else! Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Software RAID when it works and when it doesn't
On Fri, 19 Oct 2007, Alberto Alonso wrote: On Thu, 2007-10-18 at 17:26 +0200, Goswin von Brederlow wrote: Mike Accetta [EMAIL PROTECTED] writes: What I would like to see is a timeout-driven fallback mechanism. If one mirror does not return the requested data within a certain time (say 1 second) then the request should be duplicated on the other mirror. If the first mirror later unchokes then it remains in the raid; if it fails it gets removed. But (at least reads) should not have to wait for that process. Even better would be if some write delay could also be used. The still-working mirror would get an increase in its serial (so on reboot you know one disk is newer). If the choking mirror unchokes then it can write back all the delayed data and also increase its serial to match. Otherwise it gets really failed. But you might have to use bitmaps for this or the cache size would limit its usefulness. MfG Goswin I think a timeout on both reads and writes is a must. Basically I believe that all the problems I've encountered using software RAID would have been resolved by a timeout within the md code. This will keep a server from crashing/hanging when the underlying driver doesn't properly handle hard drive problems. MD can be smarter than the dumb drivers. Just my thoughts though, as I've never got an answer as to whether or not md can implement its own timeouts. Alberto - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html I have a question about re-mapping sectors: can software RAID be as efficient or as good at remapping bad sectors as an external RAID controller for, e.g., RAID 10 or RAID 5? Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin Is a bitmap created by default with 1.x? I remember seeing Justin reports of 15-30% performance degradation using a bitmap on a Justin RAID5 with 1.x. Not according to the mdadm man page. I'd probably give up that performance if it meant that re-syncing an array went much faster after a crash. I certainly use it on my RAID1 setup on my home machine. John The performance AFTER a crash, yes, but in general usage I remember someone here posting benchmarks showing it had a negative effect on performance. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
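For anyone who wants to measure the trade-off discussed above themselves, a write-intent bitmap can be added to or removed from an existing array after creation; a minimal sketch (assuming a hypothetical /dev/md0):

```shell
# Add an internal write-intent bitmap: much faster resync after a
# crash or unclean shutdown, at some cost to steady-state writes.
mdadm --grow /dev/md0 --bitmap=internal

# Remove it again if ordinary write performance matters more.
mdadm --grow /dev/md0 --bitmap=none

# Confirm the current bitmap state.
mdadm --detail /dev/md0 | grep -i bitmap
```

This makes it easy to benchmark the same array with and without a bitmap rather than committing at creation time.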
Re: experiences with raid5: stripe_queue patches
On Mon, 15 Oct 2007, Bernd Schubert wrote: Hi, in order to tune raid performance I did some benchmarks with and without the stripe queue patches. 2.6.22 is only for comparison to rule out other effects, e.g. the new scheduler, etc. It seems there is a regression with these patches regarding the re-write performance; as you can see, it's almost 50% of what it should be. write re-write read re-read 480844.26 448723.48 707927.55 706075.02 (2.6.22 w/o SQ patches) 487069.47 232574.30 709038.28 707595.09 (2.6.23 with SQ patches) 469865.75 438649.88 711211.92 703229.00 (2.6.23 without SQ patches) Benchmark details: 3xraid5 over 4 partitions of the very same hardware raid (in the end that's raid65: raid6 in hardware and raid5 in software; we need to do that). chunk size: 8192 stripe_cache_size: 8192 each readahead of the md*: 65535 (well, actually it limits itself to 65528) readahead of the underlying partitions: 16384 filesystem: xfs Testsystem: 2 x Quadcore Xeon 1.86 GHz (E5320) An interesting effect to notice: Without these patches the pdflush daemons will take a lot of CPU time; with these patches, pdflush almost doesn't appear in the 'top' list. Actually we would prefer one single raid5 array, but then one single raid5 thread will run with 100% CPU time leaving 7 CPUs in idle state, the status of the hardware raid says its utilization is only at about 50% and we only see writes at about 200 MB/s. On the contrary, with 3 different software raid5 sets the i/o to the hardware raid systems is the bottleneck. Is there any chance to parallelize the raid5 code? I think almost everything is done in raid5.c make_request(), but the main loop there is spin_locked by prepare_to_wait(). Would it be possible not to lock this entire loop?
Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Excellent questions I look forward to reading this thread :) Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
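For anyone reproducing Bernd's setup, the tunables he lists map onto standard sysfs and blockdev knobs; a sketch using his values (the md0 device name is hypothetical):

```shell
# Raise the raid5 stripe cache; the value is a number of cache
# entries per array, set through the md sysfs directory.
echo 8192 > /sys/block/md0/md/stripe_cache_size

# Set readahead on the md device; blockdev takes 512-byte sectors.
# (As noted above, md clamps 65535 down to 65528.)
blockdev --setra 65535 /dev/md0
blockdev --getra /dev/md0
```

Both settings are runtime-only and revert at reboot unless reapplied from an init script.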
Re: RAID 5: weird size results after Grow
On Sat, 13 Oct 2007, Marko Berg wrote: Bill Davidsen wrote: Marko Berg wrote: I added a fourth drive to a RAID 5 array. After some complications related to adding a new HD controller at the same time, and thus changing some device names, I re-created the array and got it working (in the sense that nothing was degraded). But the size results are weird. Each component partition is 320 G; does anyone have an explanation for the Used Dev Size field value below? The 960 G total size is as it should be, but in practice Linux reports the array only having 625,019,608 blocks. I don't see that number below, what command reported this? For instance df: $ df Filesystem 1K-blocks Used Available Use% Mounted on /dev/md0 625019608 358223356 235539408 61% /usr/pub How can this be, even though the array should be clean with 4 active devices? $ mdadm -D /dev/md0 /dev/md0: Version : 01.02.03 Creation Time : Sat Oct 13 01:25:26 2007 Raid Level : raid5 Array Size : 937705344 (894.27 GiB 960.21 GB) Used Dev Size : 625136896 (298.09 GiB 320.07 GB) Raid Devices : 4 Total Devices : 4 Preferred Minor : 0 Persistence : Superblock is persistent Update Time : Sat Oct 13 05:11:38 2007 State : clean Active Devices : 4 Working Devices : 4 Failed Devices : 0 Spare Devices : 0 Layout : left-symmetric Chunk Size : 64K Name : 0 UUID : 9bf903f8:7fc9eec1:2ff25011:37e9607b Events : 2 Number Major Minor RaidDevice State 0 253 2 0 active sync /dev/VolGroup01/LogVol02 1 8 33 1 active sync /dev/sdc1 2 8 49 2 active sync /dev/sdd1 3 8 17 3 active sync /dev/sdb1 Results for mdadm -E partition on all devices appear like this one, with positions changed: $ mdadm -E /dev/sdc1 /dev/sdc1: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 9bf903f8:7fc9eec1:2ff25011:37e9607b Name : 0 Creation Time : Sat Oct 13 01:25:26 2007 Raid Level : raid5 Raid Devices : 4 Used Dev Size : 625137010 (298.09 GiB 320.07 GB) Array Size : 1875410688 (894.27 GiB 960.21 GB) Used Size : 625136896 (298.09 GiB 320.07 GB) Data Offset : 272
sectors Super Offset : 8 sectors State : clean Device UUID : 9b2037fb:231a8ebf:1aaa5577:140795cc Update Time : Sat Oct 13 10:56:02 2007 Checksum : c729f5a1 - correct Events : 2 Layout : left-symmetric Chunk Size : 64K Array Slot : 1 (0, 1, 2, 3) Array State : uUuu Particularly, Used Dev Size and Used Size report an amount twice the size of the partition (and device). Array Size is here twice the actual size, even though shown correctly within parentheses. Sectors are 512 bytes. So Used Dev Size above uses sector size, while Array Size uses 1k blocks? I'm pretty sure, though, that previously Used Dev Size was in 1k blocks too. That's also what most of the examples on the net seem to have. Finally, mdstat shows the block count as it should be. $ cat /proc/mdstat Personalities : [raid6] [raid5] [raid4] md0 : active raid5 sdb1[3] sdd1[2] sdc1[1] dm-2[0] 937705344 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU] unused devices: <none> Any suggestions on how to fix this, or what to investigate next, would be appreciated! I'm not sure what you're trying to fix here, everything you posted looks sane. I'm trying to find the missing 300 GB that, as df reports, are not available. I ought to have a 900 GB array, consisting of four 300 GB devices, while only 600 GB are available. Adding the fourth device didn't increase the (visible, at least) capacity of the array. E.g. fdisk reports the array size to be 900 G, but df still claims 600 G capacity. Any clues why? -- Marko - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html You have to expand the filesystem. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
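Marko's units guess checks out numerically: the version-1 superblock sizes printed by mdadm -E are in 512-byte sectors, while the Array Size line from mdadm -D is in 1 KiB blocks. Verifying with the figures quoted above:

```shell
# Used Dev Size from mdadm -E: 625136896 sectors of 512 bytes each.
# Integer GiB (1 GiB = 1073741824 bytes):
echo $(( 625136896 * 512 / 1073741824 ))    # prints 298, i.e. ~298.09 GiB

# Array Size from mdadm -D: 937705344 blocks of 1 KiB each:
echo $(( 937705344 * 1024 / 1073741824 ))   # prints 894, i.e. ~894.27 GiB
```

Both match the GiB values mdadm itself shows in parentheses, so nothing is actually doubled; only the units differ between the two reports.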
Re: RAID 5: weird size results after Grow
On Sat, 13 Oct 2007, Marko Berg wrote: Corey Hickey wrote: Marko Berg wrote: Bill Davidsen wrote: Marko Berg wrote: Any suggestions on how to fix this, or what to investigate next, would be appreciated! I'm not sure what you're trying to fix here, everything you posted looks sane. I'm trying to find the missing 300 GB that, as df reports, are not available. I ought to have a 900 GB array, consisting of four 300 GB devices, while only 600 GB are available. Adding the fourth device didn't increase the capacity of the array (visible, at least). E.g. fdisk reports the array size to be 900 G, but df still claims 600 capacity. Any clues why? df reports the size of the filesystem, which is still about 600GB--the filesystem doesn't resize automatically when the size of the underlying device changes. You'll need to use resize2fs, resize_reiserfs, or whatever other tool is appropriate for your type of filesystem. -Corey Right, so this isn't one of my sharpest days... Thanks a bunch, Corey! -- Marko - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Ah, already answered. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
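As both replies note, growing the array enlarges the block device but not the filesystem on it; the resize step has to be run separately. A sketch (the device name and mount point are hypothetical, and the right tool depends on the filesystem):

```shell
# ext2/ext3: grow the filesystem to fill the enlarged device
# (can be done online on a mounted filesystem with recent kernels).
resize2fs /dev/md0

# XFS: must be mounted; xfs_growfs is addressed by mount point.
xfs_growfs /usr/pub

# reiserfs:
resize_reiserfs /dev/md0
```

After this, df should report the full capacity of the grown array.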
Re: RAID 5 performance issue.
On Thu, 11 Oct 2007, Andrew Clayton wrote: On Thu, 11 Oct 2007 13:06:39 -0400, Bill Davidsen wrote: Andrew Clayton wrote: On Fri, 5 Oct 2007 16:56:03 -0400, John Stoffel wrote: Can you start a 'vmstat 1' in one window, then start whatever you do to get crappy performance. That would be interesting to see. In trying to find something simple that can show the problem I'm seeing, I think I may have found the culprit. Just testing on my machine at home, I made this simple program:

/* fslattest.c */
#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>

int main(int argc, char *argv[])
{
        char file[255];

        if (argc < 2) {
                printf("Usage: fslattest file\n");
                exit(1);
        }

        strncpy(file, argv[1], 254);
        printf("Opening %s\n", file);
        while (1) {
                int testfd = open(file,
                        O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600);
                close(testfd);
                unlink(file);
                sleep(1);
        }
        exit(0);
}

If I run this program under strace in my home directory (XFS file system on a (new) disk (no raid involved) all to its own), like

$ strace -T -e open ./fslattest test

it doesn't look too bad:

open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.005043
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000212
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.016844

If I then start up a dd in the same place,

$ dd if=/dev/zero of=bigfile bs=1M count=500

then I see the problem I'm seeing at work.
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 2.000348 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 1.594441 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 2.224636 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 1.074615 Doing the same on my other disk which is Ext3 and contains the root fs, it doesn't ever stutter open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.015423 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.92 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.93 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.88 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000103 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.96 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.94 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000114 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.91 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000274 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000107 Somewhere in there was the dd, but you can't tell. I've found if I mount the XFS filesystem with nobarrier, the latency is reduced to about 0.5 seconds with occasional spikes 1 second. When doing this on the raid array. 
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.009164 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.71 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.002667 dd kicks in open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 11.580238 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 3.94 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.63 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 4.297978 dd finishes open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.000199 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.013413 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.025134 I guess I should take this to the XFS folks. Try mounting the filesystem noatime and see if that's part of the problem. Yeah, it's mounted noatime. Looks like I tracked this down to an XFS regression. http://marc.info/?l=linux-fsdevelm=119211228609886w=2 Cheers, Andrew - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Nice! Thanks for reporting the final result, 1-2 weeks of debugging/discussion, nice you found it. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID 5 performance issue.
On Sun, 7 Oct 2007, Dean S. Messing wrote: Justin Piszcz wrote: On Fri, 5 Oct 2007, Dean S. Messing wrote: Brendan Conoboy wrote: snip Is the onboard SATA controller real SATA or just an ATA-SATA converter? If the latter, you're going to have trouble getting faster performance than any one disk can give you at a time. The output of 'lspci' should tell you if the onboard SATA controller is on its own bus or sharing space with some other device. Pasting the output here would be useful. snip N00bee question: How does one tell if a machine's disk controller is an ATA-SATA converter? The output of `lspci|fgrep -i sata' is: 00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI Controller\ (rev 09) suggests a real SATA. These references to ATA in dmesg, however, make me wonder. ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata1.00: ATA-7: WDC WD1600JS-75NCB3, 10.02E04, max UDMA/133 ata1.00: 31250 sectors, multi 0: LBA48 NCQ (depth 31/32) ata1.00: configured for UDMA/133 ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata2.00: ATA-7: ST3160812AS, 3.ADJ, max UDMA/133 ata2.00: 31250 sectors, multi 0: LBA48 NCQ (depth 31/32) ata2.00: configured for UDMA/133 ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata3.00: ATA-7: ST3500630NS, 3.AEK, max UDMA/133 ata3.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32) ata3.00: configured for UDMA/133 Dean - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html His drives are either really old and do not support NCQ or he is not using AHCI in the BIOS. Sorry, Justin, if I wasn't clear. I was asking the N00bee question about _my_own_ machine. The output of lspci (on my machine) seems to indicate I have a real SATA controller on the motherboard, but the contents of dmesg, with the references to ATA-7 and UDMA/133, made me wonder if I had just an ATA-SATA converter.
Hence my question: how does one tell definitively if one has a real SATA controller on the Mother Board? The output looks like a real (AHCI-capable) SATA controller and your drives are using NCQ/AHCI. Output from one of my machines: [ 23.621462] ata1: SATA max UDMA/133 cmd 0xf8812100 ctl 0x bmdma 0x irq 219 [ 24.078390] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 24.549806] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300) As far as why it shows UDMA/133 in the kernel output I am sure there is a reason :) I know in the older SATA drives there was a bridge chip that was used to convert the drive from IDE-SATA maybe it is from those legacy days, not sure. With the newer NCQ/'native' SATA drives, the bridge chip should no longer exist. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
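For what it's worth, a rough userspace check for Dean's question is to look at which kernel driver binds the controller and at the per-disk queue depth (a sketch; /dev/sda and the sysfs paths assume a typical modern kernel):

```shell
# A native controller normally binds the ahci driver (shown on the
# "Kernel driver in use" line), not a plain PATA/IDE driver.
lspci -k | grep -A 2 -i sata

# An NCQ queue depth greater than 1 is another hint the path is
# native SATA rather than a bridged IDE drive.
cat /sys/block/sda/device/queue_depth
```

Neither check is conclusive on its own, but together with the AHCI line in lspci output they usually settle it.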
Re: very degraded RAID5, or increasing capacity by adding discs
On Mon, 8 Oct 2007, Janek Kozicki wrote: Hello, Recently I started to use mdadm and I'm very impressed by its capabilities. I have raid0 (250+250 GB) on my workstation. And I want to have raid5 (4*500 = 1500 GB) on my backup machine. The backup machine currently doesn't have raid, just a single 500 GB drive. I plan to buy more HDDs to have a bigger space for my backups but since I cannot afford all HDDs at once I face a problem of expanding an array. I'm able to add one 500 GB drive every few months until I have all 4 drives. But I cannot make a backup of a backup... so reformatting/copying all data each time when I add new disc to the array is not possible for me. Is it possible anyhow to create a very degraded raid array - a one that consists of 4 drives, but has only TWO ? This would involve some very tricky *hole* management on the block device... A one that places holes in stripes on the block device, until more discs are added to fill the holes. When the holes are filled, the block device grows bigger, and with lvm I just increase the filesystem size. This is perhaps coupled with some unstripping that moves/reorganizes blocks around to fill/defragment the holes. is it just a pipe dream? best regards PS: yes it's simple to make a degraded array of 3 drives, but I cannot afford two discs at once... -- Janek Kozicki | - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html With raid1 you can create a degraded array with 1 disk- I have done this, I have always wondered if mdadm will let you make a degraded raid 5 array with 2 disks (you'd specify 3 and only give 2) - you can always expand later. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
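To Justin's closing question: mdadm does accept the literal word "missing" in place of a member device, so a 3-device RAID5 can be created with only two real disks and completed later. A sketch (device names hypothetical; only one member may be "missing" for raid5):

```shell
# Create a degraded 3-device RAID5 using two real disks.
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      /dev/sdb1 /dev/sdc1 missing

# Months later, add the third disk and let the array resync onto it.
mdadm --add /dev/md0 /dev/sdd1
```

Note this only defers one disk purchase; it does not provide the stripe-with-holes growth Janek describes, and the degraded array has no redundancy until the missing member is added and resynced.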