Re: mismatch_cnt != 0
On Sun, 24 Feb 2008, Janek Kozicki wrote:
> Justin Piszcz said: (by the date of Sun, 24 Feb 2008 04:26:39 -0500 (EST))
>> Kernel 2.6.24.2. I've seen it on different occasions; this last time, though, it may have been due to a power outage that lasted 2 hours, and obviously the UPS did not hold up that long.
> You should connect the UPS through RS-232 or USB, and if a power-down event is detected, issue a hibernate or shutdown. Currently I issue a hibernate in this case; it works pretty well for 2.6.22 and up.
> -- Janek Kozicki

I have it hooked up, but it was a weird day: the power went on and off many times for upwards of 2-3 hours, and then it died for 2+ hours.

Justin.
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
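The "UPS event -> hibernate" idea above can be sketched as a small event hook. This is a hypothetical handler in the style of what apcupsd or NUT's upsmon would invoke; the event names and the suspend-to-disk method are assumptions (real daemons have their own event vocabularies), and the hibernate line is left commented so the sketch is safe to source.

```shell
# Hypothetical UPS event hook: the monitoring daemon calls this with an
# event name; on low battery we would hibernate (suspend-to-disk).
handle_ups_event() {
  case "$1" in
    onbattery)
      echo "AC lost, running on battery"
      ;;
    lowbattery)
      echo "battery low, hibernating"
      # echo disk > /sys/power/state   # suspend-to-disk; works well on 2.6.22+
      ;;
    online)
      echo "AC restored"
      ;;
    *)
      echo "unhandled event: $1"
      ;;
  esac
}
```

Wiring this into apcupsd's /etc/apcupsd scripts or upsmon's NOTIFYCMD is left to the daemon's own configuration.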
Re: board/controller recommendations?
On Mon, 25 Feb 2008, Dexter Filmore wrote:
> Currently my array consists of four Samsung Spinpoint SATA drives; I'm about to enlarge to 6 drives. As of now they sit on a Sil3114 controller via PCI, hence there's a bottleneck: I can't squeeze out more than 15-30 MB/s write speed (rather 15 today, as the XFS partitions on it are brim full and have started fragmenting). Now I'd like to go for an AMD board with 6 SATA channels connected via PCIe - can someone recommend a board here? Preferably AMD 690 based, so I won't need a video card or similar.
>
> Dex
> --
> -BEGIN GEEK CODE BLOCK-
> Version: 3.12
> GCS d--(+)@ s-:+ a- C UL++ P+++ L+++ E-- W++ N o? K- w--(---) !O M+ V- PS+ PE Y++ PGP t++(---)@ 5 X+(++) R+(++) tv--(+)@ b++(+++) DI+++ D- G++ e* h++ r* y?
> --END GEEK CODE BLOCK--
> http://www.vorratsdatenspeicherung.de

That's always the question: which mobo? I went Intel, as many of their chipsets (965, P35, X38) have 6 SATA ports; I am sure AMD has some as well, though. What I bought a while back was a board with 6 SATA ports, 3 PCIe x1 slots, and 1 PCIe x16. Then you buy the 2-port SATA cards (x1) and plug in your drives. Promise also came out with a 4-port PCIe x1 card, but I have not tried it or seen any reviews for it, and I do not know if it is even supported in Linux.

Also, I'd recommend you run a check/resync on your array before removing it from your current box, then make sure the two new drives do not have any problems, and (to be safe?) expand by adding 1 drive at a time?

Justin.
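The pre-move routine suggested above can be written down as a dry-run plan. This sketch only prints the commands for review rather than executing them; the md name and device names are examples, and the actual commands would need to be run as root. A long SMART self-test plus a read-only badblocks pass is one common way to shake out new drives before trusting them.

```shell
# Emit a pre-migration checklist as shell commands: scrub the existing
# array, then exercise each new drive. Nothing is executed here.
pre_migration_plan() {
  local md=$1; shift
  echo "echo check > /sys/block/$md/md/sync_action   # scrub existing array"
  for dev in "$@"; do
    echo "smartctl -t long /dev/$dev                  # long SMART self-test"
    echo "badblocks -sv /dev/$dev                     # read-only surface scan"
  done
}
```

For example, `pre_migration_plan md0 sde sdf` prints the scrub command plus the two test commands per new drive.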
Re: board/controller recommendations?
On Mon, 25 Feb 2008, Dexter Filmore wrote:
> On Monday 25 February 2008 15:02:31 Justin Piszcz wrote:
>> [original board question snipped]
>> That's always the question: which mobo? I went Intel, as many of their chipsets (965, P35, X38) have 6 SATA ports; I am sure AMD has some as well. What I bought a while back was a board with 6 SATA ports, 3 PCIe x1 slots, and 1 PCIe x16. Then you buy the 2-port SATA cards (x1) and plug in your drives.
>
> Intel means big bucks, since I'd need an Intel CPU, too. The cheapest LGA775 would be around 90 euros, where I get a midrange AMD X2 at 50-60.
>
>> Promise also came out with a 4-port PCIe x1 card, but I have not tried it or seen any reviews for it, and I do not know if it is even supported in Linux.
>
> Now *that's* Promis-ing (huh huh) - happen to know the model name?

http://www.newegg.com/Product/Product.aspx?Item=N82E16816102117
Type: SATA / SAS

>> Also, I'd recommend you run a check/resync on your array before removing it from your current box, and then make sure the two new drives do not have any problems, and (to be safe?)
>> expand by adding 1 drive at a time?
>
> Neil Brown told me to expand 2 drives at once, but I'll back up the array anyway to be safe and simply recreate. I guess selling the 750-GB drive on eBay with 5 bucks off should do :)
Re: board/controller recommendations?
On Mon, 25 Feb 2008, Dexter Filmore wrote:
> On Monday 25 February 2008 19:50:52 Justin Piszcz wrote:
>> [earlier discussion snipped]
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16816102117
>> Type: SATA / SAS
>
> Full-blown RAID 50 controller. A tad overkill-ish for softraid.
> I just came across this one: http://geizhals.at/deutschland/a254413.html
> One would have to have a board featuring a PCIe x4 slot, or an x1 slot mechanically open at the end. Then again, there's this board: http://geizhals.at/deutschland/a244789.html
> If that controller runs in Linux, those two would make a nice combo. I just saw that Adaptec provides open-source drivers for Linux, so chances are it's included or at least scheduled.

Yeah, I heard there are major problems with those (Adaptec boards); that is why I went with the open-source 2-port SATA PCIe cards, which work like a charm.

Justin.
Re: mismatch_cnt != 0
On Sat, 23 Feb 2008, Carlos Carvalho wrote:
> Justin Piszcz ([EMAIL PROTECTED]) wrote on 23 February 2008 10:44:
>> On Sat, 23 Feb 2008, Justin Piszcz wrote:
>>> On Sat, 23 Feb 2008, Michael Tokarev wrote:
>>>> Justin Piszcz wrote:
>>>>> Should I be worried?
>>>>> Fri Feb 22 20:00:05 EST 2008: Executing RAID health check for /dev/md3...
>>>>> Fri Feb 22 21:00:06 EST 2008: cat /sys/block/md3/md/mismatch_cnt
>>>>> Fri Feb 22 21:00:06 EST 2008: 936
>>>>> Fri Feb 22 21:00:09 EST 2008: Executing repair on /dev/md3
>>>>> Fri Feb 22 22:00:10 EST 2008: cat /sys/block/md3/md/mismatch_cnt
>>>>> Fri Feb 22 22:00:10 EST 2008: 936
>>>> Your /dev/md3 is a swap, right? If it's swap, it's quite common to see mismatches here. I don't know why, and I don't think it's correct (there should be a bug somewhere). If it's not swap, there should be no mismatches, UNLESS you initially built your array with --assume-clean. In any case, it's good to understand where those mismatches come from in the first place. As for the difference (or, rather, lack thereof) in the mismatched blocks after check and repair - that's exactly what is expected. Check found 936 mismatches, and repair corrected exactly the same number of them. I.e., if you run check again after repair, you should see 0 mismatches. /mjt
>>> My /dev/md3 is my main RAID 5 partition. Even after repair, it showed 936; I will re-run repair. Also, I did not build my array with --assume-clean, and I run my check on the array once a week.
> The only situation where there could be mismatches on a clean array is if you created it with --assume-clean. After a repair, a check should give zero mismatches, without a reboot. Of course, I'm supposing your hardware is working without glitches...
>> After a reboot and check, it is back to 0 -- interesting...
> Looks like a bug... Which kernel version?

Kernel 2.6.24.2. I've seen it on different occasions; this last time, though, it may have been due to a power outage that lasted 2 hours, and obviously the UPS did not hold up that long.
Will keep an eye on this to see if any additional mismatches show up.

Justin.
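The weekly check-then-repair routine discussed in this thread can be sketched as a small script. The sysfs directory is taken as a parameter so the logic can be exercised against a scratch directory; on a real box you would pass /sys/block/md3/md (and wait for sync_action to return to "idle" between the two steps; that wait is elided here).

```shell
# Minimal scrub sketch: run a check, then issue a repair only when the
# check found mismatched blocks.
md_scrub() {
  local md_dir=$1                      # e.g. /sys/block/md3/md
  echo check > "$md_dir/sync_action"
  # ... wait here for sync_action to go back to "idle" ...
  local n
  n=$(cat "$md_dir/mismatch_cnt")
  if [ "$n" -ne 0 ]; then
    echo "found $n mismatched blocks, issuing repair"
    echo repair > "$md_dir/sync_action"
  else
    echo "array clean"
  fi
}
```

Note that mismatch_cnt reports the count from the last check/repair pass, which is why (as Michael explains above) it still reads 936 right after a repair that fixed 936 blocks.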
Re: How many drives are bad?
How many drives actually failed?

Failed Devices : 1

On Tue, 19 Feb 2008, Norman Elton wrote:
> So I had my first failure today, when I got a report that one drive (/dev/sdam) failed. I've attached the output of mdadm --detail. It appears that two drives are listed as removed, but the array is still functioning. What does this mean? How many drives actually failed? This is all a test system, so I can dink around as much as necessary. Thanks for any advice!
>
> Norman Elton
>
> == OUTPUT OF MDADM ==
>         Version : 00.90.03
>   Creation Time : Fri Jan 18 13:17:33 2008
>      Raid Level : raid5
>      Array Size : 6837319552 (6520.58 GiB 7001.42 GB)
>     Device Size : 976759936 (931.51 GiB 1000.20 GB)
>    Raid Devices : 8
>   Total Devices : 7
> Preferred Minor : 4
>     Persistence : Superblock is persistent
>     Update Time : Mon Feb 18 11:49:13 2008
>           State : clean, degraded
>  Active Devices : 6
> Working Devices : 6
>  Failed Devices : 1
>   Spare Devices : 0
>          Layout : left-symmetric
>      Chunk Size : 64K
>            UUID : b16bdcaf:a20192fb:39c74cb8:e5e60b20
>          Events : 0.110
>
>     Number   Major   Minor   RaidDevice   State
>        0       66       1        0        active sync   /dev/sdag1
>        1       66      17        1        active sync   /dev/sdah1
>        2       66      33        2        active sync   /dev/sdai1
>        3       66      49        3        active sync   /dev/sdaj1
>        4       66      65        4        active sync   /dev/sdak1
>        5        0       0        5        removed
>        6        0       0        6        removed
>        7       66     113        7        active sync   /dev/sdan1
>        8       66      97        -        faulty spare  /dev/sdam1
Re: How many drives are bad?
Neil,

Is this a bug? Also, I have a question for Norman: how come your drives are sda[a-z]1? Typically it is /dev/sda1, /dev/sdb1, etc.

Justin.

On Tue, 19 Feb 2008, Norman Elton wrote:
> But why do two show up as removed? I would expect /dev/sdal1 to show up someplace, either active or failed. Any ideas?
>
> Thanks, Norman
>
> On Feb 19, 2008, at 12:31 PM, Justin Piszcz wrote:
>> How many drives actually failed?
>> Failed Devices : 1
>> [quoted mdadm --detail output snipped; see the earlier message in this thread]
Re: How many drives are bad?
Norman,

I am extremely interested in what distribution you are running on it, and what type of SW RAID you are employing (besides the one you showed here). Are all 48 drives filled, or?

Justin.

On Tue, 19 Feb 2008, Norman Elton wrote:
> Justin,
>
> This is a Sun X4500 (Thumper) box, so it's got 48 drives inside. /dev/sd[a-z] are all there as well, just in other RAID sets. Once you get to /dev/sdz, it starts up at /dev/sdaa, sdab, etc.
>
> I'd be curious if what I'm experiencing is a bug. What should I try to restore the array?
>
> Norman
>
> On 2/19/08, Justin Piszcz [EMAIL PROTECTED] wrote:
>> [earlier messages and quoted mdadm --detail output snipped; see the earlier messages in this thread]
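The sd[a-z] -> sdaa naming Norman describes follows the kernel's base-26-style scheme (a..z, then aa, ab, ...). A small sketch of the mapping from a 0-based disk index to a device name; this mirrors the scheme for illustration, it is not the kernel's own code.

```shell
# Map a 0-based disk index to the sdX naming scheme (sda..sdz, sdaa..).
# Uses the printf octal trick to turn a byte value into a letter.
sd_name() {
  local n=$1 name=""
  while [ "$n" -ge 0 ]; do
    # prepend the letter for (n mod 26), where 97 is ASCII 'a'
    name=$(printf "\\$(printf '%03o' $((97 + n % 26)))")$name
    n=$(( n / 26 - 1 ))
  done
  echo "sd$name"
}
```

Under this scheme the 39th disk (index 38) comes out as sdam and the 48th (index 47) as sdav, which matches a fully populated Thumper.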
Re: HDD errors in dmesg, but don't know why...
Looks like your replacement disk is no good, the SATA port is bad, or there is some other issue. I am not sure what "SDB FIS" means, but as long as you keep getting that error, don't expect the drive to work correctly. I had a drive that did a similar thing (a DOA Raptor), and after I got the replacement it worked fine. However, like I said, I am not sure what that "SDB FIS" error means.

On Mon, 18 Feb 2008, Steve Fairbairn wrote:
> Hi All,
>
> I've got a degraded RAID5 to which I'm trying to add the replacement disk. Trouble is, every time the recovery starts, it flies along at 70MB/s or so. Then, after doing about 1%, it starts dropping rapidly, until eventually a device is marked failed. When I look in dmesg, I get the following...
>
> SCSI device sdd: 976773168 512-byte hdwr sectors (500108 MB)
> sdd: Write Protect is off
> sdd: Mode Sense: 00 3a 00 00
> SCSI device sdd: drive cache: write back
> ata5.00: exception Emask 0x0 SAct 0x7ff SErr 0x0 action 0x0
> ata5.00: (irq_stat 0x00060002, device error via SDB FIS)
> ata5.00: cmd 60/00:10:3f:0e:f9/01:00:00:00:00/40 tag 2 cdb 0x0 data 131072 in
>          res 41/40:00:50:0e:f9/9c:00:00:00:00/40 Emask 0x9 (media error)
> ata5.00: configured for UDMA/100
> ata5: EH complete
> SCSI device sdd: 976773168 512-byte hdwr sectors (500108 MB)
> sdd: Write Protect is off
> sdd: Mode Sense: 00 3a 00 00
> SCSI device sdd: drive cache: write back
> ata5.00: exception Emask 0x0 SAct 0x7ff SErr 0x0 action 0x0
> ata5.00: (irq_stat 0x00060002, device error via SDB FIS)
> ata5.00: cmd 60/00:18:3f:02:f9/01:00:00:00:00/40 tag 3 cdb 0x0 data 131072 in
>          res 41/40:00:c3:02:f9/9c:00:00:00:00/40 Emask 0x9 (media error)
> ata5.00: configured for UDMA/100
> ata5: EH complete
> [the same exception repeats three more times, alternating between the two
>  sectors above, before the device is failed]
>
> I've no idea what to make of these errors. As far as I can work out, the HDs themselves are fine. They are all less than 2 months old. The box is CentOS 5.1:
>
> Linux space.homenet.com 2.6.18-53.1.13.el5 #1 SMP Tue Feb 12 13:02:30 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> Any suggestions on what I can do to stop this issue?
>
> Steve.
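For what it's worth, the SDB FIS is the Set Device Bits FIS, the frame a SATA drive uses to report NCQ command status, so "device error via SDB FIS" just means the error came back on a queued command; the telling part is "Emask 0x9 (media error)", which points at bad sectors on the drive itself rather than the controller. A quick triage sketch that tallies such errors in a saved dmesg capture (the log path is an example; the same two res taskfiles repeating means the same sectors keep failing):

```shell
# Summarize libata media errors in a saved dmesg log: total count, plus the
# number of distinct failing res taskfiles (same res value = same sector).
media_error_summary() {
  local log=$1
  echo "media errors: $(grep -c 'media error' "$log")"
  echo "distinct failing sectors: $(grep 'media error' "$log" \
    | grep -o 'res 41/[0-9a-f:/]*' | sort -u | wc -l | tr -d ' ')"
}
```

Against the log quoted above this would report several media errors but only two distinct failing taskfiles, i.e. a couple of bad spots that the rebuild keeps hitting.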
Re: RAID5 how to change chunk size from 64 to 128, 256? Is it possible?
When you create the array, it's --chunk or -c. I found 256 KiB to 1024 KiB to be optimal.

Justin.

On Sat, 9 Feb 2008, Andreas-Sokov wrote:
> Hi linux-raid.
>
> RAID5: how to change the chunk size from 64 to 128 or 256? Is it possible? Has somebody done this?
>
> --
> Best regards,
> Andreas-Sokov
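To make the point concrete: with the md/mdadm of this era the chunk size is fixed at creation time (mdadm -c/--chunk, in KiB), so the practical path is back up, re-create with the new chunk size, restore. This sketch only prints the creation command for review rather than running it; the md name and device names are examples.

```shell
# Build (but do not run) an mdadm creation command with an explicit chunk
# size. First argument is the chunk size in KiB; the rest are member devices.
mk_raid5_cmd() {
  local chunk_kib=$1; shift
  echo "mdadm --create /dev/md0 --level=5 --chunk=$chunk_kib --raid-devices=$# $*"
}
```

For example, `mk_raid5_cmd 256 /dev/sda1 /dev/sdb1 /dev/sdc1` prints a 3-device RAID 5 creation line with a 256 KiB chunk.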
Re: Any inexpensive hardware recommendations for PCI interface cards?
On Fri, 8 Feb 2008, Iustin Pop wrote:
> On Fri, Feb 08, 2008 at 08:54:55AM -0500, Justin Piszcz wrote:
>> The Promise TX4 PCI works great and supports SATA/300 + NCQ, etc., $60-$70.
>
> Wait, I used the TX4 PCI up until ~2.6.22 and it didn't support NCQ, AFAIK. Are you sure the current driver supports NCQ? I might then revive that card :)
>
> thanks, iustin

Whoa, nice catch. I meant the Promise SATA300 TX4, which now retails for $59.99 with free shipping:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816102062

Justin.
Re: Any inexpensive hardware recommendations for PCI interface cards?
On Fri, 8 Feb 2008, Iustin Pop wrote:
> On Fri, Feb 08, 2008 at 02:24:15PM -0500, Justin Piszcz wrote:
>> Whoa, nice catch. I meant the Promise SATA300 TX4, which now retails for $59.99 with free shipping:
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16816102062
>
> :) Actually, I meant exactly the Promise SATA300 TX4 (the board is in my hand; the chip says PDC40718). The HW supports NCQ, but the Linux sata_promise driver didn't support NCQ when I tested it. Can someone confirm it does NCQ today (2.6.24)?
>
> iustin

I used the board with a Seagate 400 GB NCQ drive, and I recall seeing "Port Up 3.0Gbps/NCQ (31/32)" within the scrolling text upon boot -- but it was a while ago.

Justin.
Re: Any inexpensive hardware recommendations for PCI interface cards?
On Fri, 8 Feb 2008, Bill Davidsen wrote:
> Steve Fairbairn wrote:
>> Can anyone see any issues with what I'm trying to do?
> No.
>> Are there any known issues with IT8212 cards (they worked as straight disks on Linux fine)?
> No idea, I don't have that card.
>> Is anyone using an array with disks on PCI interface cards?
> Works. I've mixed PATA, SATA, onboard, PCI, and FireWire (lack of controllers is the mother of invention). As long as the device under the RAID works, the RAID should work.
>> Is there an issue with mixing motherboard interfaces and PCI card based ones?
> Not that I've found.
>> Does anyone recommend any inexpensive (probably SATA-II) PCI interface cards?
> Not I. Large drives have cured me of FrankenRAID setups recently, other than building little arrays out of USB devices for backup.
>
> --
> Bill Davidsen [EMAIL PROTECTED]
> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark

The Promise TX4 PCI works great and supports SATA/300 + NCQ, etc., $60-$70.

Justin.
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:
> On Thu, Jan 31, 2008 at 02:55:07AM +0100, Keld Jørn Simonsen wrote:
>> On Wed, Jan 30, 2008 at 11:36:39PM +0100, Janek Kozicki wrote:
>>> Keld Jørn Simonsen said: (by the date of Wed, 30 Jan 2008 23:00:07 +0100)
>> All the raid10's will have double time for writing, and raid5 and raid6 will also have double or triple writing times, given that you can do striped writes on the raid0.
>
> For raid5 and raid6 I think this is even worse. My take is that for raid5, when you write something, you first read the chunk data involved, then you read the parity data, then you xor-subtract the data to be changed and xor-add the new data, and then write the new data chunk and the new parity chunk. In total, 2 reads and 2 writes. The reads and writes happen on the same chunks, so latency is minimized, but in essence it is still 4 IO operations, where it is only 2 writes on raid1/raid10 - that is, only half the speed for writing on raid5 compared to raid1/10. On raid6 this amounts to 6 IO operations, resulting in 1/3 of the writing speed of raid1/10.
>
> I note in passing that there is no difference between xor-subtract and xor-add. Also, I assume that you can calculate the parities of both raid5 and raid6 given the old parity chunks and the old and new data chunks. If you have to calculate the new parities by reading all the component data chunks, this is going to be really expensive, both in IO and CPU. For a 10-drive raid5 this would involve reading 9 data chunks, making writes 5 times as expensive as raid1/10.
>
> best regards
> keld

On my benchmarks, RAID 5 gave the best overall speed with 10 Raptors, although I did not play with the various offsets, etc., as much as I have tweaked the RAID 5.

Justin.
Re: recommendations for stripe/chunk size
On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:
> Hi
>
> I am looking at revising our howto. I see a number of places where a chunk size of 32 kiB is recommended, and even recommendations on maybe using sizes of 4 kiB. My own take on that is that this really hurts performance.
>
> Normal disks have a rotation speed of between 5400 (laptop), 7200 (IDE/SATA), and 10000 (SCSI) rounds per minute, giving an average spinning time for one round of 6 to 12 ms, and an average rotational latency of half this, that is, 3 to 6 ms. Then you need to add head movement, which is something like 2 to 20 ms - in total, an average seek time of 5 to 26 ms, averaging around 13-17 ms.
>
> In about 15 ms you can read, on current SATA-II (300 MB/s) or ATA/133, something like 600 to 1200 kB, at actual transfer rates of 80 MB/s on SATA-II and 40 MB/s on ATA/133. So to get some bang for the buck and actually transfer some data, you should have something like 256/512 kiB chunks.
>
> With a transfer rate of 50 MB/s and chunk sizes of 256 kiB, giving a time of about 20 ms per transaction, you should be able, with random reads, to transfer 12 MB/s - my actual figure is about 30 MB/s, which is possibly because of the elevator effect of the file system driver. With a size of 4 kiB per chunk, you have a time of 15 ms per transaction, or 66 transactions per second, or a transfer rate of 250 kB/s. So 256 kiB vs 4 kiB speeds up the transfer by a factor of 50.
>
> I actually think the kernel should operate with block sizes like this, and not with 4 kiB blocks. It is the readahead and the elevator algorithms that save us from randomly reading 4 kB at a time.
>
> I also see that there are some memory constraints on this. Having maybe 1000 processes reading, as for my mirror service, 256 kiB buffers would be acceptable, occupying 256 MB RAM. That is reasonable, and I could even tolerate 512 MB RAM used. But going to 1 MiB buffers would be overdoing it for my configuration.
>
> What would be the recommended chunk size for today's equipment?
> Best regards
> Keld

My benchmarks concluded that 256 KiB to 1024 KiB is optimal; too far below or above that range results in degradation.

Justin.
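Keld's arithmetic can be packaged as a small model: effective random-read throughput is roughly chunk / (seek + chunk/rate). A sketch in integer shell arithmetic, using the rough approximation that 1 MB/s transfers about 1 KiB per ms:

```shell
# Effective random-read throughput for a given chunk size: each transaction
# costs one average seek plus the chunk transfer time.
# Args: chunk (KiB), average seek (ms), media rate (MB/s ~ KiB/ms). Output: KiB/s.
random_read_kibps() {
  local chunk=$1 seek=$2 rate=$3
  local xfer=$(( chunk / rate ))            # chunk transfer time, ms
  echo $(( chunk * 1000 / (seek + xfer) ))
}
```

With a 15 ms average seek and a 50 MB/s media rate this gives about 12800 KiB/s at 256 KiB chunks versus about 266 KiB/s at 4 KiB chunks, matching the post's 12 MB/s vs 250 kB/s figures and the factor-of-roughly-50 claim.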
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:
> On Tue, Feb 05, 2008 at 11:54:27AM -0500, Justin Piszcz wrote:
>> [Keld's raid5/raid6 write-cost analysis snipped; see the earlier message in this thread]
>> On my benchmarks, RAID 5 gave the best overall speed with 10 Raptors, although I did not play with the various offsets, etc., as much as I have tweaked the RAID 5.
>
> Could you give some figures?

I remember testing with bonnie++: raid10 was about half the speed (200-265 MiB/s) of RAID 5 (400-420 MiB/s) for sequential output, but input was closer to RAID 5 speeds and did not seem affected (~550 MiB/s).

Justin.
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote: On Tue, Feb 05, 2008 at 05:28:27PM -0500, Justin Piszcz wrote: Could you give some figures? I remember testing with bonnie++ and raid10 was about half the speed (200-265 MiB/s) as RAID5 (400-420 MiB/s) for sequential output, but input was closer to RAID5 speeds/did not seem affected (~550MiB/s). Impressive. What level of raid10 was involved? And what type of Like I said, it was baseline testing, so pretty much the default raid10 when you create it via mdadm, I did not mess with offsets, etc. equipment, how many disks? Ten 10,000rpm raptors. Maybe the better output for raid5 could be due to some striping - AFAIK raid5 will be striping quite well, and writes almost equal to read times indicate that the writes are striping too. best regards keld
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
On Mon, 4 Feb 2008, Michael Tokarev wrote: Moshe Yudkowsky wrote: [] If I'm reading the man pages, Wikis, READMEs and mailing lists correctly -- not necessarily the case -- the ext3 file system uses the equivalent of data=journal as a default. ext3 defaults to data=ordered, not data=journal. ext2 doesn't have journal at all. The question then becomes what data scheme to use with reiserfs on the I'd say don't use reiserfs in the first place ;) Another way to phrase this: unless you're running data-center grade hardware and have absolute confidence in your UPS, you should use data=journal for reiserfs and perhaps avoid XFS entirely. By the way, even if you do have a good UPS, there should be some control program for it, to properly shut down your system when UPS loses the AC power. So far, I've seen no such programs... /mjt Why avoid XFS entirely? esandeen, any comments here? Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
On Mon, 4 Feb 2008, Michael Tokarev wrote: Eric Sandeen wrote: [] http://oss.sgi.com/projects/xfs/faq.html#nulls and note that recent fixes have been made in this area (also noted in the faq) Also - the above all assumes that when a drive says it's written/flushed data, that it truly has. Modern write-caching drives can wreak havoc with any journaling filesystem, so that's one good reason for a UPS. If Unfortunately an UPS does not *really* help here. Because unless it has control program which properly shuts system down on the loss of input power, and the battery really has the capacity to power the system while it's shutting down (anyone tested this? With new UPS? and after an year of use, when the battery is not new?), -- unless the UPS actually has the capacity to shutdown system, it will cut the power at an unexpected time, while the disk(s) still has dirty caches... You use nut and a large enough UPS to handle the load of the system, it shuts the machine down just fine. the drive claims to have metadata safe on disk but actually does not, and you lose power, the data claimed safe will evaporate, there's not much the fs can do. IO write barriers address this by forcing the drive to flush order-critical data before continuing; xfs has them on by default, although they are tested at mount time and if you have something in between xfs and the disks which does not support barriers (i.e. lvm...) then they are disabled again, with a notice in the logs. Note also that with linux software raid barriers are NOT supported. /mjt - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Fri, 18 Jan 2008, Bill Davidsen wrote: Justin Piszcz wrote: On Thu, 17 Jan 2008, Al Boldi wrote: Justin Piszcz wrote: On Wed, 16 Jan 2008, Al Boldi wrote: Also, can you retest using dd with different block-sizes? I can do this, moment.. I know about oflag=direct but I choose to use dd with sync and measure the total time it takes. /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync' So I was asked on the mailing list to test dd with various chunk sizes, here is the length of time it took to write 10 GiB and sync per each chunk size: 4=chunk.txt:0:25.46 8=chunk.txt:0:25.63 16=chunk.txt:0:25.26 32=chunk.txt:0:25.08 64=chunk.txt:0:25.55 128=chunk.txt:0:25.26 256=chunk.txt:0:24.72 512=chunk.txt:0:24.71 1024=chunk.txt:0:25.40 2048=chunk.txt:0:25.71 4096=chunk.txt:0:27.18 8192=chunk.txt:0:29.00 16384=chunk.txt:0:31.43 32768=chunk.txt:0:50.11 65536=chunk.txt:2:20.80 What do you get with bs=512,1k,2k,4k,8k,16k... Thanks! -- Al - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html root 4621 0.0 0.0 12404 760 pts/2D+ 17:53 0:00 mdadm -S /dev/md3 root 4664 0.0 0.0 4264 728 pts/5S+ 17:54 0:00 grep D Tried to stop it when it was re-syncing, DEADLOCK :( [ 305.464904] md: md3 still in use. [ 314.595281] md: md_do_sync() got signal ... exiting Anyhow, done testing, time to move data back on if I can kill the resync process w/out deadlock. So does that indicate that there is still a deadlock issue, or that you don't have the latest patches installed? -- Bill Davidsen [EMAIL PROTECTED] Woe unto the statesman who makes war without a reason that will still be valid when the war is over... Otto von Bismark I was trying to stop the raid when it was building, vanilla 2.6.23.14. Justin. 
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Fri, 18 Jan 2008, Greg Cormier wrote: Also, don't use ext*, XFS can be up to 2-3x faster (in many of the benchmarks). I'm going to swap file systems and give it a shot right now! :) How is stability of XFS? I heard recovery is easier with ext2/3 due to more people using it, more tools available, etc? Greg Recovery is actually easier with XFS because log recovery is built into the kernel (you don't need to run a utility after a crash); however, there is xfs_repair for the cases the in-kernel recovery cannot fix. I have been using it for 4-5 years now. Also, with CoRaids (ATA over Ethernet) many volumes are above 8TB, and ext3 only works up to 8TB, so it's not even an option any longer. Justin.
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Fri, 18 Jan 2008, Greg Cormier wrote: Justin, thanks for the script. Here are my results. I ran it a few times with different tests, hence the small number of results you see here; I slowly trimmed out the obviously not-ideal sizes. Nice, we all love benchmarks!! :) System --- Athlon64 3500 2GB RAM 4x500GB WD Raid editions, raid 5. SDE is the old 4-platter version (5000YS), the others are the 3-platter version. Faster :-) Ok. /dev/sdb: Timing buffered disk reads: 240 MB in 3.00 seconds = 79.91 MB/sec /dev/sdc: Timing buffered disk reads: 248 MB in 3.01 seconds = 82.36 MB/sec /dev/sdd: Timing buffered disk reads: 248 MB in 3.02 seconds = 82.22 MB/sec /dev/sde: (older model, 4 platters instead of 3) Timing buffered disk reads: 210 MB in 3.01 seconds = 69.87 MB/sec /dev/md3: Timing buffered disk reads: 628 MB in 3.00 seconds = 209.09 MB/sec Testing --- Test was: dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync 64-chunka.txt:2:00.63 128-chunka.txt:2:00.20 256-chunka.txt:2:01.67 512-chunka.txt:2:19.90 1024-chunka.txt:2:59.32 For your configuration, a 64-256k chunk seems optimal for this (somewhat hypothetical) benchmark :) Test was: unraring multipart RARs, 1.2 gigabytes. Source and dest drive were the raid array. 64-chunkc.txt:1:04.20 128-chunkc.txt:0:49.37 256-chunkc.txt:0:48.88 512-chunkc.txt:0:41.20 1024-chunkc.txt:0:40.82 1 meg looks like it's the best, which is what I use today; a 1 MiB chunk offers the best performance by far, at least in all of my testing with big files, such as the tests you performed. So, there's a toss-up between 256 and 512. Yeah, for dd performance, not real life. If I'm interpreting correctly here, raw throughput is better with 256, but 512 seems to work better with real-world stuff? Look above, 1 MiB got you the fastest unrar time.
I'll try to think up another test or two perhaps, and remove 64 as one of the possible options to save time (mke2fs takes a while on 1.5TB). Also, don't use ext*, XFS can be up to 2-3x faster (in many of the benchmarks). Next step will be playing with read-aheads and stripe cache sizes I guess! I'm open to any comments/suggestions you guys have! Greg
Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
For these benchmarks I timed how long it takes to extract a standard 4.4 GiB DVD: Settings: Software RAID 5 with the following settings (until I change those too): Base setup: blockdev --setra 65536 /dev/md3 echo 16384 /sys/block/md3/md/stripe_cache_size echo Disabling NCQ on all disks... for i in $DISKS do echo Disabling NCQ on $i echo 1 /sys/block/$i/device/queue_depth done p34:~# grep : *chunk* |sort -n 4-chunk.txt:0:45.31 8-chunk.txt:0:44.32 16-chunk.txt:0:41.02 32-chunk.txt:0:40.50 64-chunk.txt:0:40.88 128-chunk.txt:0:40.21 256-chunk.txt:0:40.14*** 512-chunk.txt:0:40.35 1024-chunk.txt:0:41.11 2048-chunk.txt:0:43.89 4096-chunk.txt:0:47.34 8192-chunk.txt:0:57.86 16384-chunk.txt:1:09.39 32768-chunk.txt:1:26.61 It would appear a 256 KiB chunk-size is optimal. So what about NCQ? 1=ncq_depth.txt:0:40.86*** 2=ncq_depth.txt:0:40.99 4=ncq_depth.txt:0:42.52 8=ncq_depth.txt:0:43.57 16=ncq_depth.txt:0:42.54 31=ncq_depth.txt:0:42.51 Keeping it off seems best. 1=stripe_and_read_ahead.txt:0:40.86 2=stripe_and_read_ahead.txt:0:40.99 4=stripe_and_read_ahead.txt:0:42.52 8=stripe_and_read_ahead.txt:0:43.57 16=stripe_and_read_ahead.txt:0:42.54 31=stripe_and_read_ahead.txt:0:42.51 256=stripe_and_read_ahead.txt:1:44.16 1024=stripe_and_read_ahead.txt:1:07.01 2048=stripe_and_read_ahead.txt:0:53.59 4096=stripe_and_read_ahead.txt:0:45.66 8192=stripe_and_read_ahead.txt:0:40.73 16384=stripe_and_read_ahead.txt:0:38.99** 16384=stripe_and_65536_read_ahead.txt:0:38.67 16384=stripe_and_65536_read_ahead.txt:0:38.69 (again, this is what I use from earlier benchmarks) 32768=stripe_and_read_ahead.txt:0:38.84 What about logbufs? 2=logbufs.txt:0:39.21 4=logbufs.txt:0:39.24 8=logbufs.txt:0:38.71 (again) 2=logbufs.txt:0:42.16 4=logbufs.txt:0:38.79 8=logbufs.txt:0:38.71** (yes) What about logbsize? 16k=logbsize.txt:1:09.22 32k=logbsize.txt:0:38.70 64k=logbsize.txt:0:39.04 128k=logbsize.txt:0:39.06 256k=logbsize.txt:0:38.59** (best) What about allocsize? 
(default=1024k) 4k=allocsize.txt:0:39.35 8k=allocsize.txt:0:38.95 16k=allocsize.txt:0:38.79 32k=allocsize.txt:0:39.71 64k=allocsize.txt:1:09.67 128k=allocsize.txt:0:39.04 256k=allocsize.txt:0:39.11 512k=allocsize.txt:0:39.01 1024k=allocsize.txt:0:38.75** (default) 2048k=allocsize.txt:0:39.07 4096k=allocsize.txt:0:39.15 8192k=allocsize.txt:0:39.40 16384k=allocsize.txt:0:39.36 What about the agcount? 2=agcount.txt:0:37.53 4=agcount.txt:0:38.56 8=agcount.txt:0:40.86 16=agcount.txt:0:39.05 32=agcount.txt:0:39.07** (default) 64=agcount.txt:0:39.29 128=agcount.txt:0:39.42 256=agcount.txt:0:38.76 512=agcount.txt:0:38.27 1024=agcount.txt:0:38.29 2048=agcount.txt:1:08.55 4096=agcount.txt:0:52.65 8192=agcount.txt:1:06.96 16384=agcount.txt:1:31.21 32768=agcount.txt:1:09.06 65536=agcount.txt:1:54.96 So far I have: p34:~# mkfs.xfs -f -l lazy-count=1,version=2,size=128m -i attr=2 /dev/md3 meta-data=/dev/md3 isize=256agcount=32, agsize=10302272 blks = sectsz=4096 attr=2 data = bsize=4096 blocks=329671296, imaxpct=25 = sunit=64 swidth=576 blks, unwritten=1 naming =version 2 bsize=4096 log =internal log bsize=4096 blocks=32768, version=2 = sectsz=4096 sunit=1 blks, lazy-count=1 realtime =none extsz=2359296 blocks=0, rtextents=0 p34:~# grep /dev/md3 /etc/fstab /dev/md3/r1 xfs noatime,nodiratime,logbufs=8,logbsize=262144 0 1 Notice how mkfs.xfs 'knows' the sunit and swidth, and it is the correct units too because it is software raid, and it pulls this information from that layer, unlike HW raid which will not have a clue of what is underneath and say sunit=0,swidth=0. However, in earlier testing I actually made them both 0 and it actually made performance better: http://home.comcast.net/~jpiszcz/sunit-swidth/results.html In any case, I am re-running bonnie++ once more with a 256 KiB chunk and will compare to those values in a bit. Justin. 
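The sunit/swidth figures mkfs.xfs reports above fall straight out of the raid geometry; a quick sketch of the arithmetic, assuming 4 KiB filesystem blocks and one parity disk (as in this 10-drive raid5 with a 256 KiB chunk):

```shell
# sunit = chunk size expressed in filesystem blocks;
# swidth = sunit * number of data disks (total disks minus parity).
chunk_kib=256; block_kib=4; disks=10; parity_disks=1
sunit=$(( chunk_kib / block_kib ))
swidth=$(( sunit * (disks - parity_disks) ))
echo "sunit=$sunit swidth=$swidth"   # matches the mkfs.xfs output above
```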
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Wed, 16 Jan 2008, Justin Piszcz wrote: For these benchmarks I timed how long it takes to extract a standard 4.4 GiB DVD: Settings: Software RAID 5 with the following settings (until I change those too): http://home.comcast.net/~jpiszcz/sunit-swidth/newresults.html Any idea why an sunit and swidth of 0 (and -d agcount=4) is faster, at least for sequential input/output, than the proper sunit/swidth values? It does not make sense. Justin.
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Wed, 16 Jan 2008, Al Boldi wrote: Justin Piszcz wrote: For these benchmarks I timed how long it takes to extract a standard 4.4 GiB DVD: Settings: Software RAID 5 with the following settings (until I change those too): Base setup: blockdev --setra 65536 /dev/md3 echo 16384 /sys/block/md3/md/stripe_cache_size echo Disabling NCQ on all disks... for i in $DISKS do echo Disabling NCQ on $i echo 1 /sys/block/$i/device/queue_depth done p34:~# grep : *chunk* |sort -n 4-chunk.txt:0:45.31 8-chunk.txt:0:44.32 16-chunk.txt:0:41.02 32-chunk.txt:0:40.50 64-chunk.txt:0:40.88 128-chunk.txt:0:40.21 256-chunk.txt:0:40.14*** 512-chunk.txt:0:40.35 1024-chunk.txt:0:41.11 2048-chunk.txt:0:43.89 4096-chunk.txt:0:47.34 8192-chunk.txt:0:57.86 16384-chunk.txt:1:09.39 32768-chunk.txt:1:26.61 It would appear a 256 KiB chunk-size is optimal. Can you retest with different max_sectors_kb on both md and sd? Remember this is SW RAID, so max_sectors_kb will only affect the individual disks underneath the SW RAID, I have benchmarked in the past, the defaults chosen by the kernel are optimal, changing them did not make any noticable improvements. Also, can you retest using dd with different block-sizes? I can do this, moment.. I know about oflag=direct but I choose to use dd with sync and measure the total time it takes. /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync' So I was asked on the mailing list to test dd with various chunk sizes, here is the length of time it took to write 10 GiB and sync per each chunk size: 4=chunk.txt:0:25.46 8=chunk.txt:0:25.63 16=chunk.txt:0:25.26 32=chunk.txt:0:25.08 64=chunk.txt:0:25.55 128=chunk.txt:0:25.26 256=chunk.txt:0:24.72 512=chunk.txt:0:24.71 1024=chunk.txt:0:25.40 2048=chunk.txt:0:25.71 4096=chunk.txt:0:27.18 8192=chunk.txt:0:29.00 16384=chunk.txt:0:31.43 32768=chunk.txt:0:50.11 65536=chunk.txt:2:20.80 Justin. 
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Wed, 16 Jan 2008, Greg Cormier wrote: What sort of tools are you using to get these benchmarks, and can I use them for ext3? Very interested in running this on my server. Thanks, Greg You can use whatever suits you, such as untarring a kernel source tree, copying files, untarring backups, etc.; you should benchmark specifically what *your* workload is. Here is the skeleton, using bash (don't forget to turn off the cron daemon):

for i in 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536
do
  cd /
  umount /r1
  mdadm -S /dev/md3
  mdadm --create --assume-clean --verbose /dev/md3 --level=5 --raid-devices=10 --chunk=$i --run /dev/sd[c-l]1
  /etc/init.d/oraid.sh # to optimize my raid stuff
  mkfs.xfs -f /dev/md3
  mount /dev/md3 /r1 -o logbufs=8,logbsize=262144
  # then simply add what you do often here
  # everyone's workload is different
  /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync'
done

Then just run: grep : /root/*chunk* | sort -n to get the results in the same format. Justin.
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Thu, 17 Jan 2008, Al Boldi wrote: Justin Piszcz wrote: On Wed, 16 Jan 2008, Al Boldi wrote: Also, can you retest using dd with different block-sizes? I can do this, moment.. I know about oflag=direct but I choose to use dd with sync and measure the total time it takes. /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync' So I was asked on the mailing list to test dd with various chunk sizes, here is the length of time it took to write 10 GiB and sync per each chunk size: 4=chunk.txt:0:25.46 8=chunk.txt:0:25.63 16=chunk.txt:0:25.26 32=chunk.txt:0:25.08 64=chunk.txt:0:25.55 128=chunk.txt:0:25.26 256=chunk.txt:0:24.72 512=chunk.txt:0:24.71 1024=chunk.txt:0:25.40 2048=chunk.txt:0:25.71 4096=chunk.txt:0:27.18 8192=chunk.txt:0:29.00 16384=chunk.txt:0:31.43 32768=chunk.txt:0:50.11 65536=chunk.txt:2:20.80 What do you get with bs=512,1k,2k,4k,8k,16k... Thanks! -- Al - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Done testing for now, but I did test with 256k with a 256k chunk and obviously that got good results, just like 1m with a 1mb chunk, 460-480 MiB/s. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
On Thu, 17 Jan 2008, Al Boldi wrote: Justin Piszcz wrote: On Wed, 16 Jan 2008, Al Boldi wrote: Also, can you retest using dd with different block-sizes? I can do this, moment.. I know about oflag=direct but I choose to use dd with sync and measure the total time it takes. /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync' So I was asked on the mailing list to test dd with various chunk sizes, here is the length of time it took to write 10 GiB and sync per each chunk size: 4=chunk.txt:0:25.46 8=chunk.txt:0:25.63 16=chunk.txt:0:25.26 32=chunk.txt:0:25.08 64=chunk.txt:0:25.55 128=chunk.txt:0:25.26 256=chunk.txt:0:24.72 512=chunk.txt:0:24.71 1024=chunk.txt:0:25.40 2048=chunk.txt:0:25.71 4096=chunk.txt:0:27.18 8192=chunk.txt:0:29.00 16384=chunk.txt:0:31.43 32768=chunk.txt:0:50.11 65536=chunk.txt:2:20.80 What do you get with bs=512,1k,2k,4k,8k,16k... Thanks! -- Al - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html root 4621 0.0 0.0 12404 760 pts/2D+ 17:53 0:00 mdadm -S /dev/md3 root 4664 0.0 0.0 4264 728 pts/5S+ 17:54 0:00 grep D Tried to stop it when it was re-syncing, DEADLOCK :( [ 305.464904] md: md3 still in use. [ 314.595281] md: md_do_sync() got signal ... exiting Anyhow, done testing, time to move data back on if I can kill the resync process w/out deadlock. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
How do I get rid of old device?
p34:~# mdadm /dev/md3 --zero-superblock p34:~# mdadm --examine --scan ARRAY /dev/md0 level=raid1 num-devices=2 UUID=f463057c:9a696419:3bcb794a:7aaa12b2 ARRAY /dev/md1 level=raid1 num-devices=2 UUID=98e4948c:c6685f82:e082fd95:e7f45529 ARRAY /dev/md2 level=raid1 num-devices=2 UUID=330c9879:73af7d3e:57f4c139:f9191788 ARRAY /dev/md3 level=raid0 num-devices=10 UUID=6dc12c36:b3517ff9:083fb634:68e9eb49 p34:~# I cannot seem to get rid of /dev/md3; it's almost as if there is a piece of it on the root (2) disks or a reference to it? I also dd'd the other 10 disks (non-root) and /dev/md3 persists.
Re: How do I get rid of old device?
On Wed, 16 Jan 2008, Justin Piszcz wrote: p34:~# mdadm /dev/md3 --zero-superblock p34:~# mdadm --examine --scan ARRAY /dev/md0 level=raid1 num-devices=2 UUID=f463057c:9a696419:3bcb794a:7aaa12b2 ARRAY /dev/md1 level=raid1 num-devices=2 UUID=98e4948c:c6685f82:e082fd95:e7f45529 ARRAY /dev/md2 level=raid1 num-devices=2 UUID=330c9879:73af7d3e:57f4c139:f9191788 ARRAY /dev/md3 level=raid0 num-devices=10 UUID=6dc12c36:b3517ff9:083fb634:68e9eb49 p34:~# I cannot seem to get rid of /dev/md3; it's almost as if there is a piece of it on the root (2) disks or a reference to it? I also dd'd the other 10 disks (non-root) and /dev/md3 persists. Hopefully this will clear it out: p34:~# for i in /dev/sd[c-l]; do /usr/bin/time dd if=/dev/zero of=$i bs=1M & done [1] 4625 [2] 4626 [3] 4627 [4] 4628 [5] 4629 [6] 4630 [7] 4631 [8] 4632 [9] 4633 [10] 4634 p34:~# Good aggregate bandwidth at least, writing to all 10 disks.
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa
1 9 0 46472 7201008 7342400 0 658756 2339 2242 0 22 24 54
3 10 0 44132 7204680 7329200 0 660040 2335 2276 0 22 19 59
5 8 0 48196 7201840 7373600 0 652708 2403 1645 0 23 11 66
2 9 0 45728 7205036 7262800 0 659844 2296 1891 0 23 11 66
0 11 0 47672 7202992 7256400 0 672856 2327 1616 0 22 7 71
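One thing worth trying for the stale-array problem above (a hedged sketch; the partition names are placeholders for the array's actual members): point --zero-superblock at each member device rather than at the assembled /dev/md3, since the md superblocks live on the component devices themselves. Printed as a dry run; drop the leading "echo" to actually wipe.

```shell
# Hypothetical member list -- substitute the real components of md3
# (e.g. the ten /dev/sd[c-l]1 partitions). Dry run only.
for part in sdc1 sdd1 sde1; do
    echo mdadm --zero-superblock "/dev/$part"
done
```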
Re: New XFS benchmarks using David Chinner's recommendations for XFS-based optimizations.
On Fri, 4 Jan 2008, Changliang Chen wrote: Hi Justin, From your report, it looks like the p34-default's behavior is better; which item makes you consider that p34-dchinner looks nice? -- Best Regards The re-write and sequential input and output are faster for dchinner. Justin.
Re: Change Stripe size?
On Mon, 31 Dec 2007, Greg Cormier wrote: So I've been slowly expanding my knowledge of mdadm/linux raid. I've got a 1 terabyte array which stores mostly large media files, and from my reading, increasing the stripe size should really help my performance. Is there any way to do this to an existing array, or will I need to back up the array and re-create it with a larger stripe size? Thanks and happy new year! Greg Backing up the array and re-creating it is currently the only way with sw raid, AFAIK. Justin.
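A sketch of that backup-and-recreate path (all device names, mount points, and the example chunk size are placeholders, not Greg's actual setup). It is printed as a dry run because the create step destroys the existing array, so the backup step is not optional.

```shell
# Example only: 4-disk raid5 rebuilt with a 1024 KiB chunk.
chunk_kib=1024
create="mdadm --create /dev/md3 --level=5 --raid-devices=4 --chunk=$chunk_kib --run /dev/sd[b-e]1"
echo "1) back up:  rsync -a /array/ /backup/"
echo "2) stop:     mdadm --stop /dev/md3"
echo "3) recreate: $create"
echo "4) mkfs the new array and restore the data"
```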
AS/CFQ/DEADLINE/NOOP on Linux SW RAID?
When setting the scheduler, is it possible to set it on /dev/mdX, or is it only possible to set it on the underlying devices which compose the sw raid device (/dev/sda, /dev/sdb)? And does setting it on the underlying devices rather than mdX really affect how the data is accessed? Justin.
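For what it's worth, the elevator is a property of each physical disk's request queue, so it is normally set on the underlying sd devices; /dev/mdX does not expose a scheduler of its own. A dry-run sketch (device names are examples; it prints the commands rather than writing to sysfs):

```shell
# Set the deadline elevator on each member disk of the array.
for disk in sda sdb; do
    path="/sys/block/$disk/queue/scheduler"
    echo "echo deadline > $path"   # drop the outer echo to apply
done
```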
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Sat, 29 Dec 2007, dean gaudet wrote: On Tue, 25 Dec 2007, Bill Davidsen wrote: The issue I'm thinking about is hardware sector size, which on modern drives may be larger than 512b and therefore entail a read-alter-rewrite (RAR) cycle when writing a 512b block. i'm not sure any shipping SATA disks have larger than 512B sectors yet... do you know of any? (or is this thread about SCSI which i don't pay attention to...) on a brand new WDC WD7500AAKS-00RBA0 with this partition layout: 255 heads, 63 sectors/track, 91201 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes so sda1 starts at a non-multiple of 4096 into the disk. i ran some random seek+write experiments using http://arctic.org/~dean/randomio/, here are the results using 512 byte and 4096 byte writes (fsync after each write), 8 threads, on sda1:

# ./randomio /dev/sda1 8 1 1 512 10 6
  total |  read:   latency (ms)            |  write:  latency (ms)
   iops |  iops   min    avg   max   sdev  |  iops   min   avg    max   sdev
--------+----------------------------------+-----------------------------------
  148.5 |   0.0   inf    nan   0.0   nan   | 148.5   0.2  53.7   89.3   19.5
  129.2 |   0.0   inf    nan   0.0   nan   | 129.2  37.2  61.9   96.7    9.3
  131.2 |   0.0   inf    nan   0.0   nan   | 131.2  40.3  61.0   90.4    9.3
  132.0 |   0.0   inf    nan   0.0   nan   | 132.0  39.6  60.6   89.3    9.1
  130.7 |   0.0   inf    nan   0.0   nan   | 130.7  39.8  61.3   98.1    8.9
  131.4 |   0.0   inf    nan   0.0   nan   | 131.4  40.0  60.8  101.0    9.6

# ./randomio /dev/sda1 8 1 1 4096 10 6
  total |  read:   latency (ms)            |  write:  latency (ms)
   iops |  iops   min    avg   max   sdev  |  iops   min   avg    max   sdev
--------+----------------------------------+-----------------------------------
  141.7 |   0.0   inf    nan   0.0   nan   | 141.7   0.3  56.3   99.3   21.1
  132.4 |   0.0   inf    nan   0.0   nan   | 132.4  43.3  60.4   91.8    8.5
  131.6 |   0.0   inf    nan   0.0   nan   | 131.6  41.4  60.9  111.0    9.6
  131.8 |   0.0   inf    nan   0.0   nan   | 131.8  41.4  60.7   85.3    8.6
  130.6 |   0.0   inf    nan   0.0   nan   | 130.6  41.7  61.3   95.0    9.4
  131.4 |   0.0   inf    nan   0.0   nan   | 131.4  42.2  60.8   90.5    8.4

i think the anomalous results in the first 10s samples are perhaps the drive coming out of a standby state.
and here are the results aligned using the sda raw device itself:

# ./randomio /dev/sda 8 1 1 512 10 6
  total |  read:   latency (ms)            |  write:  latency (ms)
   iops |  iops   min    avg   max   sdev  |  iops   min   avg    max   sdev
--------+----------------------------------+-----------------------------------
  147.3 |   0.0   inf    nan   0.0   nan   | 147.3   0.3  54.1   93.7   20.1
  132.4 |   0.0   inf    nan   0.0   nan   | 132.4  37.4  60.6   91.8    9.2
  132.5 |   0.0   inf    nan   0.0   nan   | 132.5  37.7  60.3   93.7    9.3
  131.8 |   0.0   inf    nan   0.0   nan   | 131.8  39.4  60.7   92.7    9.0
  133.9 |   0.0   inf    nan   0.0   nan   | 133.9  41.7  59.8   90.7    8.5
  130.2 |   0.0   inf    nan   0.0   nan   | 130.2  40.8  61.5   88.6    8.9

# ./randomio /dev/sda 8 1 1 4096 10 6
  total |  read:   latency (ms)            |  write:  latency (ms)
   iops |  iops   min    avg   max   sdev  |  iops   min   avg    max   sdev
--------+----------------------------------+-----------------------------------
  145.4 |   0.0   inf    nan   0.0   nan   | 145.4   0.3  54.9   94.0   20.1
  130.3 |   0.0   inf    nan   0.0   nan   | 130.3  36.0  61.4   92.7    9.6
  130.6 |   0.0   inf    nan   0.0   nan   | 130.6  38.2  61.2   96.7    9.2
  132.1 |   0.0   inf    nan   0.0   nan   | 132.1  39.0  60.5   93.5    9.2
  131.8 |   0.0   inf    nan   0.0   nan   | 131.8  43.1  60.8   93.8    9.1
  129.0 |   0.0   inf    nan   0.0   nan   | 129.0  40.2  62.0   96.4    8.8

it looks pretty much the same to me... -dean Good to know / have it confirmed by someone else: the alignment does not matter with Linux SW RAID. Justin.
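The alignment arithmetic behind dean's test is easy to check: a partition starting at the classic sector 63 (from the 255-head/63-sector CHS layout) begins at a byte offset that is not a multiple of 4096. A quick shell sketch:

```shell
# Sector 63 * 512 bytes/sector = byte offset of the partition start.
start_sector=63
offset=$(( start_sector * 512 ))
if [ $(( offset % 4096 )) -eq 0 ]; then
    msg="4 KiB aligned"
else
    msg="not 4 KiB aligned"
fi
echo "sector $start_sector starts at byte $offset: $msg"
```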
Re: 2.6.24-rc6 reproducible raid5 hang
On Sat, 29 Dec 2007, dean gaudet wrote: On Sat, 29 Dec 2007, Dan Williams wrote: On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote: hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on the same 64k chunk array and had raised the stripe_cache_size to 1024... and got a hang. this time i grabbed stripe_cache_active before bumping the size again -- it was only 905 active. as i recall the bug we were debugging a year+ ago the active was at the size when it would hang. so this is probably something new. I believe I am seeing the same issue and am trying to track down whether XFS is doing something unexpected, i.e. I have not been able to reproduce the problem with EXT3. MD tries to increase throughput by letting some stripe work build up in batches. It looks like every time your system has hung it has been in the 'inactive_blocked' state i.e. 3/4 of stripes active. This state should automatically clear... cool, glad you can reproduce it :) i have a bit more data... i'm seeing the same problem on debian's 2.6.22-3-amd64 kernel, so it's not new in 2.6.24. i'm doing some more isolation but just grabbing kernels i have precompiled so far -- a 2.6.19.7 kernel doesn't show the problem, and early indications are a 2.6.21.7 kernel also doesn't have the problem but i'm giving it longer to show its head. i'll try a stock 2.6.22 next depending on how the 2.6.21 test goes, just so we get the debian patches out of the way. i was tempted to blame async api because it's newish :) but according to the dmesg output it doesn't appear the 2.6.22-3-amd64 kernel used async API, and it still hung, so async is probably not to blame. anyhow the test case i'm using is the dma_thrasher script i attached... it takes about an hour to give me confidence there's no problems so this will take a while. 
-dean Dean, Curious, btw: what filesystem size, raid type (5, but defaults I assume; nothing special, right? right-symmetric vs. left-symmetric, etc.), cache size, and chunk size(s) are you using/testing with? With the script you sent out earlier, are you able to reproduce it easily with 31 or so kernel tar decompressions? Justin.
Re: 2.6.24-rc6 reproducible raid5 hang
On Thu, 27 Dec 2007, dean gaudet wrote: hey neil -- remember that raid5 hang which me and only one or two others ever experienced and which was hard to reproduce? we were debugging it well over a year ago (that box has 400+ day uptime now so at least that long ago :) the workaround was to increase stripe_cache_size... i seem to have a way to reproduce something which looks much the same. setup:
- 2.6.24-rc6
- system has 8GiB RAM but no swap
- 8x750GB in a raid5 with one spare, chunksize 1024KiB
- mkfs.xfs default options
- mount -o noatime
- dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
that sequence hangs for me within 10 seconds... and i can unhang / rehang it by toggling between stripe_cache_size 256 and 1024. i detect the hang by watching iostat -kx /dev/sd? 5. i've attached the kernel log where i dumped task and timer state while it was hung... note that you'll see at some point i did an xfs mount with external journal but it happens with internal journal as well. looks like it's using the raid456 module and async api. anyhow let me know if you need more info / have any suggestions. -dean With a chunk size that large (1024 KiB), the stripe_cache_size needs to be greater than the default to handle it. Justin.
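A rough sizing note on that: the stripe cache is commonly described as holding stripe_cache_size pages (4 KiB each) per member device, so its memory cost grows with both the setting and the disk count. Sketched for dean's 8-disk setup (the formula is the usual rule of thumb from the md docs, stated here as an assumption):

```shell
# Approximate memory used by the raid5 stripe cache:
# stripe_cache_size * 4096 bytes * number of member disks.
nr_disks=8
stripe_cache_size=1024
bytes=$(( stripe_cache_size * 4096 * nr_disks ))
mib=$(( bytes / 1024 / 1024 ))
echo "stripe cache: $mib MiB"
```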
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Thu, 20 Dec 2007, Bill Davidsen wrote: Justin Piszcz wrote: On Wed, 19 Dec 2007, Bill Davidsen wrote: I'm going to try another approach, I'll describe it when I get results (or not). http://home.comcast.net/~jpiszcz/align_vs_noalign/ Hardly any difference whatsoever; only on the per-char read/write is it any faster. Am I misreading what you are doing here... you have the underlying data on the actual hardware devices 64k aligned by using either the whole device or starting a partition on a 64k boundary? I'm dubious that you will see a difference any other way, after all the translations take place. I'm trying to create a raid array using loop devices created with the offset parameter, but I suspect that I will wind up doing a test after just repartitioning the drives, painful as that will be. Average of 3 runs taken:

$ cat align/*log|grep ,
p63,8G,57683,94,86479,13,55242,8,63495,98,147647,11,434.8,0,16:10:16/64,1334210,10,330,2,120,1,3978,10,312,2
p63,8G,57973,95,76702,11,50830,7,62291,99,136477,10,388.3,0,16:10:16/64,1252548,6,296,1,115,1,7927,20,373,2
p63,8G,57758,95,80847,12,52144,8,63874,98,144747,11,443.4,0,16:10:16/64,1242445,6,303,1,117,1,6767,17,359,2
$ cat noalign/*log|grep ,
p63,8G,57641,94,85494,12,55669,8,63802,98,146925,11,434.8,0,16:10:16/64,1353180,8,314,1,117,1,8684,22,283,2
p63,8G,57705,94,85929,12,56708,8,63855,99,143437,11,436.2,0,16:10:16/64,12211519,29,297,1,113,1,3218,8,325,2
p63,8G,57783,94,78226,11,48580,7,63487,98,137721,10,438.7,0,16:10:16/64,1243229,8,307,1,120,1,4247,11,313,2

-- Bill Davidsen [EMAIL PROTECTED] Woe unto the statesman who makes war without a reason that will still be valid when the war is over... Otto von Bismarck

1. In the first test I made partitions on each drive like I normally do. 2. In the second test I followed the EMC document on how to properly align the partitions, and Microsoft's document on how to calculate the correct offset; I used 512 for a 256k stripe.

Justin.
Linux RAID Partition Offset 63 cylinders / 30% performance hit?
The (up to) 30% figure is mentioned here: http://insights.oetiker.ch/linux/raidoptimization.html On http://forums.storagereview.net/index.php?showtopic=25786 a user writes about the problem: XP, and virtually every O/S and partitioning software of XP's day, by default places the first partition on a disk at sector 63. Being an odd number, and 31.5KB into the drive, it isn't ever going to align with any stripe size. This is an unfortunate industry standard. Vista, on the other hand, aligns the first partition on sector 2048 by default as a by-product of its revisions to support large-sector-sized hard drives. As RAID5 arrays in write mode mimic the performance characteristics of large-sector-size hard drives, this comes as a great if not inadvertent benefit. 2048 is evenly divisible by 2 and 4 (allowing for 3 and 5 drive arrays optimally) and virtually every stripe size in common use. If you are however using a 4-drive RAID5, you're SOOL. Page 9 in this PDF (EMC_BestPractice_R22.pdf) shows the problem graphically: http://bbs.doit.com.cn/attachment.php?aid=6757

-- Now to my setup / question:

# fdisk -l /dev/sdc
Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a
Device Boot Start End Blocks Id System
/dev/sdc1 1 18241 146520801 fd Linux raid autodetect

--- If I use a 10-disk RAID5 with a 1024 KiB stripe, what would be the correct start and end size if I wanted to make sure the RAID5 was stripe aligned? Or is there a better way to do this; does parted handle this situation better? What is the best (and correct) way to calculate stripe alignment on the RAID5 device itself? --- The EMC paper recommends: Disk partition adjustment for Linux systems: In Linux, align the partition table before data is written to the LUN, as the partition map will be rewritten and all data on the LUN destroyed.
In the following example, the LUN is mapped to /dev/emcpowerah, and the LUN stripe element size is 128 blocks. Arguments for the fdisk utility are as follows:

fdisk /dev/emcpowerah
x   # expert mode
b   # adjust starting block number
1   # choose partition 1
128 # set it to 128, our stripe element size
w   # write the new partition

--- Does this also apply to Linux/SW RAID5? Or are there any caveats that are not taken into account since it is based in SW vs. HW? --- What it currently looks like:

Command (m for help): x
Expert command (m for help): p
Disk /dev/sdc: 255 heads, 63 sectors, 18241 cylinders
Nr AF Hd Sec Cyl Hd Sec Cyl Start Size ID
1 00 1 10 254 63 1023 63 293041602 fd
2 00 0 00 0 00 0 0 00
3 00 0 00 0 00 0 0 00
4 00 0 00 0 00 0 0 00

Justin.
Re: Raid over 48 disks
On Wed, 19 Dec 2007, Bill Davidsen wrote: Thiemo Nagel wrote: Performance of the raw device is fair:

# dd if=/dev/md2 of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s

Somewhat less through ext3 (created with -E stride=64):

# dd if=largetestfile of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s

Quite slow? 10 disks (raptors) raid 5 on regular sata controllers:

# dd if=/dev/md3 of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s
# dd if=bigfile of=/dev/zero bs=128k count=64k
3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s

Interesting. Any ideas what could be the reason? How much do you get from a single drive? -- The Samsung HD501LJ that I'm using gives ~84MB/s when reading from the beginning of the disk. With RAID 5 I'm getting slightly better results (though I really wonder why, since naively I would expect identical read performance), but that does only account for a small part of the difference:

              16k read          64k write
chunk size   RAID 5  RAID 6   RAID 5  RAID 6
128k            492     497      268     270
256k            615     530      288     270
512k            625     607      230     174
1024k           650     620      170      75

What is your stripe cache size?

# Set stripe_cache_size for RAID5.
echo Setting stripe_cache_size to 16 MiB for /dev/md3
echo 16384 > /sys/block/md3/md/stripe_cache_size

Justin.
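The echo above targets a single array; a small helper in the same spirit applies it to every md array present. This is only a sketch: the optional root argument is my addition (so the function can be exercised against a scratch directory), and real use would rely on the /sys/block layout shown in the post.

```shell
# set_stripe_cache SIZE [ROOT]
# Write SIZE into every array's stripe_cache_size file under ROOT
# (default /sys/block). Arrays without the file are skipped.
set_stripe_cache() {
  size=$1
  root=${2:-/sys/block}
  for f in "$root"/md*/md/stripe_cache_size; do
    [ -e "$f" ] && echo "$size" > "$f"
  done
}

# e.g. on a live system (as root): set_stripe_cache 16384
```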
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Wed, 19 Dec 2007, Mattias Wadenstein wrote: On Wed, 19 Dec 2007, Justin Piszcz wrote: -- Now to my setup / question:

# fdisk -l /dev/sdc
Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a
Device Boot Start End Blocks Id System
/dev/sdc1 1 18241 146520801 fd Linux raid autodetect

--- If I use a 10-disk RAID5 with a 1024 KiB stripe, what would be the correct start and end size if I wanted to make sure the RAID5 was stripe aligned? Or is there a better way to do this; does parted handle this situation better? From that setup it seems simple: scrap the partition table and use the disk device for raid. This is what we do for all data storage disks (hw raid) and sw raid members. /Mattias Wadenstein

Is there any downside to doing that? I remember when I had to take my machine apart for a BIOS downgrade; when I plugged in the sata devices again I did not plug them back in the same order. Everything worked, of course, but when I ran LILO it said the disk was not part of the RAID set (because /dev/sda had become /dev/sdg) and it overwrote the MBR on the disk. If I had not used partitions here, I'd have lost one (or more) of the drives due to a bad LILO run?

Justin.
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Wed, 19 Dec 2007, Jon Nelson wrote: On 12/19/07, Justin Piszcz [EMAIL PROTECTED] wrote: On Wed, 19 Dec 2007, Mattias Wadenstein wrote: From that setup it seems simple, scrap the partition table and use the disk device for raid. This is what we do for all data storage disks (hw raid) and sw raid members. /Mattias Wadenstein Is there any downside to doing that? I remember when I had to take my There is one (just pointed out to me yesterday): having the partition and having it labeled as raid makes identification quite a bit easier for humans and software, too. -- Jon Some nice graphs found here: http://sqlblog.com/blogs/linchi_shea/archive/2007/02/01/performance-impact-of-disk-misalignment.aspx - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Wed, 19 Dec 2007, Bill Davidsen wrote: Justin Piszcz wrote: On Wed, 19 Dec 2007, Mattias Wadenstein wrote: [snip] From that setup it seems simple: scrap the partition table and use the disk device for raid. This is what we do for all data storage disks (hw raid) and sw raid members. /Mattias Wadenstein Is there any downside to doing that? I remember when I had to take my machine apart for a BIOS downgrade; when I plugged in the sata devices again I did not plug them back in the same order. Everything worked, of course, but when I ran LILO it said the disk was not part of the RAID set (because /dev/sda had become /dev/sdg) and it overwrote the MBR on the disk. If I had not used partitions here, I'd have lost one (or more) of the drives due to a bad LILO run? As other posts have detailed, putting the partition on a 64k aligned boundary can address the performance problems. However, a poor choice of chunk size, cache_buffer size, or just random i/o in small sizes can eat up a lot of the benefit. I don't think you need to give up your partitions to get the benefit of alignment. -- Bill Davidsen [EMAIL PROTECTED] Woe unto the statesman who makes war without a reason that will still be valid when the war is over... Otto von Bismarck

Hrmm.. I am doing a benchmark now with: 6 x 400GB (SATA) / 256 KiB stripe, with unaligned vs. aligned raid setup. unaligned: just fdisk /dev/sdc, make a partition, type fd raid.
aligned: fdisk, expert mode, start at 512 as the offset. Per a Microsoft KB, example alignment calculations in kilobytes for a 256-KB stripe unit size:

(63 * .5) / 256 = 0.123046875
(64 * .5) / 256 = 0.125
(128 * .5) / 256 = 0.25
(256 * .5) / 256 = 0.5
(512 * .5) / 256 = 1

These examples show that the partition is not aligned correctly for a 256-KB stripe unit size until the partition is created by using an offset of 512 sectors (512 bytes per sector). So I should start at 512 for a 256k chunk size. I ran bonnie++ three consecutive times and took the average for the unaligned case; I am rebuilding the RAID5 now, and then I will re-execute the test 3 additional times and take the average of that.

Justin.
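The KB arithmetic above boils down to checking whether the partition's byte offset is a whole multiple of the chunk size. A minimal sketch of that check (is_aligned is a made-up helper for illustration, not an fdisk feature):

```shell
# is_aligned START_SECTORS CHUNK_KIB
# A partition starting at START_SECTORS (512-byte sectors) is
# chunk-aligned when start*512 bytes is a whole multiple of the
# chunk size in bytes.
is_aligned() {
  if [ $(( $1 * 512 % ($2 * 1024) )) -eq 0 ]; then
    echo aligned
  else
    echo misaligned
  fi
}

is_aligned 63 256    # the old DOS default start: 31.5 KiB in, misaligned
is_aligned 512 256   # start at sector 512: exactly one 256 KiB chunk
is_aligned 2048 256  # the Vista-style default also passes
```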
Re: help diagnosing bad disk
On Wed, 19 Dec 2007, Jon Sabo wrote: So I was trying to copy over some Indiana Jones wav files and it wasn't going my way. I noticed that my software raid device showed: /dev/md1 on / type ext3 (rw,errors=remount-ro) Is this saying that it was remounted read-only because it found a problem with the md1 meta device? That's what it looks like it's saying, but I can still write to /. mdadm --detail showed:

[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:14 2007
     Raid Level : raid1
     Array Size : 1951744 (1906.32 MiB 1998.59 MB)
    Device Size : 1951744 (1906.32 MiB 1998.59 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Wed Dec 19 12:59:56 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
           UUID : 157f716c:0e7aebca:c20741f6:bb6099c9
         Events : 0.28

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        1      removed

[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:47 2007
     Raid Level : raid1
     Array Size : 974808064 (929.65 GiB 998.20 GB)
    Device Size : 974808064 (929.65 GiB 998.20 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent
    Update Time : Wed Dec 19 13:14:53 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
           UUID : 156a030e:9a6f8eb3:9b0c439e:d718e744
         Events : 0.1990

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       0        0        1      removed

I have two 1 terabyte sata drives in this box. From what I was reading, wouldn't it show an F for the failed drive? I thought I would see that /dev/sdb1 and /dev/sdb2 were failed and it would show an F. What is this saying, and how do you know that it's /dev/sdb and not some other drive? It shows removed and that the state is clean, degraded.
Is that something you can recover from without returning this disk and putting in a new one to add to the raid1 array?

mdadm /dev/md1 -a /dev/sdb2

to re-add it back into the array. What does cat /proc/mdstat show? I would also like to see: smartctl -a /dev/sdb

Justin.
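For scripting or monitoring, the degraded state shown above can be pulled straight out of `mdadm --detail` output. A sketch that parses a saved copy of the output (the file argument is only so the helper can be tried without a real array):

```shell
# array_state FILE
# Print the "State :" field from saved `mdadm --detail` output,
# e.g. "clean" or "clean, degraded".
array_state() {
  sed -n 's/^[[:space:]]*State :[[:space:]]*//p' "$1"
}

# e.g.: mdadm --detail /dev/md1 > /tmp/md1.txt && array_state /tmp/md1.txt
```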
Re: help diagnosing bad disk
On Wed, 19 Dec 2007, Jon Sabo wrote: I found the problem. The power was unplugged from the drive. The sata power connectors aren't very good at securing the connector. I reattached the power connector to the sata drive and booted up. This is what it looks like now:

[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:14 2007
     Raid Level : raid1
     Array Size : 1951744 (1906.32 MiB 1998.59 MB)
    Device Size : 1951744 (1906.32 MiB 1998.59 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Wed Dec 19 13:48:12 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
           UUID : 157f716c:0e7aebca:c20741f6:bb6099c9
         Events : 0.44

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        1      removed

[EMAIL PROTECTED]:/home/illsci# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Mon Jul 30 21:47:47 2007
     Raid Level : raid1
     Array Size : 974808064 (929.65 GiB 998.20 GB)
    Device Size : 974808064 (929.65 GiB 998.20 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent
    Update Time : Wed Dec 19 13:50:02 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
           UUID : 156a030e:9a6f8eb3:9b0c439e:d718e744
         Events : 0.1498340

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       18        1      active sync   /dev/sdb2

How do I put it back into the correct state? Thanks!

mdadm /dev/md0 -a /dev/sdb1
mdadm /dev/md1 -a /dev/sda2

Weird that they got out of sync on different drives.

Justin.
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Wed, 19 Dec 2007, Bill Davidsen wrote: Justin Piszcz wrote: On Wed, 19 Dec 2007, Bill Davidsen wrote: [snip] As other posts have detailed, putting the partition on a 64k aligned boundary can address the performance problems. However, a poor choice of chunk size, cache_buffer size, or just random i/o in small sizes can eat up a lot of the benefit. I don't think you need to give up your partitions to get the benefit of alignment. -- Bill Davidsen [EMAIL PROTECTED] Woe unto the statesman who makes war without a reason that will still be valid when the war is over... Otto von Bismarck

Hrmm.. I am doing a benchmark now with: 6 x 400GB (SATA) / 256 KiB stripe, with unaligned vs. aligned raid setup.
unaligned: just fdisk /dev/sdc, make a partition, type fd raid. aligned: fdisk, expert mode, start at 512 as the offset. Per a Microsoft KB, example alignment calculations in kilobytes for a 256-KB stripe unit size:

(63 * .5) / 256 = 0.123046875
(64 * .5) / 256 = 0.125
(128 * .5) / 256 = 0.25
(256 * .5) / 256 = 0.5
(512 * .5) / 256 = 1

These examples show that the partition is not aligned correctly for a 256-KB stripe unit size until the partition is created by using an offset of 512 sectors (512 bytes per sector). So I should start at 512 for a 256k chunk size. I ran bonnie++ three consecutive times and took the average for the unaligned case; I am rebuilding the RAID5 now, and then I will re-execute the test 3 additional times and take the average of that. I'm going to try another approach, I'll describe it when I get results (or not). Waiting for the raid to rebuild, then I will re-run thereafter:

[=...] recovery = 86.7% (339104640/390708480) finish=30.8min speed=27835K/sec
...
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Wed, 19 Dec 2007, Bill Davidsen wrote: I'm going to try another approach, I'll describe it when I get results (or not). http://home.comcast.net/~jpiszcz/align_vs_noalign/ Hardly any difference whatsoever; only on the per-char read/write is it any faster. Average of 3 runs taken:

$ cat align/*log|grep ,
p63,8G,57683,94,86479,13,55242,8,63495,98,147647,11,434.8,0,16:10:16/64,1334210,10,330,2,120,1,3978,10,312,2
p63,8G,57973,95,76702,11,50830,7,62291,99,136477,10,388.3,0,16:10:16/64,1252548,6,296,1,115,1,7927,20,373,2
p63,8G,57758,95,80847,12,52144,8,63874,98,144747,11,443.4,0,16:10:16/64,1242445,6,303,1,117,1,6767,17,359,2
$ cat noalign/*log|grep ,
p63,8G,57641,94,85494,12,55669,8,63802,98,146925,11,434.8,0,16:10:16/64,1353180,8,314,1,117,1,8684,22,283,2
p63,8G,57705,94,85929,12,56708,8,63855,99,143437,11,436.2,0,16:10:16/64,12211519,29,297,1,113,1,3218,8,325,2
p63,8G,57783,94,78226,11,48580,7,63487,98,137721,10,438.7,0,16:10:16/64,1243229,8,307,1,120,1,4247,11,313,2
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Wed, 19 Dec 2007, Robin Hill wrote: On Wed Dec 19, 2007 at 09:50:16AM -0500, Justin Piszcz wrote: The (up to) 30% figure is mentioned here: http://insights.oetiker.ch/linux/raidoptimization.html That looks to be referring to partitioning a RAID device - this'll only apply to hardware RAID or partitionable software RAID, not to the normal use case. When you're creating an array out of standard partitions then you know the array stripe size will align with the disks (there's no way it cannot), and you can set the filesystem stripe size to align as well (XFS will do this automatically). I've actually done tests on this with hardware RAID to try to find the correct partition offset, but wasn't able to see any difference (using bonnie++ and moving the partition start by one sector at a time).

# fdisk -l /dev/sdc
Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a
Device Boot Start End Blocks Id System
/dev/sdc1 1 18241 146520801 fd Linux raid autodetect

This looks to be a normal disk - the partition offsets shouldn't be relevant here (barring any knowledge of the actual physical disk layout anyway, and block remapping may well make that rather irrelevant). That's my take on this one anyway. Cheers, Robin -- Robin Hill [EMAIL PROTECTED]

Interesting; yes, I am using XFS as well. Thanks for the response.
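As Robin notes, XFS picks up the stripe geometry automatically on md devices. When creating the filesystem by hand, the equivalent is the su/sw options to mkfs.xfs: su is the chunk size, sw the number of data-bearing disks. A small sketch of deriving them for a RAID5 (the helper name is made up for illustration; the one-parity-disk-per-stripe assumption is what makes sw = n - 1):

```shell
# xfs_stripe_opts CHUNK_KIB NDISKS
# Build the mkfs.xfs -d geometry string for an n-disk raid5,
# where one disk's worth of each stripe holds parity.
xfs_stripe_opts() {
  echo "su=${1}k,sw=$(( $2 - 1 ))"
}

xfs_stripe_opts 256 10   # 10-disk raid5, 256 KiB chunk
# e.g.: mkfs.xfs -d "$(xfs_stripe_opts 256 10)" /dev/md3
```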
Re: Raid over 48 disks
On Tue, 18 Dec 2007, Norman Elton wrote: We're investigating the possibility of running Linux (RHEL) on top of Sun's X4500 Thumper box: http://www.sun.com/servers/x64/x4500/ Basically, it's a server with 48 SATA hard drives. No hardware RAID. It's designed for Sun's ZFS filesystem. So... we're curious how Linux will handle such a beast. Has anyone run MD software RAID over so many disks? Then piled LVM/ext3 on top of that? Any suggestions? Are we crazy to think this is even possible? Thanks! Norman Elton

It sounds VERY fun and exciting if you ask me! The most disks I've used when testing SW RAID was 10, with various raid settings. With that many drives you'd want RAID6 or RAID10 for sure, in case more than one fails at the same time, and definitely XFS/JFS/EXT4(?), as EXT3 is capped to 8TB. I'd be curious what kind of aggregate bandwidth you can get off of it with that many drives.

Justin.
Re: Raid over 48 disks
On Tue, 18 Dec 2007, Thiemo Nagel wrote: Dear Norman, So... we're curious how Linux will handle such a beast. Has anyone run MD software RAID over so many disks? Then piled LVM/ext3 on top of that? Any suggestions? Are we crazy to think this is even possible? I'm running 22x 500GB disks attached to RocketRaid2340 and NFORCE-MCP55 onboard controllers on an Athlon DC 5000+ with 1GB RAM: 9746150400 blocks super 1.2 level 6, 256k chunk, algorithm 2 [22/22] Performance of the raw device is fair: # dd if=/dev/md2 of=/dev/zero bs=128k count=64k 65536+0 records in 65536+0 records out 8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s Somewhat less through ext3 (created with -E stride=64): # dd if=largetestfile of=/dev/zero bs=128k count=64k 65536+0 records in 65536+0 records out 8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s There were no problems up to now. (mkfs.ext3 wants -F to create a filesystem larger than 8TB. The hard maximum is 16TB, so you will need to create partitions, if your drives are larger than 350GB...) Kind regards, Thiemo Nagel - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Quite slow? 10 disks (raptors) raid 5 on regular sata controllers: # dd if=/dev/md3 of=/dev/zero bs=128k count=64k 65536+0 records in 65536+0 records out 8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s # dd if=bigfile of=/dev/zero bs=128k count=64k 27773+1 records in 27773+1 records out 3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Raid over 48 disks
On Tue, 18 Dec 2007, Thiemo Nagel wrote: Performance of the raw device is fair:

# dd if=/dev/md2 of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s

Somewhat less through ext3 (created with -E stride=64):

# dd if=largetestfile of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s

Quite slow? 10 disks (raptors) raid 5 on regular sata controllers:

# dd if=/dev/md3 of=/dev/zero bs=128k count=64k
8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s
# dd if=bigfile of=/dev/zero bs=128k count=64k
3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s

Interesting. Any ideas what could be the reason? How much do you get from a single drive? -- The Samsung HD501LJ that I'm using gives ~84MB/s when reading from the beginning of the disk. With RAID 5 I'm getting slightly better results (though I really wonder why, since naively I would expect identical read performance), but that does only account for a small part of the difference:

              16k read          64k write
chunk size   RAID 5  RAID 6   RAID 5  RAID 6
128k            492     497      268     270
256k            615     530      288     270
512k            625     607      230     174
1024k           650     620      170      75

Kind regards, Thiemo

# dd if=/dev/sdc of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 13.8108 seconds, 77.7 MB/s

With more than 2x the drives I'd think you'd have faster speed; perhaps the controller is the problem? I am using ICH8R (but the raid within linux) and 2-port SATA cards, each of which has its own dedicated bandwidth via the PCI-e bus. I have also tried 10 disks with sw RAID5 on 3ware controllers exporting the disks as JBOD; I saw similar performance for reads but not writes.

Justin.
Re: Raid over 48 disks
On Tue, 18 Dec 2007, Jon Nelson wrote: On 12/18/07, Thiemo Nagel [EMAIL PROTECTED] wrote: [snip] It strikes me that these numbers are meaningless without knowing if that is actual data-to-disk or data-to-memcache-and-some-to-disk-too. Later versions of 'dd' offer 'conv=fdatasync', which is really handy (it calls fdatasync on the output file, syncing JUST the one file, right before close). Otherwise, oflag=direct will (try to) bypass the page/block cache. I can get really impressive numbers, too (over 200MB/s on a single disk capable of 70MB/s) when I (mis)use dd without fdatasync, et al. The variation in reported performance can be really huge without understanding that you aren't actually testing the DISK I/O but *some* disk I/O and *some* memory caching.
Ok-- How's this for caching, a dd over the entire RAID device:

$ /usr/bin/time dd if=/dev/zero of=file bs=1M
dd: writing `file': No space left on device
1070704+0 records in
1070703+0 records out
1122713473024 bytes (1.1 TB) copied, 2565.89 seconds, 438 MB/s
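The effect Jon describes is easy to demonstrate without touching a raid array. A sketch that writes a small file with conv=fdatasync so dd's reported rate includes flushing to disk (GNU dd assumed; without the conv option the rate largely measures the page cache):

```shell
# Write 64 MiB through the page cache, then force it to stable
# storage before dd computes its transfer rate.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1M count=64 conv=fdatasync 2>/dev/null
wc -c < "$f"   # bytes actually written
rm -f "$f"
```

Filling the device, as in the run above, is the other honest approach: once the file dwarfs RAM, the cache can no longer hide the disk.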
RE: Raid over 48 disks
On Tue, 18 Dec 2007, Guy Watkins wrote: } -Original Message- } From: [EMAIL PROTECTED] [mailto:linux-raid- } [EMAIL PROTECTED] On Behalf Of Brendan Conoboy } Sent: Tuesday, December 18, 2007 3:36 PM } To: Norman Elton } Cc: linux-raid@vger.kernel.org } Subject: Re: Raid over 48 disks } } Norman Elton wrote: } We're investigating the possibility of running Linux (RHEL) on top of } Sun's X4500 Thumper box: } } http://www.sun.com/servers/x64/x4500/ } } Neat- 6 8-port SATA controllers! It'll be worth checking to be sure } each controller has equal bandwidth. If some controllers are on slower } buses than others you may want to consider that and balance the md } device layout.

Assuming the 6 controllers are equal, I would make 3 16-disk RAID6 arrays using 2 disks from each controller. That way any 1 controller can fail and your system will still be running. 6 disks will be used for redundancy. Or 6 8-disk RAID6 arrays using 1 disk from each controller. That way any 2 controllers can fail and your system will still be running. 12 disks will be used for redundancy. Might be too excessive! Combine them into a RAID0 array. Guy

I'd be curious what the maximum aggregate bandwidth would be with RAID 0 of 48 disks on that controller..
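Guy's two layouts trade capacity for failure tolerance, and the parity overhead he quotes follows from RAID6 spending two disks per array. A throwaway check of his numbers (the helper is mine, for illustration only):

```shell
# raid6_parity_disks ARRAYS
# Disks spent on parity when the 48 drives are split into ARRAYS
# raid6 sets: two parity disks per raid6 array.
raid6_parity_disks() {
  echo $(( $1 * 2 ))
}

raid6_parity_disks 3   # 3 x 16-disk raid6: 6 parity disks
raid6_parity_disks 6   # 6 x 8-disk raid6: 12 parity disks
```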
RE: Raid over 48 disks
On Tue, 18 Dec 2007, Justin Piszcz wrote: On Tue, 18 Dec 2007, Guy Watkins wrote: [snip] I'd be curious what the maximum aggregate bandwidth would be with RAID 0 of 48 disks on that controller..

A RAID 0 over all of the controllers, rather, if possible..
Re: optimal IO scheduler choice?
On Thu, 13 Dec 2007, Louis-David Mitterrand wrote: Hi, after reading some interesting suggestions on kernel tuning at: http://hep.kbfi.ee/index.php/IT/KernelTuning I am wondering whether 'deadline' is indeed the best IO scheduler (vs. anticipatory and cfq) for a soft raid5/6 partition on a server? What is the common wisdom on the subject among linux-raid users and developers? Thanks, I have found anticipatory to be the fastest. http://home.comcast.net/~jpiszcz/sched/cfq_vs_as_vs_deadline_vs_noop.html Sequential output with CFQ (horrid): 311,683 KiB/s. Output with AS: 443,103 KiB/s. For input (reads), CFQ is a little faster. It depends on your workload, I suppose. Justin.
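For reference, the scheduler can be inspected and switched per device at runtime through sysfs, so benchmarking each one needs no reboot. A minimal sketch; `SYSFS_ROOT` is parameterized here only so it can be exercised against a scratch directory, on a real box it is simply `/sys`:

```shell
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# The scheduler file lists every available elevator with the active one
# in brackets, e.g. "noop anticipatory [deadline] cfq".
current_scheduler() {
    sed 's/.*\[\(.*\)\].*/\1/' "$SYSFS_ROOT/block/$1/queue/scheduler"
}

# Switching is a plain write, e.g. set_scheduler sdc anticipatory
set_scheduler() {
    echo "$2" > "$SYSFS_ROOT/block/$1/queue/scheduler"
}
```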
Re: Reading takes 100% precedence over writes for mdadm+raid5?
On Thu, 6 Dec 2007, David Rees wrote: On Dec 6, 2007 1:06 AM, Justin Piszcz [EMAIL PROTECTED] wrote: On Wed, 5 Dec 2007, Jon Nelson wrote: I saw something really similar while moving some very large (300MB to 4GB) files. I was really surprised to see actual disk I/O (as measured by dstat) be really horrible. Any work-arounds, or just don't perform heavy reads the same time as writes? What kernel are you using? (Did I miss it in your OP?) The per-device write throttling in 2.6.24 should help significantly, have you tried the latest -rc and compared to your current kernel? -Dave 2.6.23.9-- thanks will try out the latest -rc or wait for 2.6.24! Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
On Thu, 6 Dec 2007, Andrew Morton wrote: On Sat, 1 Dec 2007 06:26:08 -0500 (EST) Justin Piszcz [EMAIL PROTECTED] wrote: I am putting a new machine together and I have dual raptor raid 1 for the root, which works just fine under all stress tests. Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on sale now adays): I ran the following: dd if=/dev/zero of=/dev/sdc dd if=/dev/zero of=/dev/sdd dd if=/dev/zero of=/dev/sde (as it is always a very good idea to do this with any new disk) And sometime along the way(?) (i had gone to sleep and let it run), this occurred: [42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x401 action 0x2 frozen Gee we're seeing a lot of these lately. [42880.680231] ata3.00: irq_stat 0x00400040, connection status changed [42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in [42880.680292] res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10 (ATA bus error) [42881.841899] ata3: soft resetting port [42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [42915.919042] ata3.00: qc timeout (cmd 0xec) [42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5) [42915.919149] ata3.00: revalidation failed (errno=-5) [42915.919206] ata3: failed to recover some devices, retrying in 5 secs [42920.912458] ata3: hard resetting port [42926.411363] ata3: port is slow to respond, please be patient (Status 0x80) [42930.943080] ata3: COMRESET failed (errno=-16) [42930.943130] ata3: hard resetting port [42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [42931.413523] ata3.00: configured for UDMA/133 [42931.413586] ata3: EH pending after completion, repeating EH (cnt=4) [42931.413655] ata3: EH complete [42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors (750156 MB) [42931.413809] sd 2:0:0:0: [sdc] Write Protect is off [42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00 [42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: 
enabled, doesn't support DPO or FUA Usually when I see this sort of thing with another box I have full of raptors, it was due to a bad raptor and I never saw it again after I replaced the disk that it happened on, but that was using the Intel P965 chipset. For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge). I am going to do some further testing but does this indicate a bad drive? Bad cable? Bad connector? As you can see above, /dev/sdc stopped responding for a little bit and then the kernel reset the port. Why is this though? What is the likely root cause? Should I replace the drive? Obviously this is not normal and cannot be good at all, the idea is to put these drives in a RAID5 and if one is going to timeout that is going to cause the array to go degraded and thus be worthless in a raid5 configuration. Can anyone offer any insight here? It would be interesting to try 2.6.21 or 2.6.22. This was due to NCQ issues (disabling it fixed the problem). Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
On Sat, 1 Dec 2007, Justin Piszcz wrote: On Sat, 1 Dec 2007, Janek Kozicki wrote: Justin Piszcz said: (by the date of Sat, 1 Dec 2007 07:23:41 -0500 (EST)) dd if=/dev/zero of=/dev/sdc The purpose is with any new disk its good to write to all the blocks and let the drive to all of the re-mapping before you put 'real' data on it. Let it crap out or fail before I put my data on it. better use badblocks. It writes data, then reads it afterwards: In this example the data is semi random (quicker than /dev/urandom ;) badblocks -c 10240 -s -w -t random -v /dev/sdc -- Janek Kozicki | - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Will give this a shot and see if I can reproduce the error, thanks. The badblocks did not do anything; however, when I built a software raid 5 and the performed a dd: /usr/bin/time dd if=/dev/zero of=fill_disk bs=1M I saw this somewhere along the way: [30189.967531] RAID5 conf printout: [30189.967576] --- rd:3 wd:3 [30189.967617] disk 0, o:1, dev:sdc1 [30189.967660] disk 1, o:1, dev:sdd1 [30189.967716] disk 2, o:1, dev:sde1 [42332.936615] ata5.00: exception Emask 0x2 SAct 0x7000 SErr 0x0 action 0x2 frozen [42332.936706] ata5.00: spurious completions during NCQ issue=0x0 SAct=0x7000 FIS=004040a1:0800 [42332.936804] ata5.00: cmd 61/08:60:6f:4d:2a/00:00:27:00:00/40 tag 12 cdb 0x0 data 4096 out [42332.936805] res 40/00:74:0f:49:2a/00:00:27:00:00/40 Emask 0x2 (HSM violation) [42332.936977] ata5.00: cmd 61/08:68:77:4d:2a/00:00:27:00:00/40 tag 13 cdb 0x0 data 4096 out [42332.936981] res 40/00:74:0f:49:2a/00:00:27:00:00/40 Emask 0x2 (HSM violation) [42332.937162] ata5.00: cmd 61/00:70:0f:49:2a/04:00:27:00:00/40 tag 14 cdb 0x0 data 524288 out [42332.937163] res 40/00:74:0f:49:2a/00:00:27:00:00/40 Emask 0x2 (HSM violation) [42333.240054] ata5: soft resetting port [42333.494462] ata5: SATA link up 3.0 Gbps (SStatus 123 
SControl 300) [42333.506592] ata5.00: configured for UDMA/133 [42333.506652] ata5: EH complete [42333.506741] sd 4:0:0:0: [sde] 1465149168 512-byte hardware sectors (750156 MB) [42333.506834] sd 4:0:0:0: [sde] Write Protect is off [42333.506887] sd 4:0:0:0: [sde] Mode Sense: 00 3a 00 00 [42333.506905] sd 4:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Next test, I will turn off NCQ and try to make the problem re-occur. If anyone else has any thoughts here..? I ran long smart tests on all 3 disks, they all ran successfully. Perhaps these drives need to be NCQ BLACKLISTED with the P35 chipset? Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Spontaneous rebuild
On Sun, 2 Dec 2007, Oliver Martin wrote: [Please CC me on replies as I'm not subscribed] Hello! I've been experimenting with software RAID a bit lately, using two external 500GB drives. One is connected via USB, one via Firewire. It is set up as a RAID5 with LVM on top so that I can easily add more drives when I run out of space. About a day after the initial setup, things went belly up. First, EXT3 reported strange errors: EXT3-fs error (device dm-0): ext3_new_block: Allocating block in system zone - blocks from 106561536, length 1 EXT3-fs error (device dm-0): ext3_new_block: Allocating block in system zone - blocks from 106561537, length 1 ... There were literally hundreds of these, and they came back immediately when I reformatted the array. So I tried ReiserFS, which worked fine for about a day. Then I got errors like these: ReiserFS: warning: is_tree_node: node level 0 does not match to the expected one 2 ReiserFS: dm-0: warning: vs-5150: search_by_key: invalid format found in block 69839092. Fsck? ReiserFS: dm-0: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [6 10 0x0 SD] Again, hundreds. So I ran badblocks on the LVM volume, and it reported some bad blocks near the end. Running badblocks on the md array worked, so I recreated the LVM stuff and attributed the failures to undervolting experiments I had been doing (this is my old laptop running as a server). Anyway, the problems are back: To test my theory that everything is alright with the CPU running within its specs, I removed one of the drives while copying some large files yesterday. Initially, everything seemed to work out nicely, and by the morning, the rebuild had finished. Again, I unmounted the filesystem and ran badblocks -svn on the LVM. 
It ran without gripes for some hours, but just now I saw md had started to rebuild the array again out of the blue: Dec 1 20:04:49 quassel kernel: usb 4-5.2: reset high speed USB device using ehci_hcd and address 4 Dec 2 01:06:02 quassel kernel: md: data-check of RAID array md0 Dec 2 01:06:02 quassel kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk. Dec 2 01:06:02 quassel kernel: md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for data-check. Dec 2 01:06:02 quassel kernel: md: using 128k window, over a total of 488383936 blocks. Dec 2 03:57:24 quassel kernel: usb 4-5.2: reset high speed USB device using ehci_hcd and address 4 I'm not sure the USB resets are related to the problem - device 4-5.2 is part of the array, but I get these sometimes at random intervals and they don't seem to hurt normally. Besides, the first one was long before the rebuild started, and the second one long afterwards. Any ideas why md is rebuilding the array? And could this be related to the bad blocks problem I had first? badblocks is still running, I'll post an update when it is finished. In the meantime, mdadm --detail /dev/md0 and mdadm --examine /dev/sd[bc]1 don't give me any clues as to what went wrong, both disks are marked as active sync, and the whole array is active, recovering. Before I forget, I'm running 2.6.23.1 with this config: http://stud4.tuwien.ac.at/~e0626486/config-2.6.23.1-hrt3-fw Thanks, Oliver - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html It rebuilds the array because 'something' is causing device resets/timeouts on your USB device: Dec 1 20:04:49 quassel kernel: usb 4-5.2: reset high speed USB device using ehci_hcd and address 4 Naturally, when it is reset, the device is disconnected and then re-appears, when MD see's this it rebuilds the array. 
Why it is timing out/resetting the device is what you need to find out. Justin.
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
On Sun, 2 Dec 2007, Janek Kozicki wrote: Justin Piszcz said: (by the date of Sun, 2 Dec 2007 04:11:59 -0500 (EST)) The badblocks did not do anything; however, when I built a software raid 5 and the performed a dd: /usr/bin/time dd if=/dev/zero of=fill_disk bs=1M I saw this somewhere along the way: [42332.936706] ata5.00: spurious completions during NCQ issue=0x0 SAct=0x7000 FIS=004040a1:0800 [42333.240054] ata5: soft resetting port I know nothing about NCQ ;) But I find it interesting that *slower* access worked fine while *fast* access didn't. If I understand you correctly: - badblocks is slower, and you said that it worked flawlessly, right? - getting from /dev/zero is the fastest thing you can do, and it fails... I'd check jumpers on HDD and if there is any, set it to 1.5 Gb speed instead of default 3.0 Gb. Or sth. along that way. I remember seeing such jumper on one of my HDDs (I don't remember the exact speed numbers though). Also on one forum I remember about problems occurring when HDD was working at maximum speed, which was faster than the IO controller could handle. I dunno. It's just what came to my mind... -- Janek Kozicki | - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Thanks for the suggestions, but BTW NCQ OFF on (raptors anyway) is 30 to 50 megabytes per second faster in a RAID 5 configuration. NCQ slows things down for those disks. There are no jumpers (by default) on the 750GB WD Caviar's btw.. So far with NCQ off I've been pounding the disks and have not been able to reproduce the error but with NCQ on and some DD's or some raid creations, it is reproducible (or appears to be)-- did it twice. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
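For anyone wanting to repeat the NCQ-off test: on libata, NCQ can be effectively disabled per drive by dropping its queue depth to 1, with no kernel rebuild. A sketch, with `SYSFS_ROOT` parameterized only so the function can be tried against a scratch tree instead of a live `/sys`:

```shell
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# A queue_depth of 1 allows only one outstanding command, i.e. NCQ is
# effectively off; restore the old value (commonly 31) to re-enable it.
set_queue_depth() {
    echo "$2" > "$SYSFS_ROOT/block/$1/device/queue_depth"
}

# e.g.  set_queue_depth sdc 1    # NCQ off
#       set_queue_depth sdc 31   # NCQ back on
```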
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
On Mon, 3 Dec 2007, Michael Tokarev wrote: Justin Piszcz said: (by the date of Sun, 2 Dec 2007 04:11:59 -0500 (EST)) The badblocks did not do anything; however, when I built a software raid 5 and then performed a dd: /usr/bin/time dd if=/dev/zero of=fill_disk bs=1M I saw this somewhere along the way: [42332.936706] ata5.00: spurious completions during NCQ issue=0x0 SAct=0x7000 FIS=004040a1:0800 [42333.240054] ata5: soft resetting port There's some (probably timing-related) bug with spurious completions during NCQ. A lot of people are seeing this same effect with different drives and controllers. Tejun is working on it. It's difficult to reproduce. Search for spurious completion - there are many hits... /mjt Thanks, will check it out.
Reading takes 100% precedence over writes for mdadm+raid5?
root 2206 1 4 Dec02 ?00:10:37 dd if /dev/zero of 1.out bs 1M root 2207 1 4 Dec02 ?00:10:38 dd if /dev/zero of 2.out bs 1M root 2208 1 4 Dec02 ?00:10:35 dd if /dev/zero of 3.out bs 1M root 2209 1 4 Dec02 ?00:10:45 dd if /dev/zero of 4.out bs 1M root 2210 1 4 Dec02 ?00:10:35 dd if /dev/zero of 5.out bs 1M root 2211 1 4 Dec02 ?00:10:35 dd if /dev/zero of 6.out bs 1M root 2212 1 4 Dec02 ?00:10:30 dd if /dev/zero of 7.out bs 1M root 2213 1 4 Dec02 ?00:10:42 dd if /dev/zero of 8.out bs 1M root 2214 1 4 Dec02 ?00:10:35 dd if /dev/zero of 9.out bs 1M root 2215 1 4 Dec02 ?00:10:37 dd if /dev/zero of 10.out bs 1M root 3080 24.6 0.0 10356 1672 ?D01:22 5:51 dd if /dev/md3 of /dev/null bs 1M Was curious if when running 10 DD's (which are writing to the RAID 5) fine, no issues, suddenly all go into D-state and let the read/give it 100% priority? Is this normal? # du -sb . ; sleep 300; du -sb . 1115590287487 . 1115590287487 . Here my my raid5 config: # mdadm -D /dev/md3 /dev/md3: Version : 00.90.03 Creation Time : Sun Dec 2 12:15:20 2007 Raid Level : raid5 Array Size : 1465143296 (1397.27 GiB 1500.31 GB) Used Dev Size : 732571648 (698.63 GiB 750.15 GB) Raid Devices : 3 Total Devices : 3 Preferred Minor : 3 Persistence : Superblock is persistent Update Time : Sun Dec 2 22:00:54 2007 State : active Active Devices : 3 Working Devices : 3 Failed Devices : 0 Spare Devices : 0 Layout : left-symmetric Chunk Size : 1024K UUID : fea48e85:ddd2c33f:d19da839:74e9c858 (local to host box1) Events : 0.15 Number Major Minor RaidDevice State 0 8 330 active sync /dev/sdc1 1 8 491 active sync /dev/sdd1 2 8 652 active sync /dev/sde1 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
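As a sanity check, the reported Array Size is self-consistent with RAID5 geometry: a 3-disk RAID5 keeps (n-1) data disks' worth of space, with one disk's worth spread across the array as parity:

```shell
# Figures from the mdadm -D output above, in KiB.
raid_devices=3
used_dev_kib=732571648

# RAID5 usable space = (n - 1) * per-device size.
array_kib=$(( (raid_devices - 1) * used_dev_kib ))
echo "$array_kib"   # matches the reported Array Size of 1465143296
```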
Re: Spontaneous rebuild
On Mon, 3 Dec 2007, Neil Brown wrote: On Sunday December 2, [EMAIL PROTECTED] wrote: Anyway, the problems are back: To test my theory that everything is alright with the CPU running within its specs, I removed one of the drives while copying some large files yesterday. Initially, everything seemed to work out nicely, and by the morning, the rebuild had finished. Again, I unmounted the filesystem and ran badblocks -svn on the LVM. It ran without gripes for some hours, but just now I saw md had started to rebuild the array again out of the blue: Dec 1 20:04:49 quassel kernel: usb 4-5.2: reset high speed USB device using ehci_hcd and address 4 Dec 2 01:06:02 quassel kernel: md: data-check of RAID array md0 ^^ Dec 2 01:06:02 quassel kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk. Dec 2 01:06:02 quassel kernel: md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for data-check. ^^ Dec 2 01:06:02 quassel kernel: md: using 128k window, over a total of 488383936 blocks. Dec 2 03:57:24 quassel kernel: usb 4-5.2: reset high speed USB device using ehci_hcd and address 4 This isn't a resync, it is a data check. Dec 2 is the first Sunday of the month. You probably have a crontab entries that does echo check /sys/block/mdX/md/sync_action early on the first Sunday of the month. I know that Debian does this. It is good to do this occasionally to catch sleeping bad blocks. While we are on the subject of bad blocks, is it possible to do what 3ware raid controllers do without an external card? They know when a block is bad and they remap it to another part of the array etc, where as with software raid you never know this is happening until the disk is dead. For example with 3dm2 it notifies you if you have e-mail alerts set to 2 (warn) it will e-mail you every time there is a sector re-allocation, is this possible with software raid or does it *require* HW raid/external controller? Justin. 
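Neil's explanation is easy to verify or trigger by hand: the monthly data-check is nothing more than a write to each array's sync_action file, which is essentially what Debian's cron job does on the first Sunday of the month. A minimal sketch (`SYSFS_ROOT` is parameterized only so it can be exercised against a scratch tree):

```shell
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Kick off a read-only consistency check of one md array; progress then
# shows up in /proc/mdstat, and any inconsistencies are counted in
# .../md/mismatch_cnt when the check finishes.
start_check() {
    echo check > "$SYSFS_ROOT/block/$1/md/sync_action"
}
```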
Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?
Quick question, Set up a new machine last night with two raptor 150 disks. Set up RAID1 as I do everywhere else, 0.90.03 superblocks (in order to be compatible with LILO; if you use 1.x superblocks with LILO you can't boot), and then: /dev/sda1+sdb1 - /dev/md0 - swap /dev/sda2+sdb2 - /dev/md1 - /boot (ext3) /dev/sda3+sdb3 - /dev/md2 - / (xfs) All works fine, no issues... Quick question though: I turned off the machine, disconnected /dev/sda from the machine, booted from /dev/sdb, no problems, shows as degraded RAID1. Turn the machine off. Re-attach the first drive. When I boot, my first partition either re-synced by itself or it was never degraded; why is this? So two questions: 1) If it rebuilt by itself, how come it only rebuilt /dev/md0? 2) If it did not rebuild, is it because the kernel knows it does not need to re-calculate parity etc. for swap? I had to: mdadm /dev/md1 -a /dev/sda2 and mdadm /dev/md2 -a /dev/sda3 to rebuild /boot and /, which worked fine. I am just curious why it works like this; I figured it would be all or nothing. More info: Not using ANY initramfs/initrd images, everything is compiled into 1 kernel image (makes things MUCH simpler and the expected device layout etc. is always the same, unlike initrd/etc). Justin.
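The manual re-adds above are easy to script: which arrays are degraded can be read straight out of /proc/mdstat, where a missing member shows up as an underscore in the status field. A hedged sketch that parses an mdstat-format file (the awk heuristic is an assumption about the usual 2.6-era layout, not an official interface):

```shell
# Print the names of degraded md arrays from an mdstat-format file.
# A healthy 2-disk mirror shows "[2/2] [UU]"; a degraded one "[2/1] [U_]".
degraded_arrays() {
    awk '/^md/ { name = $1 }
         /\[[0-9]+\/[0-9]+\]/ { if ($0 ~ /_/) print name }' "$1"
}

# On a live system:  degraded_arrays /proc/mdstat
# Each reported array then needs its missing partition re-added, e.g.:
#   mdadm /dev/md1 -a /dev/sda2
```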
Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
I am putting a new machine together and I have dual raptor raid 1 for the root, which works just fine under all stress tests. Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on sale now adays): I ran the following: dd if=/dev/zero of=/dev/sdc dd if=/dev/zero of=/dev/sdd dd if=/dev/zero of=/dev/sde (as it is always a very good idea to do this with any new disk) And sometime along the way(?) (i had gone to sleep and let it run), this occurred: [42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x401 action 0x2 frozen [42880.680231] ata3.00: irq_stat 0x00400040, connection status changed [42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 0x0 data 512 in [42880.680292] res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10 (ATA bus error) [42881.841899] ata3: soft resetting port [42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [42915.919042] ata3.00: qc timeout (cmd 0xec) [42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5) [42915.919149] ata3.00: revalidation failed (errno=-5) [42915.919206] ata3: failed to recover some devices, retrying in 5 secs [42920.912458] ata3: hard resetting port [42926.411363] ata3: port is slow to respond, please be patient (Status 0x80) [42930.943080] ata3: COMRESET failed (errno=-16) [42930.943130] ata3: hard resetting port [42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [42931.413523] ata3.00: configured for UDMA/133 [42931.413586] ata3: EH pending after completion, repeating EH (cnt=4) [42931.413655] ata3: EH complete [42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors (750156 MB) [42931.413809] sd 2:0:0:0: [sdc] Write Protect is off [42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00 [42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Usually when I see this sort of thing with another box I have full of raptors, it was due to a bad raptor and I never saw it 
again after I replaced the disk that it happened on, but that was using the Intel P965 chipset. For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge). I am going to do some further testing but does this indicate a bad drive? Bad cable? Bad connector? As you can see above, /dev/sdc stopped responding for a little bit and then the kernel reset the port. Why is this though? What is the likely root cause? Should I replace the drive? Obviously this is not normal and cannot be good at all, the idea is to put these drives in a RAID5 and if one is going to timeout that is going to cause the array to go degraded and thus be worthless in a raid5 configuration. Can anyone offer any insight here? Thank you, Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
On Sat, 1 Dec 2007, Jan Engelhardt wrote: On Dec 1 2007 06:26, Justin Piszcz wrote: I ran the following: dd if=/dev/zero of=/dev/sdc dd if=/dev/zero of=/dev/sdd dd if=/dev/zero of=/dev/sde (as it is always a very good idea to do this with any new disk) Why would you care about what's on the disk? fdisk, mkfs and the day-to-day operation will overwrite it _anyway_. (If you think the disk is not empty, you should look at it and copy off all usable warez beforehand :-) The purpose is with any new disk its good to write to all the blocks and let the drive to all of the re-mapping before you put 'real' data on it. Let it crap out or fail before I put my data on it. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
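The burn-in Justin describes is just dd across the whole device. Demonstrated here on a scratch file instead of a real disk (the tiny count is purely for illustration; on real hardware the target would be e.g. /dev/sdc, dd would run until the device is full, and you would then check the SMART reallocated-sector count):

```shell
# Stand-in for a disk device so the sketch is safe to run anywhere.
target=$(mktemp)

# Write zeros over every block of the target.
dd if=/dev/zero of="$target" bs=1M count=4 2>/dev/null

# Every block has now been written; on a real disk the drive firmware
# would have remapped any weak sectors during this pass.
wc -c < "$target"
```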
Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?
On Sat, 1 Dec 2007, Jan Engelhardt wrote: On Dec 1 2007 07:12, Justin Piszcz wrote: On Sat, 1 Dec 2007, Jan Engelhardt wrote: On Dec 1 2007 06:19, Justin Piszcz wrote: RAID1, 0.90.03 superblocks (in order to be compatible with LILO, if you use 1.x superblocks with LILO you can't boot) Says who? (Don't use LILO ;-) I like LILO :) LILO cares much less about disk layout / filesystems than GRUB does, so I would have expected LILO to cope with all sorts of superblocks. OTOH I would suspect GRUB to only handle 0.90 and 1.0, where the MDSB is at the end of the disk = the filesystem SB is at the very beginning. So two questions: 1) If it rebuilt by itself, how come it only rebuilt /dev/md0? So md1/md2 was NOT rebuilt? Correct. Well it should, after they are readded using -a. If they still don't, then perhaps another resync is in progress. There was nothing in progress, md0 was synced up and md1,md2 = degraded. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?
On Sat, 1 Dec 2007, Jan Engelhardt wrote: On Dec 1 2007 06:19, Justin Piszcz wrote: RAID1, 0.90.03 superblocks (in order to be compatible with LILO, if you use 1.x superblocks with LILO you can't boot) Says who? (Don't use LILO ;-) I like LILO :) , and then: /dev/sda1+sdb1 - /dev/md0 - swap /dev/sda2+sdb2 - /dev/md1 - /boot (ext3) /dev/sda3+sdb3 - /dev/md2 - / (xfs) All works fine, no issues... Quick question though, I turned off the machine, disconnected /dev/sda from the machine, boot from /dev/sdb, no problems, shows as degraded RAID1. Turn the machine off. Re-attach the first drive. When I boot my first partition either re-synced by itself or it was not degraded, was is this? If md0 was not touched (written to) after you disconnected sda, it also should not be in a degraded state. So two questions: 1) If it rebuilt by itself, how come it only rebuilt /dev/md0? So md1/md2 was NOT rebuilt? Correct. 2) If it did not rebuild, is it because the kernel knows it does not need to re-calculate parity etc for swap? Kernel does not know what's inside an md usually. And it should not try to be smart. Ok. I had to: mdadm /dev/md1 -a /dev/sda2 and mdadm /dev/md2 -a /dev/sda3 To rebuild the /boot and /, which worked fine, I am just curious though why it works like this, I figured it would be all or nothing. Devices are not automatically readded. Who knows, maybe you inserted a different disk into sda which you don't want to be overwritten. Makes sense, I just wanted to confirm that it was normal.. More info: Not using ANY initramfs/initrd images, everything is compiled into 1 kernel image (makes things MUCH simpler and the expected device layout etc is always the same, unlike initrd/etc). My expected device layout is also always the same, _with_ initrd. Why? Simply because mdadm.conf is copied to the initrd, and mdadm will use your defined order. That is another way as well, people seem to be divided. 
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
On Sat, 1 Dec 2007, Janek Kozicki wrote: Justin Piszcz said: (by the date of Sat, 1 Dec 2007 07:23:41 -0500 (EST)) dd if=/dev/zero of=/dev/sdc The purpose is with any new disk it's good to write to all the blocks and let the drive do all of the re-mapping before you put 'real' data on it. Let it crap out or fail before I put my data on it. Better use badblocks. It writes data, then reads it afterwards. In this example the data is semi-random (quicker than /dev/urandom ;) badblocks -c 10240 -s -w -t random -v /dev/sdc -- Janek Kozicki Will give this a shot and see if I can reproduce the error, thanks.
Re: PROBLEM: raid5 hangs
On Wed, 14 Nov 2007, Peter Magnusson wrote: On Wed, 14 Nov 2007, Justin Piszcz wrote: This is a known bug in 2.6.23 and should be fixed in 2.6.23.2 if the RAID5 bio* patches are applied. Ok, good to know. Do you know when it first appeared? It existed in linux-2.6.22.3 also... I am unsure; I and others started noticing it mainly in 2.6.23. Again, not sure, I will let others answer this one. Justin.
Re: PROBLEM: raid5 hangs
On Wed, 14 Nov 2007, Bill Davidsen wrote: Justin Piszcz wrote: This is a known bug in 2.6.23 and should be fixed in 2.6.23.2 if the RAID5 bio* patches are applied. Note below he's running 2.6.22.3 which doesn't have the bug unless -STABLE added it. So should not really be in 2.6.22.anything. I assume you're talking the endless write or bio issue? The bio issue is the root cause of the bug yes? -- I am uncertain but I remember this happening in the past but I thought it was something I was doing (possibly 2.6.23) so it may have been happenign earlier than that but I am not positive. Justin. On Wed, 14 Nov 2007, Peter Magnusson wrote: Hey. [1.] One line summary of the problem: raid5 hangs and use 100% cpu [2.] Full description of the problem/report: I have used 2.6.18 for 284 days or something until my powersupply died, no problem what so ever duing that time. After that forced reboot I did these changes; Put in 2 GB more memory so I have 3 GB instead of 1 GB, two disks in the raid5 got badblocks so I didnt trust them anymore so I bought new disks (I managed to save the raid5). I have 6x300 GB in a raid5. Two of them are now 320 GB so created a small raid1 also. That raid5 is encrypted with aes-cbc-plain. The raid1 is encrypted with aes-cbc-essiv:sha256. I compiled linux-2.6.22.3 and started to use that. I used the same .config as in default FC5, I think i just selected P4 cpu and preemptive kernel type. After 11 or 12 days the computer froze, I wasnt home when it happend and couldnt fix it for like 3 days. It was just to reboot it as it wasnt possible to login remotely or on console. It did respond to ping however. After reboot it rebuilded the raid5. Then it happend again after approx the same time, 11 or 12 days. I noticed that the process md1_raid5 used 100% cpu all the time. After reboot it rebuilded the raid5. I compiled linux-2.6.23. And then... it happend again... After about the same time as before. md1_raid5 used 100% cpu. 
I also noticed that I wasnt able to save anything in my homedir, it froze during save. I could read from it however. My homedir isnt on raid5 but its encrypted. Its not on any disk that has to do with raid. This problem didnt happend when I used 2.6.18. Currently I use 2.6.18 as I kinda need the computer stable. After reboot it rebuilded the raid5. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Thu, 8 Nov 2007, Carlos Carvalho wrote: Jeff Lessem ([EMAIL PROTECTED]) wrote on 6 November 2007 22:00: Dan Williams wrote: The following patch, also attached, cleans up cases where the code looks at sh->ops.pending when it should be looking at the consistent stack-based snapshot of the operations flags. I tried this patch (against a stock 2.6.23), and it did not work for me. Not only did I/O to the affected RAID5 XFS partition stop, but also I/O to all other disks. I was not able to capture any debugging information, but I should be able to do that tomorrow when I can hook a serial console to the machine. I'm not sure if my problem is identical to these others, as mine only seems to manifest with RAID5+XFS. The RAID rebuilds with no problem, and I've not had any problems with RAID5+ext3. Us too! We're stuck trying to build a disk server with several disks in a raid5 array, and the rsync from the old machine stops writing to the new filesystem. It only happens under heavy IO. We can make it lock without rsync, using 8 simultaneous dd's to the array. All IO stops, including the resync after a newly created raid or after an unclean reboot. We could not trigger the problem with ext3 or reiser3; it only happens with xfs. I am including the XFS mailing list as well; can you provide more information to them?
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Thu, 8 Nov 2007, BERTRAND Joël wrote: BERTRAND Joël wrote: Chuck Ebbert wrote: On 11/05/2007 03:36 AM, BERTRAND Joël wrote: Neil Brown wrote: On Sunday November 4, [EMAIL PROTECTED] wrote: # ps auxww | grep D USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 273 0.0 0.0 0 0 ? D Oct21 14:40 [pdflush] root 274 0.0 0.0 0 0 ? D Oct21 13:00 [pdflush] After several days/weeks, this is the second time this has happened: while doing regular file I/O (decompressing a file), everything on the device went into D-state. At a guess (I haven't looked closely) I'd say it is the bug that was meant to be fixed by commit 4ae3f847e49e3787eca91bced31f8fd328d50496 except that patch applied badly and needed to be fixed with the following patch (not in git yet). These have been sent to stable@ and should be in the queue for 2.6.23.2 My linux-2.6.23/drivers/md/raid5.c has contained your patch for a long time: ... spin_lock(&sh->lock); clear_bit(STRIPE_HANDLE, &sh->state); clear_bit(STRIPE_DELAYED, &sh->state); s.syncing = test_bit(STRIPE_SYNCING, &sh->state); s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state); s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state); /* Now to look around and see what can be done */ /* clean-up completed biofill operations */ if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) { clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending); clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack); clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete); } rcu_read_lock(); for (i=disks; i--; ) { mdk_rdev_t *rdev; struct r5dev *dev = &sh->dev[i]; ... but it doesn't fix this bug. Did that chunk starting with /* clean-up completed biofill operations */ end up where it belongs? The patch with the big context moves it to a different place from where the original one puts it when applied to 2.6.23... Lately I've seen several problems where the context isn't enough to make a patch apply properly when some offsets have changed.
In some cases a patch won't apply at all because two nearly-identical areas are being changed and the first chunk gets applied where the second one should, leaving nowhere for the second chunk to apply. I always apply this kind of patch by hand, not with the patch command. The last patch sent here seems to fix this bug: gershwin:[/usr/scripts] cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] md7 : active raid1 sdi1[2] md_d0p1[0] 1464725632 blocks [2/1] [U_] [=====>...............] recovery = 27.1% (396992504/1464725632) finish=1040.3min speed=17104K/sec Resync done. The patch fixes this bug. Regards, JKB Excellent! I cannot easily reproduce the bug on my system, so I will wait for the next stable patch set to include it and let everyone know if it happens again, thanks.
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Tue, 6 Nov 2007, BERTRAND Joël wrote: Done. Here is the obtained output: [ 1265.899068] check 4: state 0x6 toread read write f800fdd4e360 written [ 1265.941328] check 3: state 0x1 toread read write written [ 1265.972129] check 2: state 0x1 toread read write written For information, after the crash, I have: Root poulenc:[/sys/block] cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] md_d0 : active raid5 sdc1[0] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] 1464725760 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU] Regards, JKB After the crash, it is not 'resyncing'? Justin.
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Tue, 6 Nov 2007, BERTRAND Joël wrote: Justin Piszcz wrote: On Tue, 6 Nov 2007, BERTRAND Joël wrote: Done. Here is obtained ouput : [ 1265.899068] check 4: state 0x6 toread read write f800fdd4e360 written [ 1265.941328] check 3: state 0x1 toread read write written [ 1265.972129] check 2: state 0x1 toread read write written For information, after crash, I have : Root poulenc:[/sys/block] cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] md_d0 : active raid5 sdc1[0] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] 1464725760 blocks level 5, 64k chunk, algorithm 2 [6/6] [UU] Regards, JKB After the crash it is not 'resyncing' ? No, it isn't... JKB After any crash/unclean shutdown the RAID should resync, if it doesn't, that's not good, I'd suggest running a raid check. The 'repair' is supposed to clean it, in some cases (md0=swap) it gets dirty again. Tue May 8 09:19:54 EDT 2007: Executing RAID health check for /dev/md0... Tue May 8 09:19:55 EDT 2007: Executing RAID health check for /dev/md1... Tue May 8 09:19:56 EDT 2007: Executing RAID health check for /dev/md2... Tue May 8 09:19:57 EDT 2007: Executing RAID health check for /dev/md3... Tue May 8 10:09:58 EDT 2007: cat /sys/block/md0/md/mismatch_cnt Tue May 8 10:09:58 EDT 2007: 2176 Tue May 8 10:09:58 EDT 2007: cat /sys/block/md1/md/mismatch_cnt Tue May 8 10:09:58 EDT 2007: 0 Tue May 8 10:09:58 EDT 2007: cat /sys/block/md2/md/mismatch_cnt Tue May 8 10:09:58 EDT 2007: 0 Tue May 8 10:09:58 EDT 2007: cat /sys/block/md3/md/mismatch_cnt Tue May 8 10:09:58 EDT 2007: 0 Tue May 8 10:09:58 EDT 2007: The meta-device /dev/md0 has 2176 mismatched sectors. Tue May 8 10:09:58 EDT 2007: Executing repair on /dev/md0 Tue May 8 10:09:59 EDT 2007: The meta-device /dev/md1 has no mismatched sectors. Tue May 8 10:10:00 EDT 2007: The meta-device /dev/md2 has no mismatched sectors. Tue May 8 10:10:01 EDT 2007: The meta-device /dev/md3 has no mismatched sectors. Tue May 8 10:20:02 EDT 2007: All devices are clean... 
Tue May 8 10:20:02 EDT 2007: cat /sys/block/md0/md/mismatch_cnt Tue May 8 10:20:02 EDT 2007: 2176 Tue May 8 10:20:02 EDT 2007: cat /sys/block/md1/md/mismatch_cnt Tue May 8 10:20:02 EDT 2007: 0 Tue May 8 10:20:02 EDT 2007: cat /sys/block/md2/md/mismatch_cnt Tue May 8 10:20:02 EDT 2007: 0 Tue May 8 10:20:02 EDT 2007: cat /sys/block/md3/md/mismatch_cnt Tue May 8 10:20:02 EDT 2007: 0
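The check/repair sequence logged above is easy to script. Below is a minimal sketch of such a scrub helper, not taken from the thread: the SYSFS_ROOT variable, the scrub_array name, and the message wording are my own, and a production version would also poll sync_action until the check returns to "idle" before reading mismatch_cnt.

```shell
#!/bin/sh
# Minimal sketch of an md scrub helper (hypothetical; adjust for your setup).
# SYSFS_ROOT is parameterized so the logic can be exercised against a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/block}"

scrub_array() {
    md="$1"
    dir="$SYSFS_ROOT/$md/md"
    # Trigger a read-only consistency check; the kernel compares
    # mirrors/parity and counts discrepancies in mismatch_cnt.
    echo check > "$dir/sync_action"
    # (A real script would wait here until sync_action reads "idle".)
    count=$(cat "$dir/mismatch_cnt")
    if [ "$count" -ne 0 ]; then
        echo "$md: $count mismatched sectors, consider 'echo repair > sync_action'"
    else
        echo "$md: clean"
    fi
}
```

Run from cron (e.g. monthly) this gives exactly the kind of log shown above.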
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Mon, 5 Nov 2007, Dan Williams wrote: On 11/4/07, Justin Piszcz [EMAIL PROTECTED] wrote: On Mon, 5 Nov 2007, Neil Brown wrote: On Sunday November 4, [EMAIL PROTECTED] wrote: # ps auxww | grep D USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND root 273 0.0 0.0 0 0 ?DOct21 14:40 [pdflush] root 274 0.0 0.0 0 0 ?DOct21 13:00 [pdflush] After several days/weeks, this is the second time this has happened, while doing regular file I/O (decompressing a file), everything on the device went into D-state. At a guess (I haven't looked closely) I'd say it is the bug that was meant to be fixed by commit 4ae3f847e49e3787eca91bced31f8fd328d50496 except that patch applied badly and needed to be fixed with the following patch (not in git yet). These have been sent to stable@ and should be in the queue for 2.6.23.2 Ah, thanks Neil, will be updating as soon as it is released, thanks. Are you seeing the same md thread takes 100% of the CPU that Joël is reporting? Yes, in another e-mail I posted the top output with md3_raid5 at 100%. Justin.
2.6.23.1: mdadm/raid5 hung/d-state
# ps auxww | grep D
USER     PID %CPU %MEM  VSZ  RSS TTY STAT START TIME  COMMAND
root     273  0.0  0.0    0    0  ?   D   Oct21 14:40 [pdflush]
root     274  0.0  0.0    0    0  ?   D   Oct21 13:00 [pdflush]

After several days/weeks, this is the second time this has happened: while doing regular file I/O (decompressing a file), everything on the device went into D-state.

# mdadm -D /dev/md3
/dev/md3:
        Version : 00.90.03
  Creation Time : Wed Aug 22 10:38:53 2007
     Raid Level : raid5
     Array Size : 1318680576 (1257.59 GiB 1350.33 GB)
  Used Dev Size : 146520064 (139.73 GiB 150.04 GB)
   Raid Devices : 10
  Total Devices : 10
Preferred Minor : 3
    Persistence : Superblock is persistent
    Update Time : Sun Nov 4 06:38:29 2007
          State : active
 Active Devices : 10
Working Devices : 10
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 1024K
           UUID : e37a12d1:1b0b989a:083fb634:68e9eb49
         Events : 0.4309

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1
       4       8       97        4      active sync   /dev/sdg1
       5       8      113        5      active sync   /dev/sdh1
       6       8      129        6      active sync   /dev/sdi1
       7       8      145        7      active sync   /dev/sdj1
       8       8      161        8      active sync   /dev/sdk1
       9       8      177        9      active sync   /dev/sdl1

If I wanted to find out what is causing this, what type of debugging would I have to enable to track it down? Any attempt to read/write files on the devices fails (also going into d-state). Is there any useful information I can get currently before rebooting the machine?
# pwd
/sys/block/md3/md
# ls
array_state      dev-sdj1/          rd2@              stripe_cache_active
bitmap_set_bits  dev-sdk1/          rd3@              stripe_cache_size
chunk_size       dev-sdl1/          rd4@              suspend_hi
component_size   layout             rd5@              suspend_lo
dev-sdc1/        level              rd6@              sync_action
dev-sdd1/        metadata_version   rd7@              sync_completed
dev-sde1/        mismatch_cnt       rd8@              sync_speed
dev-sdf1/        new_dev            rd9@              sync_speed_max
dev-sdg1/        raid_disks         reshape_position  sync_speed_min
dev-sdh1/        rd0@               resync_start
dev-sdi1/        rd1@               safe_mode_delay
# cat array_state
active-idle
# cat mismatch_cnt
0
# cat stripe_cache_active
1
# cat stripe_cache_size
16384
# cat sync_action
idle
# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdb2[1] sda2[0]
      136448 blocks [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
      129596288 blocks [2/2] [UU]
md3 : active raid5 sdl1[9] sdk1[8] sdj1[7] sdi1[6] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
      1318680576 blocks level 5, 1024k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
md0 : active raid1 sdb1[1] sda1[0]
      16787776 blocks [2/2] [UU]
unused devices: <none>
#

Justin.
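For the "what debugging can I enable" question above, one low-impact starting point is to list which tasks are stuck in uninterruptible sleep and on what kernel wait channel. A sketch; the helper name and awk filter are mine, not from the thread, and the filter takes stdin so it can be exercised against canned ps output:

```shell
# Sketch: list tasks stuck in uninterruptible sleep (D state) with their
# kernel wait channel. The filter is split out so it can be tested
# against canned `ps` output.
d_state_tasks() {
    # Expects `ps -eo pid,stat,wchan:32,comm`-style lines on stdin;
    # prints pid, command, and wait channel for tasks whose STAT contains D.
    awk 'NR > 1 && $2 ~ /D/ { print $1, $4, $3 }'
}

# Typical use (also: `echo w > /proc/sysrq-trigger` dumps blocked tasks
# with stack traces to the kernel log, if SysRq is enabled):
# ps -eo pid,stat,wchan:32,comm | d_state_tasks
```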
Re: 2.6.23.1: mdadm/raid5 hung/d-state (md3_raid5 stuck in endless loop?)
: 68 high: 186 batch: 31 cpu: 1 pcp: 1 count: 9 high: 62 batch: 15 vm stats threshold: 42 cpu: 2 pcp: 0 count: 79 high: 186 batch: 31 cpu: 2 pcp: 1 count: 10 high: 62 batch: 15 vm stats threshold: 42 cpu: 3 pcp: 0 count: 47 high: 186 batch: 31 cpu: 3 pcp: 1 count: 60 high: 62 batch: 15 vm stats threshold: 42 all_unreclaimable: 0 prev_priority: 12 start_pfn: 1048576 On Sun, 4 Nov 2007, Justin Piszcz wrote: # ps auxww | grep D USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND root 273 0.0 0.0 0 0 ?DOct21 14:40 [pdflush] root 274 0.0 0.0 0 0 ?DOct21 13:00 [pdflush] After several days/weeks, this is the second time this has happened, while doing regular file I/O (decompressing a file), everything on the device went into D-state. # mdadm -D /dev/md3 /dev/md3: Version : 00.90.03 Creation Time : Wed Aug 22 10:38:53 2007 Raid Level : raid5 Array Size : 1318680576 (1257.59 GiB 1350.33 GB) Used Dev Size : 146520064 (139.73 GiB 150.04 GB) Raid Devices : 10 Total Devices : 10 Preferred Minor : 3 Persistence : Superblock is persistent Update Time : Sun Nov 4 06:38:29 2007 State : active Active Devices : 10 Working Devices : 10 Failed Devices : 0 Spare Devices : 0 Layout : left-symmetric Chunk Size : 1024K UUID : e37a12d1:1b0b989a:083fb634:68e9eb49 Events : 0.4309 Number Major Minor RaidDevice State 0 8 330 active sync /dev/sdc1 1 8 491 active sync /dev/sdd1 2 8 652 active sync /dev/sde1 3 8 813 active sync /dev/sdf1 4 8 974 active sync /dev/sdg1 5 8 1135 active sync /dev/sdh1 6 8 1296 active sync /dev/sdi1 7 8 1457 active sync /dev/sdj1 8 8 1618 active sync /dev/sdk1 9 8 1779 active sync /dev/sdl1 If I wanted to find out what is causing this, what type of debugging would I have to enable to track it down? Any attempt to read/write files on the devices fails (also going into d-state). Is there any useful information I can get currently before rebooting the machine? 
# pwd /sys/block/md3/md # ls array_state dev-sdj1/ rd2@ stripe_cache_active bitmap_set_bits dev-sdk1/ rd3@ stripe_cache_size chunk_size dev-sdl1/ rd4@ suspend_hi component_size layoutrd5@ suspend_lo dev-sdc1/level rd6@ sync_action dev-sdd1/metadata_version rd7@ sync_completed dev-sde1/mismatch_cnt rd8@ sync_speed dev-sdf1/new_dev rd9@ sync_speed_max dev-sdg1/raid_disksreshape_position sync_speed_min dev-sdh1/rd0@ resync_start dev-sdi1/rd1@ safe_mode_delay # cat array_state active-idle # cat mismatch_cnt 0 # cat stripe_cache_active 1 # cat stripe_cache_size 16384 # cat sync_action idle # cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] md1 : active raid1 sdb2[1] sda2[0] 136448 blocks [2/2] [UU] md2 : active raid1 sdb3[1] sda3[0] 129596288 blocks [2/2] [UU] md3 : active raid5 sdl1[9] sdk1[8] sdj1[7] sdi1[6] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0] 1318680576 blocks level 5, 1024k chunk, algorithm 2 [10/10] [UU] md0 : active raid1 sdb1[1] sda1[0] 16787776 blocks [2/2] [UU] unused devices: none # Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Sun, 4 Nov 2007, BERTRAND Joël wrote: Justin Piszcz wrote: # ps auxww | grep D USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 273 0.0 0.0 0 0 ? D Oct21 14:40 [pdflush] root 274 0.0 0.0 0 0 ? D Oct21 13:00 [pdflush] After several days/weeks, this is the second time this has happened: while doing regular file I/O (decompressing a file), everything on the device went into D-state. Same observation here (kernel 2.6.23). I can see this bug when I try to synchronize a raid1 volume over iSCSI (each element is a raid5 volume), or sometimes only with a 1.5 TB raid5 volume. When this bug occurs, the md subsystem eats 100% of one CPU and pdflush remains in D state too. What is your architecture? I use two 32-thread T1000s (sparc64), and I'm trying to determine if this bug is arch specific. Regards, JKB Using x86_64 here (Q6600/Intel DG965WH). Justin.
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Mon, 5 Nov 2007, Neil Brown wrote: On Sunday November 4, [EMAIL PROTECTED] wrote: # ps auxww | grep D USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND root 273 0.0 0.0 0 0 ?DOct21 14:40 [pdflush] root 274 0.0 0.0 0 0 ?DOct21 13:00 [pdflush] After several days/weeks, this is the second time this has happened, while doing regular file I/O (decompressing a file), everything on the device went into D-state. At a guess (I haven't looked closely) I'd say it is the bug that was meant to be fixed by commit 4ae3f847e49e3787eca91bced31f8fd328d50496 except that patch applied badly and needed to be fixed with the following patch (not in git yet). These have been sent to stable@ and should be in the queue for 2.6.23.2 Ah, thanks Neil, will be updating as soon as it is released, thanks. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID1 resync and read errors (loop)
On Fri, 26 Oct 2007, Filippo Carletti wrote: Is there a way to control an array resync process? In particular, is it possible to skip read errors? My setup: an LVM2 Physical Volume over a two-disk MD RAID1 array; the Logical Volumes didn't span the whole PV, leaving some PEs free at the end of the disks. What happened: disk1 broke. I installed a new disk1 and started a sync from disk2 to disk1, but at 99.9% disk2 gave some read errors and the sync process started again, over and over. I didn't notice the errors on disk2 because they were in unallocated PEs at the end of the disk: the MD device spans the whole disk, while the LVs don't. I'd like to complete the sync ignoring read errors, then replace disk2. I think this is a not-so-uncommon situation; leaving some PEs free for future expansion is a good idea, and errors go undetected until you use those free areas. Thanks. -- Ciao, Filippo This is why you should scrub your RAID. Justin.
Re: Software RAID when it works and when it doesn't
On Fri, 26 Oct 2007, Goswin von Brederlow wrote: Justin Piszcz [EMAIL PROTECTED] writes: On Fri, 19 Oct 2007, Alberto Alonso wrote: On Thu, 2007-10-18 at 17:26 +0200, Goswin von Brederlow wrote: Mike Accetta [EMAIL PROTECTED] writes: What I would like to see is a timeout-driven fallback mechanism. If one mirror does not return the requested data within a certain time (say 1 second), then the request should be duplicated on the other mirror. If the first mirror later unchokes, it remains in the raid; if it fails, it gets removed. But (at least reads) should not have to wait for that process. Even better would be if some write delay could also be used. The still-working mirror would get an increase in its serial (so on reboot you know one disk is newer). If the choking mirror unchokes, it can write back all the delayed data and also increase its serial to match. Otherwise it gets really failed. But you might have to use bitmaps for this, or the cache size would limit its usefulness. MfG Goswin I think a timeout on both reads and writes is a must. Basically I believe that all the problems I've encountered using software raid would have been resolved by a timeout within the md code. This would keep a server from crashing/hanging when the underlying driver doesn't properly handle hard drive problems. MD can be smarter than the dumb drivers. Just my thoughts though, as I've never gotten an answer as to whether or not md can implement its own timeouts. Alberto I have a question about remapping sectors: can software raid be as efficient or as good at remapping bad sectors as an external raid controller, for, e.g., raid 10 or raid5? Justin. Software raid makes no remapping of bad sectors at all. It assumes the disks will do sufficient remapping.
MfG Goswin Thanks, this is what I was looking for. Justin.
Re: Test
Success. On Thu, 25 Oct 2007, Daniel L. Miller wrote: Sorry for consuming bandwidth - but all of a sudden I'm not seeing messages. Is this going through? -- Daniel
Re: Test 2
Success 2. On Thu, 25 Oct 2007, Daniel L. Miller wrote: Thanks for the test responses - I have re-subscribed...if I see this myself...I'm back! -- Daniel
Re: flaky controller or disk error?
On Mon, 22 Oct 2007, Louis-David Mitterrand wrote: Hi, [using kernel 2.6.23 and mdadm 2.6.3+20070929] I have a rather flaky sata controller with which I am trying to resync a raid5 array. It usually starts failing after 40% of the resync is done. Short of changing the controller (which I will do later this week), is there a way to have mdadm resume the resync where it left off at reboot time? Here is the error I am seeing in the syslog. Can this actually be a disk error? Oct 18 11:54:34 sylla kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x1 action 0x2 frozen Oct 18 11:54:34 sylla kernel: ata1.00: irq_stat 0x0040, PHY RDY changed Oct 18 11:54:34 sylla kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 Oct 18 11:54:34 sylla kernel: res 40/00:00:19:26:33/00:00:3a:00:00/40 Emask 0x10 (ATA bus error) Oct 18 11:54:35 sylla kernel: ata1: soft resetting port Oct 18 11:54:40 sylla kernel: ata1: failed to reset engine (errno=-95) Oct 18 11:54:40 sylla kernel: ata1: port is slow to respond, please be patient (Status 0xd0) Oct 18 11:54:45 sylla kernel: ata1: softreset failed (device not ready) Oct 18 11:54:45 sylla kernel: ata1: hard resetting port Oct 18 11:54:46 sylla kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Oct 18 11:54:46 sylla kernel: ata1.00: configured for UDMA/133 Oct 18 11:54:46 sylla kernel: ata1: EH complete Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] 976773168 512-byte hardware sectors (500108 MB) Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] Write Protect is off Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 Oct 18 11:54:46 sylla kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA Thanks, I've seen something similar; it turned out to be a bad disk.
I've also seen it when the cable was loose. Justin.
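When a bad disk or loose cable is suspected, the drive's SMART counters are worth checking before swapping hardware. Below is a sketch that pulls the reallocated-sector count out of `smartctl -A` output; the helper name is mine, and the field positions assume smartmontools' usual attribute table layout:

```shell
# Sketch: extract the raw Reallocated_Sector_Ct value from `smartctl -A`
# output. In the attribute table, field 2 is the attribute name and the
# last field is the raw value.
realloc_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $NF }'
}

# Typical use:
# smartctl -A /dev/sda | realloc_count
```

A rising count (or a growing Current_Pending_Sector) points at the disk; bus errors with clean SMART counters point more toward the cable or controller.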
Re: slow raid5 performance
On Tue, 23 Oct 2007, Richard Scobie wrote: Peter wrote: Thanks Justin, good to hear about some real world experience. Hi Peter, I recently built a 3 drive RAID5 using the onboard SATA controllers on an MCP55 based board and get around 115MB/s write and 141MB/s read. A fourth drive was added some time later and after growing the array and filesystem (XFS), saw 160MB/s write and 178MB/s read, with the array 60% full. Regards, Richard - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Yes, your chipset must be PCI-e based and not PCI. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
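A quick way to see whether the PCI bus (rather than the disks) is the bottleneck is a streaming dd test against the array. A rough sketch; the function name is mine, conv=fdatasync assumes GNU dd, and the path/size arguments are placeholders:

```shell
# Sketch: crude sequential write/read benchmark. fdatasync on the write
# makes the reported rate reflect the disks rather than the page cache.
dd_bench() {
    file="$1"; mb="$2"
    dd if=/dev/zero of="$file" bs=1M count="$mb" conv=fdatasync 2>&1 | tail -n 1
    # For an honest read number, drop caches first (needs root):
    # echo 3 > /proc/sys/vm/drop_caches
    dd if="$file" of=/dev/null bs=1M 2>&1 | tail -n 1
    rm -f "$file"
}

# Typical use against the array mount point:
# dd_bench /mnt/raid/ddtest.bin 1024
```

On a 32-bit/33 MHz PCI bus the write rate will plateau well under the ~133 MB/s bus limit; on PCIe-attached controllers the numbers above (115-178 MB/s) are achievable.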
Re: Software RAID when it works and when it doesn't
On Sat, 20 Oct 2007, Michael Tokarev wrote: There was an idea some years ago about having an additional layer on between a block device and whatever else is above it (filesystem or something else), that will just do bad block remapping. Maybe it was even implemented in LVM or IBM-proposed EVMS (the version that included in-kernel stuff too, not only the userspace management), but I don't remember details anymore. In any case, - but again, if memory serves me right, -- there was low interest in that because of exactly this -- drives are now more intelligent, there's hardly a notion of bad block anymore, at least persistent bad block, -- at least visible to the upper layers. /mjt When I run 3dm2 (3ware 3dm2/tools/daemon) I often see LBA remapped sector, success, etc.. My question is, how come I do not see this with mdadm/software raid? Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, Doug Ledford wrote: On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote: I'm sure an internal bitmap would. On RAID1 arrays, reads/writes are never split up by a chunk size for stripes. A 2mb read is a single read, whereas on a raid4/5/6 array, a 2mb read will end up hitting a series of stripes across all disks. That means that on raid1 arrays, total disk seeks roughly equal total reads/writes, whereas on a raid4/5/6, total disk seeks usually exceed total reads/writes. That in turn implies that in a raid1 setup, disk seek time is important to performance, but not necessarily paramount. For raid456, disk seek time is paramount because of how many more seeks that format uses. When you then use an internal bitmap, you are adding writes to every member of the raid456 array, which adds more seeks. The same is true for raid1, but since raid1 doesn't have the same level of dependency on seek rates that raid456 has, it doesn't show the same performance hit that raid456 does. Got it, so for RAID1 it would make sense if LILO supported it (the later versions of the md superblock) Lilo doesn't know anything about the superblock format; however, lilo expects the raid1 device to start at the beginning of the physical partition. In other words, format 1.0 would work with lilo. Did not work when I tried 1.x with LILO; I switched back to 00.90.03 and it worked fine. (for those who use LILO) but for RAID4/5/6, keep the bitmaps away :) I still use an internal bitmap regardless ;-) To help mitigate the cost of seeks on raid456, you can specify a huge chunk size (like 256k to 2MB or somewhere in that range). As long as you can get 90%+ of your reads/writes to fall into the space of a single chunk, then you start performing more like a raid1 device without the extra seek overhead. Of course, this comes at the expense of peak throughput on the device. Let's say you were building a mondo movie server, where you were streaming out digital movie files.
In that case, you very well may care more about throughput than seek performance since I suspect you wouldn't have many small, random reads. Then I would use a small chunk size, sacrifice the seek performance, and get the throughput bonus of parallel reads from the same stripe on multiple disks. On the other hand, if I was setting up a mail server then I would go with a large chunk size because the filesystem activities themselves are going to produce lots of random seeks, and you don't want your raid setup to make that problem worse. Plus, most mail doesn't come in or go out at any sort of massive streaming speed, so you don't need the paralllel reads from multiple disks to perform well. It all depends on your particular use scenario. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
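Doug's rule of thumb above, getting 90%+ of requests to fall inside a single chunk, can be checked with simple arithmetic: a request touches only one data disk if its first and last byte land in the same chunk. A small sketch (my own helper, not from the thread):

```shell
# Sketch: does a request of LEN KiB starting at OFFSET KiB stay inside
# one chunk of CHUNK KiB? If the first and last byte fall in the same
# chunk, only one data disk is seeked (plus parity disks on writes).
fits_one_chunk() {
    chunk_kb=$1; offset_kb=$2; len_kb=$3
    first=$(( offset_kb / chunk_kb ))
    last=$(( (offset_kb + len_kb - 1) / chunk_kb ))
    [ "$first" -eq "$last" ]
}

# e.g. with a 1024 KiB chunk, a 64 KiB read at offset 512 KiB stays in
# chunk 0, while a 2048 KiB read at any offset must cross chunks.
```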
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, Doug Ledford wrote: On Fri, 2007-10-19 at 12:45 -0400, Justin Piszcz wrote: On Fri, 19 Oct 2007, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin Is a bitmap created by default with 1.x? I remember seeing Justin reports of 15-30% performance degradation using a bitmap on a Justin RAID5 with 1.x. Not according to the mdadm man page. I'd probably give up that performance if it meant that re-syncing an array went much faster after a crash. I certainly use it on my RAID1 setup on my home machine. John The performance AFTER a crash yes, but in general usage I remember seeing someone here doing benchmarks it had a negative affect on performance. I'm sure an internal bitmap would. On RAID1 arrays, reads/writes are never split up by a chunk size for stripes. A 2mb read is a single read, where as on a raid4/5/6 array, a 2mb read will end up hitting a series of stripes across all disks. That means that on raid1 arrays, total disk seeks total reads/writes, where as on a raid4/5/6, total disk seeks is usually total reads/writes. That in turn implies that in a raid1 setup, disk seek time is important to performance, but not necessarily paramount. For raid456, disk seek time is paramount because of how many more seeks that format uses. When you then use an internal bitmap, you are adding writes to every member of the raid456 array, which adds more seeks. The same is true for raid1, but since raid1 doesn't have the same level of dependency on seek rates that raid456 has, it doesn't show the same performance hit that raid456 does. Justin. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband Got it, so for RAID1 it would make sense if LILO supported it (the later versions of the md superblock) (for those who use LILO) but for RAID4/5/6, keep the bitmaps away :) Justin. 
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, John Stoffel wrote: Doug == Doug Ledford [EMAIL PROTECTED] writes: Doug On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? Doug 1.0, 1.1, and 1.2 are the same format, just in different positions on Doug the disk. Of the three, the 1.1 format is the safest to use since it Doug won't allow you to accidentally have some sort of metadata between the Doug beginning of the disk and the raid superblock (such as an lvm2 Doug superblock), and hence whenever the raid array isn't up, you won't be Doug able to accidentally mount the lvm2 volumes, filesystem, etc. (In worse Doug case situations, I've seen lvm2 find a superblock on one RAID1 array Doug member when the RAID1 array was down, the system came up, you used the Doug system, the two copies of the raid array were made drastically Doug inconsistent, then at the next reboot, the situation that prevented the Doug RAID1 from starting was resolved, and it never know it failed to start Doug last time, and the two inconsistent members we put back into a clean Doug array). So, deprecating any of these is not really helpful. And you Doug need to keep the old 0.90 format around for back compatibility with Doug thousands of existing raid arrays. This is a great case for making the 1.1 format be the default. So what are the advantages of the 1.0 and 1.2 formats then? Or should be we thinking about making two copies of the data on each RAID member, one at the beginning and one at the end, for resiliency? I just hate seeing this in the mag page: Declare the style of superblock (raid metadata) to be used. The default is 0.90 for --create, and to guess for other operations. 
The default can be overridden by setting the metadata value for the CREATE keyword in mdadm.conf. Options are: 0, 0.90, default Use the original 0.90 format superblock. This format limits arrays to 28 component devices and limits component devices of levels 1 and greater to 2 terabytes. 1, 1.0, 1.1, 1.2 Use the new version-1 format superblock. This has few restrictions. The different sub-versions store the superblock at different locations on the device, either at the end (for 1.0), at the start (for 1.1) or 4K from the start (for 1.2). It looks to me like the 1.1 format, combined with the 1.0, should be what we use, with the 1.2 format nuked. Maybe call it 1.3? *grin* So at this point I'm not arguing to get rid of the 0.9 format, though I think it should NOT be the default any more; we should be using the 1.1 combined with 1.0 format. Is a bitmap created by default with 1.x? I remember seeing reports of 15-30% performance degradation using a bitmap on a RAID5 with 1.x. John - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
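The superblock placement options quoted above can be selected explicitly at creation time rather than relying on the default. A minimal sketch (the device names /dev/md0 and /dev/sd[bcd]1 are hypothetical):

```shell
# Create a RAID5 array with a version-1.1 superblock (stored at the
# start of each member) instead of the 0.90 default discussed above.
mdadm --create /dev/md0 --metadata=1.1 --level=5 --raid-devices=3 \
      /dev/sdb1 /dev/sdc1 /dev/sdd1

# Check which superblock version an existing member carries.
mdadm --examine /dev/sdb1 | grep -i version
```

The same choice can be made permanent with a `CREATE metadata=1.1` line in mdadm.conf, as the man page excerpt notes.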
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, Doug Ledford wrote: On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? 1.0, 1.1, and 1.2 are the same format, just in different positions on the disk. Of the three, the 1.1 format is the safest to use since it won't allow you to accidentally have some sort of metadata between the beginning of the disk and the raid superblock (such as an lvm2 superblock), and hence whenever the raid array isn't up, you won't be able to accidentally mount the lvm2 volumes, filesystem, etc. (In worst-case situations, I've seen lvm2 find a superblock on one RAID1 array member when the RAID1 array was down, the system came up, you used the system, the two copies of the raid array were made drastically inconsistent, then at the next reboot, the situation that prevented the RAID1 from starting was resolved, and it never knew it failed to start last time, and the two inconsistent members were put back into a clean array). So, deprecating any of these is not really helpful. And you need to keep the old 0.90 format around for backward compatibility with thousands of existing raid arrays. Agreed; what is the benefit of deprecating them? Is there that much old code relying on them? It's certainly easy enough to change mdadm to default to the 1.2 format and to require a --force switch to allow use of the older formats. I keep seeing that we support these old formats, and it's never been clear to me why we have four different ones available? Why can't we start defining the canonical format for Linux RAID metadata?
Thanks, John [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Justin I hope 00.90.03 is not deprecated, LILO cannot boot off of Justin anything else! Are you sure? I find that GRUB is much easier to use and set up than LILO these days. But hey, just dropping down to support 00.90.03 and 1.2 formats would be fine too. Let's just lessen the confusion if at all possible. John - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? It's certainly easy enough to change mdadm to default to the 1.2 format and to require a --force switch to allow use of the older formats. I keep seeing that we support these old formats, and it's never been clear to me why we have four different ones available? Why can't we start defining the canonical format for Linux RAID metadata? Thanks, John [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Justin I hope 00.90.03 is not deprecated, LILO cannot boot off of Justin anything else! Are you sure? I find that GRUB is much easier to use and set up than LILO these days. But hey, just dropping down to support 00.90.03 and 1.2 formats would be fine too. Let's just lessen the confusion if at all possible. John I am sure, I submitted a bug report to the LILO developer; he acknowledged the bug but I don't know if it was fixed. I have not tried GRUB with a RAID1 setup yet. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? It's certainly easy enough to change mdadm to default to the 1.2 format and to require a --force switch to allow use of the older formats. I keep seeing that we support these old formats, and it's never been clear to me why we have four different ones available? Why can't we start defining the canonical format for Linux RAID metadata? Thanks, John [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html I hope 00.90.03 is not deprecated, LILO cannot boot off of anything else! Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Software RAID when it works and when it doesn't
On Fri, 19 Oct 2007, Alberto Alonso wrote: On Thu, 2007-10-18 at 17:26 +0200, Goswin von Brederlow wrote: Mike Accetta [EMAIL PROTECTED] writes: What I would like to see is a timeout-driven fallback mechanism. If one mirror does not return the requested data within a certain time (say 1 second) then the request should be duplicated on the other mirror. If the first mirror later unchokes then it remains in the raid; if it fails it gets removed. But (at least reads) should not have to wait for that process. Even better would be if some write delay could also be used. The still-working mirror would get an increase in its serial (so on reboot you know one disk is newer). If the choking mirror unchokes then it can write back all the delayed data and also increase its serial to match. Otherwise it gets really failed. But you might have to use bitmaps for this or the cache size would limit its usefulness. MfG Goswin I think a timeout on both reads and writes is a must. Basically I believe that all the problems I've encountered using software RAID would have been resolved by a timeout within the md code. This will keep a server from crashing/hanging when the underlying driver doesn't properly handle hard drive problems. MD can be smarter than the dumb drivers. Just my thoughts though, as I've never got an answer as to whether or not md can implement its own timeouts. Alberto - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html I have a question about re-mapping sectors: can software RAID be as efficient or as good at remapping bad sectors as an external RAID controller for, e.g., RAID 10 or RAID 5? Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin Is a bitmap created by default with 1.x? I remember seeing Justin reports of 15-30% performance degradation using a bitmap on a Justin RAID5 with 1.x. Not according to the mdadm man page. I'd probably give up that performance if it meant that re-syncing an array went much faster after a crash. I certainly use it on my RAID1 setup on my home machine. John The performance AFTER a crash, yes, but in general usage I remember someone here posting benchmarks showing it had a negative effect on performance. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
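For anyone who wants to measure the trade-off discussed above themselves, a write-intent bitmap can be added to or removed from an existing array after creation; a minimal sketch (assuming a hypothetical /dev/md0):

```shell
# Add an internal write-intent bitmap: much faster resync after a
# crash or unclean shutdown, at some cost to steady-state writes.
mdadm --grow /dev/md0 --bitmap=internal

# Remove it again if ordinary write performance matters more.
mdadm --grow /dev/md0 --bitmap=none

# Confirm the current bitmap state.
mdadm --detail /dev/md0 | grep -i bitmap
```

This makes it easy to benchmark the same array with and without a bitmap rather than committing at creation time.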
Re: experiences with raid5: stripe_queue patches
On Mon, 15 Oct 2007, Bernd Schubert wrote: Hi, in order to tune raid performance I did some benchmarks with and without the stripe queue patches. 2.6.22 is only for comparison to rule out other effects, e.g. the new scheduler, etc. It seems there is a regression with these patches regarding the re-write performance; as you can see, it's almost 50% of what it should be. write re-write read re-read 480844.26 448723.48 707927.55 706075.02 (2.6.22 w/o SQ patches) 487069.47 232574.30 709038.28 707595.09 (2.6.23 with SQ patches) 469865.75 438649.88 711211.92 703229.00 (2.6.23 without SQ patches) Benchmark details: 3xraid5 over 4 partitions of the very same hardware raid (in the end that's raid65: raid6 in hardware and raid5 in software; we need to do that). chunk size: 8192 stripe_cache_size: 8192 each readahead of the md*: 65535 (well, actually it limits itself to 65528) readahead of the underlying partitions: 16384 filesystem: xfs Testsystem: 2 x Quadcore Xeon 1.86 GHz (E5320) An interesting effect to notice: Without these patches the pdflush daemons will take a lot of CPU time; with these patches, pdflush almost doesn't appear in the 'top' list. Actually we would prefer one single raid5 array, but then one single raid5 thread will run with 100% CPU time leaving 7 CPUs in idle state, the status of the hardware raid says its utilization is only at about 50% and we only see writes at about 200 MB/s. On the contrary, with 3 different software raid5 sets the i/o to the hardware raid systems is the bottleneck. Is there any chance to parallelize the raid5 code? I think almost everything is done in raid5.c make_request(), but the main loop there is spin_locked by prepare_to_wait(). Would it be possible not to lock this entire loop?
Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Excellent questions I look forward to reading this thread :) Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
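For anyone reproducing Bernd's setup, the tunables he lists map onto standard sysfs and blockdev knobs; a sketch using his values (the md0 device name is hypothetical):

```shell
# Raise the raid5 stripe cache; the value is a number of cache
# entries per array, set through the md sysfs directory.
echo 8192 > /sys/block/md0/md/stripe_cache_size

# Set readahead on the md device; blockdev takes 512-byte sectors.
# (As noted above, md clamps 65535 down to 65528.)
blockdev --setra 65535 /dev/md0
blockdev --getra /dev/md0
```

Both settings are runtime-only and revert at reboot unless reapplied from an init script.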
Re: RAID 5: weird size results after Grow
On Sat, 13 Oct 2007, Marko Berg wrote: Bill Davidsen wrote: Marko Berg wrote: I added a fourth drive to a RAID 5 array. After some complications related to adding a new HD controller at the same time, and thus changing some device names, I re-created the array and got it working (in the sense that nothing was degraded). But the size results are weird. Each component partition is 320 G; does anyone have an explanation for the Used Dev Size field value below? The 960 G total size is as it should be, but in practice Linux reports the array only having 625,019,608 blocks. I don't see that number below, what command reported this? For instance df: $ df Filesystem 1K-blocks Used Available Use% Mounted on /dev/md0 625019608 358223356 235539408 61% /usr/pub How can this be, even though the array should be clean with 4 active devices? $ mdadm -D /dev/md0 /dev/md0: Version : 01.02.03 Creation Time : Sat Oct 13 01:25:26 2007 Raid Level : raid5 Array Size : 937705344 (894.27 GiB 960.21 GB) Used Dev Size : 625136896 (298.09 GiB 320.07 GB) Raid Devices : 4 Total Devices : 4 Preferred Minor : 0 Persistence : Superblock is persistent Update Time : Sat Oct 13 05:11:38 2007 State : clean Active Devices : 4 Working Devices : 4 Failed Devices : 0 Spare Devices : 0 Layout : left-symmetric Chunk Size : 64K Name : 0 UUID : 9bf903f8:7fc9eec1:2ff25011:37e9607b Events : 2 Number Major Minor RaidDevice State 0 253 2 0 active sync /dev/VolGroup01/LogVol02 1 8 33 1 active sync /dev/sdc1 2 8 49 2 active sync /dev/sdd1 3 8 17 3 active sync /dev/sdb1 Results for mdadm -E partition on all devices appear like this one, with positions changed: $ mdadm -E /dev/sdc1 /dev/sdc1: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 9bf903f8:7fc9eec1:2ff25011:37e9607b Name : 0 Creation Time : Sat Oct 13 01:25:26 2007 Raid Level : raid5 Raid Devices : 4 Used Dev Size : 625137010 (298.09 GiB 320.07 GB) Array Size : 1875410688 (894.27 GiB 960.21 GB) Used Size : 625136896 (298.09 GiB 320.07 GB) Data Offset : 272
sectors Super Offset : 8 sectors State : clean Device UUID : 9b2037fb:231a8ebf:1aaa5577:140795cc Update Time : Sat Oct 13 10:56:02 2007 Checksum : c729f5a1 - correct Events : 2 Layout : left-symmetric Chunk Size : 64K Array Slot : 1 (0, 1, 2, 3) Array State : uUuu Particularly, Used Dev Size and Used Size report an amount twice the size of the partition (and device). Array Size is here twice the actual size, even though shown correctly within parentheses. Sectors are 512 bytes. So Used Dev Size above uses sector size, while Array Size uses 1k blocks? I'm pretty sure, though, that previously Used Dev Size was in 1k blocks too. That's also what most of the examples on the net seem to have. Finally, mdstat shows the block count as it should be. $ cat /proc/mdstat Personalities : [raid6] [raid5] [raid4] md0 : active raid5 sdb1[3] sdd1[2] sdc1[1] dm-2[0] 937705344 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/4] [UUUU] unused devices: <none> Any suggestions on how to fix this, or what to investigate next, would be appreciated! I'm not sure what you're trying to fix here, everything you posted looks sane. I'm trying to find the missing 300 GB that, as df reports, are not available. I ought to have a 900 GB array, consisting of four 300 GB devices, while only 600 GB are available. Adding the fourth device didn't increase the (visible, at least) capacity of the array. E.g. fdisk reports the array size to be 900 G, but df still claims 600 G capacity. Any clues why? -- Marko - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html You have to expand the filesystem. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
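Marko's units guess checks out numerically: the version-1 superblock sizes printed by mdadm -E are in 512-byte sectors, while the Array Size line from mdadm -D is in 1 KiB blocks. Verifying with the figures quoted above:

```shell
# Used Dev Size from mdadm -E: 625136896 sectors of 512 bytes each.
# Integer GiB (1 GiB = 1073741824 bytes):
echo $(( 625136896 * 512 / 1073741824 ))    # prints 298, i.e. ~298.09 GiB

# Array Size from mdadm -D: 937705344 blocks of 1 KiB each:
echo $(( 937705344 * 1024 / 1073741824 ))   # prints 894, i.e. ~894.27 GiB
```

Both match the GiB values mdadm itself shows in parentheses, so nothing is actually doubled; only the units differ between the two reports.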
Re: RAID 5: weird size results after Grow
On Sat, 13 Oct 2007, Marko Berg wrote: Corey Hickey wrote: Marko Berg wrote: Bill Davidsen wrote: Marko Berg wrote: Any suggestions on how to fix this, or what to investigate next, would be appreciated! I'm not sure what you're trying to fix here, everything you posted looks sane. I'm trying to find the missing 300 GB that, as df reports, are not available. I ought to have a 900 GB array, consisting of four 300 GB devices, while only 600 GB are available. Adding the fourth device didn't increase the capacity of the array (visible, at least). E.g. fdisk reports the array size to be 900 G, but df still claims 600 capacity. Any clues why? df reports the size of the filesystem, which is still about 600GB--the filesystem doesn't resize automatically when the size of the underlying device changes. You'll need to use resize2fs, resize_reiserfs, or whatever other tool is appropriate for your type of filesystem. -Corey Right, so this isn't one of my sharpest days... Thanks a bunch, Corey! -- Marko - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Ah, already answered. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
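As both replies note, growing the array enlarges the block device but not the filesystem on it; the resize step has to be run separately. A sketch (the device name and mount point are hypothetical, and the right tool depends on the filesystem):

```shell
# ext2/ext3: grow the filesystem to fill the enlarged device
# (can be done online on a mounted filesystem with recent kernels).
resize2fs /dev/md0

# XFS: must be mounted; xfs_growfs is addressed by mount point.
xfs_growfs /usr/pub

# reiserfs:
resize_reiserfs /dev/md0
```

After this, df should report the full capacity of the grown array.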
Re: RAID 5 performance issue.
On Thu, 11 Oct 2007, Andrew Clayton wrote: On Thu, 11 Oct 2007 13:06:39 -0400, Bill Davidsen wrote: Andrew Clayton wrote: On Fri, 5 Oct 2007 16:56:03 -0400, John Stoffel wrote: Can you start a 'vmstat 1' in one window, then start whatever you do to get crappy performance. That would be interesting to see. In trying to find something simple that can show the problem I'm seeing, I think I may have found the culprit. Just testing on my machine at home, I made this simple program:

/* fslattest.c */
#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>

int main(int argc, char *argv[])
{
        char file[255];

        if (argc < 2) {
                printf("Usage: fslattest file\n");
                exit(1);
        }

        strncpy(file, argv[1], 254);
        printf("Opening %s\n", file);
        while (1) {
                int testfd = open(file,
                        O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600);
                close(testfd);
                unlink(file);
                sleep(1);
        }
        exit(0);
}

If I run this program under strace in my home directory (XFS file system on a (new) disk (no raid involved) all to its own), like

$ strace -T -e open ./fslattest test

it doesn't look too bad:

open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.005043
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000212
open("test", O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.016844

If I then start up a dd in the same place,

$ dd if=/dev/zero of=bigfile bs=1M count=500

then I see the problem I'm seeing at work.
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 2.000348 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 1.594441 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 2.224636 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 1.074615 Doing the same on my other disk which is Ext3 and contains the root fs, it doesn't ever stutter open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.015423 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.92 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.93 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.88 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000103 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.96 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.94 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000114 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.91 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000274 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC|O_LARGEFILE, 0600) = 3 0.000107 Somewhere in there was the dd, but you can't tell. I've found if I mount the XFS filesystem with nobarrier, the latency is reduced to about 0.5 seconds with occasional spikes 1 second. When doing this on the raid array. 
open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.009164 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.71 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.002667 dd kicks in open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 11.580238 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 3.94 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.63 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 4.297978 dd finishes open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.000199 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.013413 open(test, O_WRONLY|O_CREAT|O_EXCL|O_TRUNC, 0600) = 3 0.025134 I guess I should take this to the XFS folks. Try mounting the filesystem noatime and see if that's part of the problem. Yeah, it's mounted noatime. Looks like I tracked this down to an XFS regression. http://marc.info/?l=linux-fsdevelm=119211228609886w=2 Cheers, Andrew - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Nice! Thanks for reporting the final result, 1-2 weeks of debugging/discussion, nice you found it. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID 5 performance issue.
On Sun, 7 Oct 2007, Dean S. Messing wrote: Justin Piszcz wrote: On Fri, 5 Oct 2007, Dean S. Messing wrote: Brendan Conoboy wrote: snip Is the onboard SATA controller real SATA or just an ATA-SATA converter? If the latter, you're going to have trouble getting faster performance than any one disk can give you at a time. The output of 'lspci' should tell you if the onboard SATA controller is on its own bus or sharing space with some other device. Pasting the output here would be useful. snip N00bee question: How does one tell if a machine's disk controller is an ATA-SATA converter? The output of `lspci|fgrep -i sata' is: 00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI Controller\ (rev 09) suggests a real SATA. These references to ATA in dmesg, however, make me wonder. ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata1.00: ATA-7: WDC WD1600JS-75NCB3, 10.02E04, max UDMA/133 ata1.00: 31250 sectors, multi 0: LBA48 NCQ (depth 31/32) ata1.00: configured for UDMA/133 ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata2.00: ATA-7: ST3160812AS, 3.ADJ, max UDMA/133 ata2.00: 31250 sectors, multi 0: LBA48 NCQ (depth 31/32) ata2.00: configured for UDMA/133 ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata3.00: ATA-7: ST3500630NS, 3.AEK, max UDMA/133 ata3.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32) ata3.00: configured for UDMA/133 Dean - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html His drives are either really old and do not support NCQ or he is not using AHCI in the BIOS. Sorry, Justin, if I wasn't clear. I was asking the N00bee question about _my_own_ machine. The output of lspci (on my machine) seems to indicate I have a real SATA controller on the motherboard, but the contents of dmesg, with the references to ATA-7 and UDMA/133, made me wonder if I had just an ATA-SATA converter.
Hence my question: how does one tell definitively if one has a real SATA controller on the Mother Board? The output looks like a real (AHCI-capable) SATA controller and your drives are using NCQ/AHCI. Output from one of my machines: [ 23.621462] ata1: SATA max UDMA/133 cmd 0xf8812100 ctl 0x bmdma 0x irq 219 [ 24.078390] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 24.549806] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300) As far as why it shows UDMA/133 in the kernel output I am sure there is a reason :) I know in the older SATA drives there was a bridge chip that was used to convert the drive from IDE-SATA maybe it is from those legacy days, not sure. With the newer NCQ/'native' SATA drives, the bridge chip should no longer exist. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
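For what it's worth, a rough userspace check for Dean's question is to look at which kernel driver binds the controller and at the per-disk queue depth (a sketch; /dev/sda and the sysfs paths assume a typical modern kernel):

```shell
# A native controller normally binds the ahci driver (shown on the
# "Kernel driver in use" line), not a plain PATA/IDE driver.
lspci -k | grep -A 2 -i sata

# An NCQ queue depth greater than 1 is another hint the path is
# native SATA rather than a bridged IDE drive.
cat /sys/block/sda/device/queue_depth
```

Neither check is conclusive on its own, but together with the AHCI line in lspci output they usually settle it.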
Re: very degraded RAID5, or increasing capacity by adding discs
On Mon, 8 Oct 2007, Janek Kozicki wrote: Hello, Recently I started to use mdadm and I'm very impressed by its capabilities. I have raid0 (250+250 GB) on my workstation. And I want to have raid5 (4*500 = 1500 GB) on my backup machine. The backup machine currently doesn't have raid, just a single 500 GB drive. I plan to buy more HDDs to have a bigger space for my backups but since I cannot afford all HDDs at once I face a problem of expanding an array. I'm able to add one 500 GB drive every few months until I have all 4 drives. But I cannot make a backup of a backup... so reformatting/copying all data each time when I add new disc to the array is not possible for me. Is it possible anyhow to create a very degraded raid array - a one that consists of 4 drives, but has only TWO ? This would involve some very tricky *hole* management on the block device... A one that places holes in stripes on the block device, until more discs are added to fill the holes. When the holes are filled, the block device grows bigger, and with lvm I just increase the filesystem size. This is perhaps coupled with some unstripping that moves/reorganizes blocks around to fill/defragment the holes. is it just a pipe dream? best regards PS: yes it's simple to make a degraded array of 3 drives, but I cannot afford two discs at once... -- Janek Kozicki | - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html With raid1 you can create a degraded array with 1 disk- I have done this, I have always wondered if mdadm will let you make a degraded raid 5 array with 2 disks (you'd specify 3 and only give 2) - you can always expand later. Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
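To Justin's closing question: mdadm does accept the literal word "missing" in place of a member device, so a 3-device RAID5 can be created with only two real disks and completed later. A sketch (device names hypothetical; only one member may be "missing" for raid5):

```shell
# Create a degraded 3-device RAID5 using two real disks.
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      /dev/sdb1 /dev/sdc1 missing

# Months later, add the third disk and let the array resync onto it.
mdadm --add /dev/md0 /dev/sdd1
```

Note this only defers one disk purchase; it does not provide the stripe-with-holes growth Janek describes, and the degraded array has no redundancy until the missing member is added and resynced.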