Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-29 Thread dean gaudet
On Tue, 25 Dec 2007, Bill Davidsen wrote:

 The issue I'm thinking about is hardware sector size, which on modern drives
 may be larger than 512b and therefore entail a read-alter-rewrite (RAR) cycle
 when writing a 512b block.

i'm not sure any shipping SATA disks have larger than 512B sectors yet... 
do you know of any?  (or is this thread about SCSI which i don't pay 
attention to...)

on a brand new WDC WD7500AAKS-00RBA0 with this partition layout:

255 heads, 63 sectors/track, 91201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

so sda1 starts at a non-multiple of 4096 into the disk.
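The misalignment is easy to see with a few lines of arithmetic (my sketch, not part of the original mail): the classic DOS start sector of 63 puts the partition 32256 bytes (31.5 KiB) into the disk, which is not a multiple of 4096.

```python
SECTOR_BYTES = 512

def partition_byte_offset(start_sector):
    """Byte offset of a partition from the start of the disk."""
    return start_sector * SECTOR_BYTES

# sda1 starts at LBA 63 (the classic 255-head/63-sector CHS convention)
offset = partition_byte_offset(63)
print(offset)          # 32256 bytes = 31.5 KiB
print(offset % 4096)   # 3584, i.e. not 4096-byte aligned
```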

i ran some random seek+write experiments using
http://arctic.org/~dean/randomio/, here are the results using 512 byte
and 4096 byte writes (fsync after each write), 8 threads, on sda1:

# ./randomio /dev/sda1 8 1 1 512 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  148.5 |    0.0   inf    nan    0.0    nan |  148.5   0.2   53.7   89.3   19.5
  129.2 |    0.0   inf    nan    0.0    nan |  129.2  37.2   61.9   96.7    9.3
  131.2 |    0.0   inf    nan    0.0    nan |  131.2  40.3   61.0   90.4    9.3
  132.0 |    0.0   inf    nan    0.0    nan |  132.0  39.6   60.6   89.3    9.1
  130.7 |    0.0   inf    nan    0.0    nan |  130.7  39.8   61.3   98.1    8.9
  131.4 |    0.0   inf    nan    0.0    nan |  131.4  40.0   60.8  101.0    9.6
# ./randomio /dev/sda1 8 1 1 4096 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  141.7 |    0.0   inf    nan    0.0    nan |  141.7   0.3   56.3   99.3   21.1
  132.4 |    0.0   inf    nan    0.0    nan |  132.4  43.3   60.4   91.8    8.5
  131.6 |    0.0   inf    nan    0.0    nan |  131.6  41.4   60.9  111.0    9.6
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  41.4   60.7   85.3    8.6
  130.6 |    0.0   inf    nan    0.0    nan |  130.6  41.7   61.3   95.0    9.4
  131.4 |    0.0   inf    nan    0.0    nan |  131.4  42.2   60.8   90.5    8.4


i think the anomalous results in the first 10s samples are perhaps the drive
coming out of a standby state.

and here are the results aligned using the sda raw device itself:

# ./randomio /dev/sda 8 1 1 512 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  147.3 |    0.0   inf    nan    0.0    nan |  147.3   0.3   54.1   93.7   20.1
  132.4 |    0.0   inf    nan    0.0    nan |  132.4  37.4   60.6   91.8    9.2
  132.5 |    0.0   inf    nan    0.0    nan |  132.5  37.7   60.3   93.7    9.3
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  39.4   60.7   92.7    9.0
  133.9 |    0.0   inf    nan    0.0    nan |  133.9  41.7   59.8   90.7    8.5
  130.2 |    0.0   inf    nan    0.0    nan |  130.2  40.8   61.5   88.6    8.9
# ./randomio /dev/sda 8 1 1 4096 10 6
  total |  read:         latency (ms)       |  write:        latency (ms)
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  145.4 |    0.0   inf    nan    0.0    nan |  145.4   0.3   54.9   94.0   20.1
  130.3 |    0.0   inf    nan    0.0    nan |  130.3  36.0   61.4   92.7    9.6
  130.6 |    0.0   inf    nan    0.0    nan |  130.6  38.2   61.2   96.7    9.2
  132.1 |    0.0   inf    nan    0.0    nan |  132.1  39.0   60.5   93.5    9.2
  131.8 |    0.0   inf    nan    0.0    nan |  131.8  43.1   60.8   93.8    9.1
  129.0 |    0.0   inf    nan    0.0    nan |  129.0  40.2   62.0   96.4    8.8

it looks pretty much the same to me...

-dean
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-29 Thread Justin Piszcz



On Sat, 29 Dec 2007, dean gaudet wrote:


[]

it looks pretty much the same to me...

-dean



Good to know/have it confirmed by someone else: the alignment does not 
matter with Linux/SW RAID.


Justin.


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-29 Thread Michael Tokarev
Justin Piszcz wrote:
[]
 Good to know/have it confirmed by someone else, the alignment does not
 matter with Linux/SW RAID.

Alignment matters when one partitions a Linux/SW raid array.
If the inside partitions are not aligned on a stripe
boundary, esp. in the worst case when filesystem blocks
cross the stripe boundary (I wondered whether that is even
possible, and I think it is: if a partition starts at some
odd 512-byte offset and the filesystem block size is 4Kb),
there is just no chance for the inside filesystem to do
full-stripe writes, ever, so (modulo stripe cache size) all
writes will go the read-modify-write or a similar way.
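To illustrate the point (my sketch, assuming a 64 KiB chunk size for the array): count how many 4 KiB filesystem blocks straddle a chunk boundary when the partition starts at sector 63 versus an aligned start at sector 128.

```python
SECTOR = 512
CHUNK = 64 * 1024   # assumed 64 KiB chunk size
BLOCK = 4 * 1024    # 4 KiB filesystem block

def crossing_blocks(part_start_sector, nblocks=1024):
    """Count filesystem blocks whose byte range spans a chunk boundary."""
    base = part_start_sector * SECTOR
    crossings = 0
    for i in range(nblocks):
        start = base + i * BLOCK
        end = start + BLOCK - 1
        if start // CHUNK != end // CHUNK:
            crossings += 1
    return crossings

print(crossing_blocks(63))    # 64: one block in every 16 straddles a chunk
print(crossing_blocks(128))   # 0: aligned start, no block ever crosses
```

With the odd 63-sector start, every chunk boundary falls in the middle of some filesystem block, so those blocks force partial-chunk (read-modify-write) handling.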

And that's what the original article is about, by the way.
It just happens that a hardware raid array is more often split
into partitions (using native tools) than a linux software raid
array is.

And that's what has been pointed out in this thread, as well... ;)

/mjt


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-25 Thread Bill Davidsen

Robin Hill wrote:

On Wed Dec 19, 2007 at 09:50:16AM -0500, Justin Piszcz wrote:

  

The (up to) 30% percent figure is mentioned here:
http://insights.oetiker.ch/linux/raidoptimization.html



That looks to be referring to partitioning a RAID device - this'll only
apply to hardware RAID or partitionable software RAID, not to the normal
use case.  When you're creating an array out of standard partitions then
you know the array stripe size will align with the disks (there's no way
it cannot), and you can set the filesystem stripe size to align as well
(XFS will do this automatically).

I've actually done tests on this with hardware RAID to try to find the
correct partition offset, but wasn't able to see any difference (using
bonnie++ and moving the partition start by one sector at a time).

  

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

   Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid autodetect




This looks to be a normal disk - the partition offsets shouldn't be
relevant here (barring any knowledge of the actual physical disk layout
anyway, and block remapping may well make that rather irrelevant).
  
The issue I'm thinking about is hardware sector size, which on modern 
drives may be larger than 512b and therefore entail a read-alter-rewrite 
(RAR) cycle when writing a 512b block. With larger writes, if the 
alignment is poor and the write size is some multiple of 512, it's 
possible to have an RAR at each end of the write. The only way to have a 
hope of controlling the alignment is to write to a raw device, or to use 
a filesystem which can be configured to have blocks which are a multiple 
of the sector size and to do all i/o in block-size units, starting each 
file on a block boundary. That may be possible with ext[234] set up 
properly.


Why this is important: the physical layout of the drive is useful, but 
for a large write the drive will have to make some number of steps from 
one cylinder to another. By carefully choosing the starting point, the 
best improvement will be to eliminate 2 track-to-track seek times, one 
at the start and one at the end. If the writes are small, only one t2t 
saving is possible.


Now consider a RAR process. The drive is spinning typically at 7200 rpm, 
or 8.333 ms/rev. A read might take .5 rev on average, and a RAR will 
take 1.5 rev, because it takes a full revolution after the original data 
is read before the altered data can be rewritten. Larger sectors give 
more capacity, but reduced performance for write. And doing small writes 
can result in paying the RAR penalty on every write. So there may be a 
measurable benefit to getting that alignment right at the drive level.
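The rotational arithmetic above works out as follows (my restatement of the same numbers, assuming 7200 rpm):

```python
RPM = 7200
rev_ms = 60_000 / RPM        # one revolution: ~8.333 ms

avg_read_ms = 0.5 * rev_ms   # ~4.17 ms: half a revolution on average
rar_ms = 1.5 * rev_ms        # 12.5 ms: read, wait a full rev, rewrite

print(round(rev_ms, 3), round(avg_read_ms, 2), round(rar_ms, 2))
```

So each misaligned small write can cost roughly three times the rotational latency of a plain read.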


--
Bill Davidsen [EMAIL PROTECTED]
 Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over... Otto von Bismarck 





Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-20 Thread Justin Piszcz



On Thu, 20 Dec 2007, Bill Davidsen wrote:


Justin Piszcz wrote:



On Wed, 19 Dec 2007, Bill Davidsen wrote:

I'm going to try another approach, I'll describe it when I get results (or 
not).


http://home.comcast.net/~jpiszcz/align_vs_noalign/

Hardly any difference at all; only on the per-char read/write is it any 
faster..?


Am I misreading what you are doing here... you have the underlying data on 
the actual hardware devices 64k aligned by using either the whole device or 
starting a partition on a 64k boundary? I'm dubious that you will see a 
difference any other way, after all the translations take place.


I'm trying to create a raid array using loop devices created with the offset 
parameter, but I suspect that I will wind up doing a test after just 
repartitioning the drives, painful as that will be.


Average of 3 runs taken:

$ cat align/*log|grep ,
p63,8G,57683,94,86479,13,55242,8,63495,98,147647,11,434.8,0,16:10:16/64,1334210,10,330,2,120,1,3978,10,312,2 
p63,8G,57973,95,76702,11,50830,7,62291,99,136477,10,388.3,0,16:10:16/64,1252548,6,296,1,115,1,7927,20,373,2 
p63,8G,57758,95,80847,12,52144,8,63874,98,144747,11,443.4,0,16:10:16/64,1242445,6,303,1,117,1,6767,17,359,2 


$ cat noalign/*log|grep ,
p63,8G,57641,94,85494,12,55669,8,63802,98,146925,11,434.8,0,16:10:16/64,1353180,8,314,1,117,1,8684,22,283,2 
p63,8G,57705,94,85929,12,56708,8,63855,99,143437,11,436.2,0,16:10:16/64,12211519,29,297,1,113,1,3218,8,325,2 
p63,8G,57783,94,78226,11,48580,7,63487,98,137721,10,438.7,0,16:10:16/64,1243229,8,307,1,120,1,4247,11,313,2 




--
Bill Davidsen [EMAIL PROTECTED]
Woe unto the statesman who makes war without a reason that will still
be valid when the war is over... Otto von Bismarck 



1. For the first test, I made partitions on each drive like I normally do.
2. For the second test, I followed the EMC document on how to properly 
align the partitions and Microsoft's document on how to calculate the 
correct offset; I used 512 for a 256k stripe.


Justin.


Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Justin Piszcz

The (up to) 30% percent figure is mentioned here:
http://insights.oetiker.ch/linux/raidoptimization.html

On http://forums.storagereview.net/index.php?showtopic=25786:
This user writes about the problem:
XP, and virtually every O/S and partitioning software of XP's day, by default 
places the first partition on a disk at sector 63. Being an odd number, and 
31.5KB into the drive, it isn't ever going to align with any stripe size. This 
is an unfortunate industry standard.
Vista on the other hand aligns the first partition on sector 2048 by default 
as a by-product of its revisions to support large-sector hard drives. As 
RAID5 arrays in write mode mimic the performance characteristics of 
large-sector hard drives, this comes as a great if not inadvertent 
benefit. 2048 is evenly divisible by 2 and 4 (allowing for 3 and 5 drive arrays 
optimally) and virtually every stripe size in common use. If you are however 
using a 4-drive RAID5, you're SOOL.
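The sector-63 versus sector-2048 claim is easy to verify (my sketch; the stripe sizes are illustrative choices, not from the quoted post):

```python
SECTOR = 512

def aligned(start_sector, stripe_kib):
    """True if the partition's byte offset is a multiple of the stripe size."""
    return (start_sector * SECTOR) % (stripe_kib * 1024) == 0

for start in (63, 2048):
    for stripe_kib in (16, 64, 256, 1024):
        print(start, stripe_kib, aligned(start, stripe_kib))
```

Sector 63 aligns with none of them; sector 2048 (a 1 MiB offset) aligns with every power-of-two stripe size up to 1 MiB.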

Page 9 in this PDF (EMC_BestPractice_R22.pdf) shows the problem graphically:
http://bbs.doit.com.cn/attachment.php?aid=6757

--

Now to my setup / question:

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

   Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid autodetect

---

If I use 10-disk RAID5 with 1024 KiB stripe, what would be the correct 
start and end size if I wanted to make sure the RAID5 was stripe aligned?


Or is there a better way to do this, does parted handle this situation 
better?


What is the best (and correct) way to calculate stripe-alignment on the 
RAID5 device itself?
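One hedged sketch of the arithmetic (mine, not from the thread): with a 1024 KiB chunk and 10-disk RAID5 there are 9 data chunks per stripe, so a full data stripe is 9216 KiB, or 18432 sectors; a partition start aligned to the full stripe would be a multiple of that. Whether chunk-level or full-stripe alignment is the right target depends on the workload.

```python
SECTOR = 512
CHUNK_KIB = 1024
DATA_DISKS = 9   # a 10-disk RAID5 has 9 data chunks per stripe
STRIPE_BYTES = DATA_DISKS * CHUNK_KIB * 1024

def first_aligned_sector(min_sector=1):
    """Smallest start sector at or after min_sector on a full-stripe boundary."""
    stripe_sectors = STRIPE_BYTES // SECTOR   # 18432 sectors per data stripe
    return ((min_sector + stripe_sectors - 1) // stripe_sectors) * stripe_sectors

print(first_aligned_sector(63))   # 18432
```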


---

The EMC paper recommends:

Disk partition adjustment for Linux systems
In Linux, align the partition table before data is written to the LUN, as 
the partition map will be rewritten
and all data on the LUN destroyed. In the following example, the LUN is 
mapped to
/dev/emcpowerah, and the LUN stripe element size is 128 blocks. Arguments 
for the fdisk utility are as follows:

fdisk /dev/emcpowerah
x    # expert mode
b    # adjust starting block number
1    # choose partition 1
128  # set it to 128, our stripe element size
w    # write the new partition

---

Does this also apply to Linux/SW RAID5?  Or are there any caveats that are 
not taken into account since it is based in SW vs. HW?


---

What it currently looks like:

Command (m for help): x

Expert command (m for help): p

Disk /dev/sdc: 255 heads, 63 sectors, 18241 cylinders

Nr AF  Hd Sec  Cyl  Hd Sec  Cyl     Start      Size ID
 1 00   1   1    0 254  63 1023        63 293041602 fd
 2 00   0   0    0   0   0    0         0         0 00
 3 00   0   0    0   0   0    0         0         0 00
 4 00   0   0    0   0   0    0         0         0 00

Justin.


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Mattias Wadenstein

On Wed, 19 Dec 2007, Justin Piszcz wrote:


--

Now to my setup / question:

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

  Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid autodetect

---

If I use 10-disk RAID5 with 1024 KiB stripe, what would be the correct start 
and end size if I wanted to make sure the RAID5 was stripe aligned?


Or is there a better way to do this, does parted handle this situation 
better?


From that setup it seems simple, scrap the partition table and use the 
disk device for raid. This is what we do for all data storage disks (hw 
raid) and sw raid members.


/Mattias Wadenstein


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Justin Piszcz



On Wed, 19 Dec 2007, Mattias Wadenstein wrote:


On Wed, 19 Dec 2007, Justin Piszcz wrote:


--

Now to my setup / question:

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

  Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid autodetect


---

If I use 10-disk RAID5 with 1024 KiB stripe, what would be the correct 
start and end size if I wanted to make sure the RAID5 was stripe aligned?


Or is there a better way to do this, does parted handle this situation 
better?


From that setup it seems simple, scrap the partition table and use the 
disk device for raid. This is what we do for all data storage disks (hw raid) 
and sw raid members.


/Mattias Wadenstein



Is there any downside to doing that?  I remember when I had to take my 
machine apart for a BIOS downgrade; when I plugged in the sata devices 
again I did not plug them back in the same order. Everything worked of 
course, but when I ran LILO it said the disk was not part of the RAID 
set, because /dev/sda had become /dev/sdg, and it overwrote the MBR on 
the disk. If I had not used partitions here, I'd have lost one (or more) 
of the drives due to a bad LILO run?


Justin.


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Jon Nelson
On 12/19/07, Justin Piszcz [EMAIL PROTECTED] wrote:


 On Wed, 19 Dec 2007, Mattias Wadenstein wrote:
  From that setup it seems simple, scrap the partition table and use the
  disk device for raid. This is what we do for all data storage disks (hw 
  raid)
  and sw raid members.
 
  /Mattias Wadenstein
 

 Is there any downside to doing that?  I remember when I had to take my

There is one (just pointed out to me yesterday): having the partition
and having it labeled as raid makes identification quite a bit easier
for humans and software, too.

-- 
Jon


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Justin Piszcz



On Wed, 19 Dec 2007, Jon Nelson wrote:


On 12/19/07, Justin Piszcz [EMAIL PROTECTED] wrote:



On Wed, 19 Dec 2007, Mattias Wadenstein wrote:

From that setup it seems simple, scrap the partition table and use the

disk device for raid. This is what we do for all data storage disks (hw raid)
and sw raid members.

/Mattias Wadenstein



Is there any downside to doing that?  I remember when I had to take my


There is one (just pointed out to me yesterday): having the partition
and having it labeled as raid makes identification quite a bit easier
for humans and software, too.

--
Jon



Some nice graphs found here:
http://sqlblog.com/blogs/linchi_shea/archive/2007/02/01/performance-impact-of-disk-misalignment.aspx



Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Bill Davidsen

Justin Piszcz wrote:



On Wed, 19 Dec 2007, Mattias Wadenstein wrote:


[]

From that setup it seems simple, scrap the partition table and use the 
disk device for raid. This is what we do for all data storage disks 
(hw raid) and sw raid members.


/Mattias Wadenstein



Is there any downside to doing that?  I remember when I had to take my 
machine apart for a BIOS downgrade; when I plugged in the sata devices 
again I did not plug them back in the same order. Everything worked of 
course, but when I ran LILO it said the disk was not part of the RAID 
set, because /dev/sda had become /dev/sdg, and it overwrote the MBR on 
the disk. If I had not used partitions here, I'd have lost one (or more) 
of the drives due to a bad LILO run?


As other posts have detailed, putting the partition on a 64k aligned 
boundary can address the performance problems. However, a poor choice of 
chunk size, cache_buffer size, or just random i/o in small sizes can eat 
up a lot of the benefit.


I don't think you need to give up your partitions to get the benefit of 
alignment.


--
Bill Davidsen [EMAIL PROTECTED]
 Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over... Otto von Bismarck 





Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Justin Piszcz



On Wed, 19 Dec 2007, Bill Davidsen wrote:


[]

As other posts have detailed, putting the partition on a 64k aligned boundary 
can address the performance problems. However, a poor choice of chunk size, 
cache_buffer size, or just random i/o in small sizes can eat up a lot of the 
benefit.


I don't think you need to give up your partitions to get the benefit of 
alignment.


--
Bill Davidsen [EMAIL PROTECTED]
Woe unto the statesman who makes war without a reason that will still
be valid when the war is over... Otto von Bismarck 



Hrmm..

I am doing a benchmark now with:

6 x 400GB (SATA) / 256 KiB stripe with unaligned vs. aligned raid setup.

unaligned, just fdisk /dev/sdc, mkpartition, fd raid.
  aligned, fdisk, expert, start at 512 as the offset

Per a Microsoft KB:

Example of alignment calculations in kilobytes for a 256-KB stripe unit 
size:

(63 * .5) / 256 = 0.123046875
(64 * .5) / 256 = 0.125
(128 * .5) / 256 = 0.25
(256 * .5) / 256 = 0.5
(512 * .5) / 256 = 1
These examples show that the partition is not aligned correctly for a 
256-KB stripe unit size until the partition is created by using an offset 
of 512 sectors (512 bytes per sector).


So I should start at 512 for a 256k chunk size.
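The KB's arithmetic (0.5 KB per 512-byte sector) can be reproduced directly; a small sketch:

```python
STRIPE_KB = 256

def offset_in_stripes(offset_sectors, stripe_kb=STRIPE_KB):
    """(sectors * 0.5 KB) / stripe size, as in the KB article."""
    return (offset_sectors * 0.5) / stripe_kb

for sectors in (63, 64, 128, 256, 512):
    r = offset_in_stripes(sectors)
    print(sectors, r, r == int(r))   # aligned only when the ratio is whole
```

Only the 512-sector offset yields a whole number of 256 KB stripes, matching the conclusion above.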

I ran bonnie++ three consecutive times and took the average for the 
unaligned, rebuilding the RAID5 now and then I will re-execute the test 3 
additional times and take the average of that.


Justin.


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Jon Nelson
On 12/19/07, Bill Davidsen [EMAIL PROTECTED] wrote:
 As other posts have detailed, putting the partition on a 64k aligned
 boundary can address the performance problems. However, a poor choice of
 chunk size, cache_buffer size, or just random i/o in small sizes can eat
 up a lot of the benefit.

 I don't think you need to give up your partitions to get the benefit of
 alignment.

How might that benefit be realized?
Assume I have 3 disks, /dev/sd{b,c,d} all partitioned identically with
4 partitions, and I want to use /dev/sd{b,c,d}3 for a new SW raid.

What sequence of steps can I take to ensure that my raid is aligned on
a 64K boundary?
What effect do the different superblock formats have, if any, in this situation?


-- 
Jon



Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Bill Davidsen

Justin Piszcz wrote:



[]
So I should start at 512 for a 256k chunk size.

I ran bonnie++ three consecutive times and took the average for the 
unaligned, rebuilding the RAID5 now and then I will re-execute the 
test 3 additional times and take the average of that.


I'm going to try another approach, I'll describe it when I get results 
(or not).


--
Bill Davidsen [EMAIL PROTECTED]
 Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over... Otto von Bismarck 





Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Justin Piszcz



On Wed, 19 Dec 2007, Bill Davidsen wrote:


Justin Piszcz wrote:



On Wed, 19 Dec 2007, Bill Davidsen wrote:


Justin Piszcz wrote:



On Wed, 19 Dec 2007, Mattias Wadenstein wrote:


On Wed, 19 Dec 2007, Justin Piszcz wrote:


--

Now to my setup / question:

# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

  Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid 
autodetect


---

If I use 10-disk RAID5 with 1024 KiB stripe, what would be the correct 
start and end size if I wanted to make sure the RAID5 was stripe 
aligned?


Or is there a better way to do this, does parted handle this situation 
better?


From that setup it seems simple, scrap the partition table and use the 
disk device for raid. This is what we do for all data storage disks (hw 
raid) and sw raid members.


/Mattias Wadenstein



Is there any downside to doing that?  I remember when I had to take my 
machine apart for a BIOS downgrade: when I plugged the SATA devices in 
again I did not plug them back in the same order. Everything worked, of 
course, but when I ran LILO it said the disk was not part of the RAID set, 
because /dev/sda had become /dev/sdg, and it overwrote the MBR on the disk. 
If I had not used partitions here, wouldn't I have lost one (or more) of 
the drives due to a bad LILO run?


As other posts have detailed, putting the partition on a 64k aligned 
boundary can address the performance problems. However, a poor choice of 
chunk size, cache_buffer size, or just random i/o in small sizes can eat 
up a lot of the benefit.


I don't think you need to give up your partitions to get the benefit of 
alignment.


--
Bill Davidsen [EMAIL PROTECTED]
Woe unto the statesman who makes war without a reason that will still
be valid when the war is over... Otto von Bismark


Hrmm..

I am doing a benchmark now with:

6 x 400GB (SATA) / 256 KiB stripe with unaligned vs. aligned raid setup.

unaligned: just fdisk /dev/sdc, mkpartition, type fd (Linux raid autodetect).
 aligned: fdisk, expert mode, start at sector 512 as the offset.

Per a Microsoft KB:

Example of alignment calculations in kilobytes for a 256-KB stripe unit 
size:

(63 * .5) / 256 = 0.123046875
(64 * .5) / 256 = 0.125
(128 * .5) / 256 = 0.25
(256 * .5) / 256 = 0.5
(512 * .5) / 256 = 1
These examples show that the partition is not aligned correctly for a 
256-KB stripe unit size until the partition is created by using an offset 
of 512 sectors (at 512 bytes per sector).


So I should start at 512 for a 256k chunk size.
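[Editor's note: the KB arithmetic above reduces to a divisibility check, which can be sketched as follows. This is a minimal illustration; the 256 KiB chunk size and the candidate start sectors are the ones from the thread.]

```python
SECTOR_BYTES = 512

def partition_is_aligned(start_sector: int, chunk_kib: int) -> bool:
    """True if the partition's byte offset is a multiple of the chunk size."""
    return (start_sector * SECTOR_BYTES) % (chunk_kib * 1024) == 0

# The offsets from the KB example, against a 256 KiB stripe unit:
for start in (63, 64, 128, 256, 512):
    print(start, partition_is_aligned(start, 256))
# Only a start sector of 512 (i.e. 256 KiB into the disk) is aligned.
```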

I ran bonnie++ three consecutive times and took the average for the 
unaligned, rebuilding the RAID5 now and then I will re-execute the test 3 
additional times and take the average of that.


I'm going to try another approach, I'll describe it when I get results (or 
not).


Waiting for the raid to rebuild then I will re-run thereafter.

  [=...]  recovery = 86.7% (339104640/390708480) 
finish=30.8min speed=27835K/sec


...


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Justin Piszcz



On Wed, 19 Dec 2007, Bill Davidsen wrote:

I'm going to try another approach, I'll describe it when I get results (or 
not).


http://home.comcast.net/~jpiszcz/align_vs_noalign/

Hardly any difference whatsoever; only on the per-char read/write is it 
any faster..?


Average of 3 runs taken:

$ cat align/*log|grep ,
p63,8G,57683,94,86479,13,55242,8,63495,98,147647,11,434.8,0,16:10:16/64,1334210,10,330,2,120,1,3978,10,312,2
p63,8G,57973,95,76702,11,50830,7,62291,99,136477,10,388.3,0,16:10:16/64,1252548,6,296,1,115,1,7927,20,373,2
p63,8G,57758,95,80847,12,52144,8,63874,98,144747,11,443.4,0,16:10:16/64,1242445,6,303,1,117,1,6767,17,359,2

$ cat noalign/*log|grep ,
p63,8G,57641,94,85494,12,55669,8,63802,98,146925,11,434.8,0,16:10:16/64,1353180,8,314,1,117,1,8684,22,283,2
p63,8G,57705,94,85929,12,56708,8,63855,99,143437,11,436.2,0,16:10:16/64,12211519,29,297,1,113,1,3218,8,325,2
p63,8G,57783,94,78226,11,48580,7,63487,98,137721,10,438.7,0,16:10:16/64,1243229,8,307,1,120,1,4247,11,313,2



Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Robin Hill
On Wed Dec 19, 2007 at 09:50:16AM -0500, Justin Piszcz wrote:

 The (up to) 30% percent figure is mentioned here:
 http://insights.oetiker.ch/linux/raidoptimization.html

That looks to be referring to partitioning a RAID device - this'll only
apply to hardware RAID or partitionable software RAID, not to the normal
use case.  When you're creating an array out of standard partitions then
you know the array stripe size will align with the disks (there's no way
it cannot), and you can set the filesystem stripe size to align as well
(XFS will do this automatically).
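[Editor's note: Robin's point about the filesystem stripe settings can be made concrete. A minimal sketch of the su/sw arithmetic for Justin's proposed 10-disk RAID5 with a 1024 KiB chunk; as Robin says, mkfs.xfs normally picks these values up from md automatically, so passing them by hand is only needed for hardware RAID or unusual setups.]

```python
def xfs_stripe_opts(n_disks: int, chunk_kib: int, level: int = 5) -> str:
    """Suggest mkfs.xfs stripe options for a RAID array.

    su = the array chunk size; sw = the number of data-bearing disks
    (RAID5 loses one disk to parity, RAID6 loses two).
    """
    parity = {0: 0, 5: 1, 6: 2}[level]
    data_disks = n_disks - parity
    return f"mkfs.xfs -d su={chunk_kib}k,sw={data_disks} ..."

print(xfs_stripe_opts(10, 1024))   # 10-disk RAID5, 1024 KiB chunk
```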

I've actually done tests on this with hardware RAID to try to find the
correct partition offset, but wasn't able to see any difference (using
bonnie++ and moving the partition start by one sector at a time).

 # fdisk -l /dev/sdc

 Disk /dev/sdc: 150.0 GB, 150039945216 bytes
 255 heads, 63 sectors/track, 18241 cylinders
 Units = cylinders of 16065 * 512 = 8225280 bytes
 Disk identifier: 0x5667c24a

Device Boot  Start End  Blocks   Id  System
 /dev/sdc1   1   18241   146520801   fd  Linux raid 
 autodetect

This looks to be a normal disk - the partition offsets shouldn't be
relevant here (barring any knowledge of the actual physical disk layout
anyway, and block remapping may well make that rather irrelevant).

That's my take on this one anyway.

Cheers,
Robin
-- 
 ___
( ' } |   Robin Hill[EMAIL PROTECTED] |
   / / )  | Little Jim says |
  // !!   |  He fallen in de water !! |




Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Justin Piszcz



On Wed, 19 Dec 2007, Robin Hill wrote:


On Wed Dec 19, 2007 at 09:50:16AM -0500, Justin Piszcz wrote:


The (up to) 30% percent figure is mentioned here:
http://insights.oetiker.ch/linux/raidoptimization.html


That looks to be referring to partitioning a RAID device - this'll only
apply to hardware RAID or partitionable software RAID, not to the normal
use case.  When you're creating an array out of standard partitions then
you know the array stripe size will align with the disks (there's no way
it cannot), and you can set the filesystem stripe size to align as well
(XFS will do this automatically).

I've actually done tests on this with hardware RAID to try to find the
correct partition offset, but wasn't able to see any difference (using
bonnie++ and moving the partition start by one sector at a time).


# fdisk -l /dev/sdc

Disk /dev/sdc: 150.0 GB, 150039945216 bytes
255 heads, 63 sectors/track, 18241 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x5667c24a

   Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   18241   146520801   fd  Linux raid
autodetect


This looks to be a normal disk - the partition offsets shouldn't be
relevant here (barring any knowledge of the actual physical disk layout
anyway, and block remapping may well make that rather irrelevant).

That's my take on this one anyway.

Cheers,
   Robin
--
___
   ( ' } |   Robin Hill[EMAIL PROTECTED] |
  / / )  | Little Jim says |
 // !!   |  He fallen in de water !! |



Interesting, yes, I am using XFS as well, thanks for the response.


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Michal Soltys

Justin Piszcz wrote:


Or is there a better way to do this, does parted handle this situation 
better?


What is the best (and correct) way to calculate stripe-alignment on the 
RAID5 device itself?



Does this also apply to Linux/SW RAID5?  Or are there any caveats that 
are not taken into account since it is based in SW vs. HW?


---


In the case of SW or HW RAID, when you place a RAID-aware filesystem 
directly on it, I don't see any potential problems.


Also, if md's superblock version/placement actually mattered, it'd be pretty 
strange. The space available for actual use - be it partitions or a filesystem 
directly - should always be nicely aligned. I don't know that for sure, though.


If you use SW partitionable RAID, or HW RAID with partitions, then you would
have to align the partitions on a chunk boundary manually. Any self-respecting 
OS shouldn't complain that a partition doesn't start on a cylinder boundary 
these days. LVM can complicate life a bit too, if you want its volumes to be 
chunk-aligned.
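[Editor's note: the manual alignment Michal describes boils down to picking the first sector at or after fdisk's traditional default start (sector 63) that falls on a chunk boundary. A minimal sketch; the chunk sizes below are examples, not prescriptions.]

```python
SECTOR_BYTES = 512

def first_aligned_sector(min_sector: int, chunk_kib: int) -> int:
    """Smallest start sector >= min_sector whose byte offset is a
    multiple of the RAID chunk size."""
    chunk_sectors = chunk_kib * 1024 // SECTOR_BYTES
    # ceiling division: round min_sector up to the next chunk boundary
    return -(-min_sector // chunk_sectors) * chunk_sectors

print(first_aligned_sector(63, 64))    # 64 KiB chunk  -> sector 128
print(first_aligned_sector(63, 256))   # 256 KiB chunk -> sector 512
```

This reproduces the figure from the Microsoft KB quoted earlier in the thread: for a 256 KiB chunk, the first aligned start is sector 512.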


With NTFS the problem is that it's not aware of the underlying RAID in any 
way. It starts with a 16-sector boot record, somewhat compatible with 
ancient FAT. My blind guess would be to try to align the very first sector 
of $Mft with your chunk. Also, the mentioned boot sector is referenced as 
$Boot, so I don't know whether a large cluster won't automatically extend 
it to the full cluster size. Experiment, YMMV :)





Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Jon Nelson
On 12/19/07, Michal Soltys [EMAIL PROTECTED] wrote:
 Justin Piszcz wrote:
 
  Or is there a better way to do this, does parted handle this situation
  better?
 
  What is the best (and correct) way to calculate stripe-alignment on the
  RAID5 device itself?
 
 
  Does this also apply to Linux/SW RAID5?  Or are there any caveats that
  are not taken into account since it is based in SW vs. HW?
 
  ---

 In the case of SW or HW RAID, when you place a RAID-aware filesystem directly
 on it, I don't see any potential problems.

 Also, if md's superblock version/placement actually mattered, it'd be pretty
 strange. The space available for actual use - be it partitions or a filesystem
 directly - should always be nicely aligned. I don't know that for sure, though.

 If you use SW partitionable RAID, or HW RAID with partitions, then you would
 have to align the partitions on a chunk boundary manually. Any self-respecting
 OS shouldn't complain that a partition doesn't start on a cylinder boundary
 these days. LVM can complicate life a bit too, if you want its volumes to be
 chunk-aligned.

That, for me, is the next question - how can one educate LVM about the
underlying block device so that logical volumes carved out of that
space align properly? Many of us have experienced 30% (or so)
performance losses for the convenience of LVM (and mighty convenient
it is).
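[Editor's note: one way to reason about Jon's question is that what matters is where LVM places the first physical extent (pe_start) relative to the array's chunk size, since every LV starts on a PE boundary. A minimal sketch of the arithmetic; the 192 KiB pe_start used below is an assumption based on common LVM2 defaults of that era - check the real value with `pvs -o +pe_start`.]

```python
def pe_start_is_aligned(pe_start_kib: int, chunk_kib: int) -> bool:
    """True if LVM's first physical extent falls on a RAID chunk boundary."""
    return pe_start_kib % chunk_kib == 0

# Hypothetical numbers: a 256 KiB chunk vs. an assumed 192 KiB default pe_start
print(pe_start_is_aligned(192, 256))   # False: every LV starts misaligned
print(pe_start_is_aligned(256, 256))   # True once pe_start is padded to 256 KiB
```

If pe_start turns out to be misaligned, a commonly cited workaround on LVM2 of this vintage is to pad the metadata area at pvcreate time (e.g. `pvcreate --metadatasize 250k`, which rounds the metadata area up so pe_start lands on 256 KiB); later LVM2 releases added an explicit data-alignment option for this. Both the flag behavior and the rounding are worth verifying against your LVM version.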


-- 
Jon