Re: RAID10 far (f2) read throughput on random and sequential / read-ahead

2008-02-23 Thread Keld Jørn Simonsen
I made a reference to your work in the wiki howto on performance.
Thanks!

Keld

On Fri, Feb 22, 2008 at 04:14:05AM +, Nat Makarevitch wrote:
 'md' performs wonderfully. Thanks to every contributor!
 
 I pitted it against a 3ware 9650 and 'md' won on nearly every account (albeit on
 RAID5 for sequential I/O the 3ware is a distant winner):
 http://www.makarevitch.org/rant/raid/#3wmd
 
 On RAID10 f2 a small read-ahead reduces the throughput on sequential read, but
 even a low value (768 for the whole 'md' block device, 0 for the underlying
 spindles) enables very good sequential read performance (300 MB/s on 6 low-end
 Hitachi 500 GB spindles).
 
 What baffles me is that, on a 1.4TB array served by a box having 12 GB RAM (low
 cache-hit ratio), the random access performance remains stable and high (450
 IOPS with 48 threads, 20% writes - 10% fsync'ed), even with a fairly high
 read-ahead (16k). How comes?!
 


Re: suns raid-z / zfs

2008-02-18 Thread Keld Jørn Simonsen
On Mon, Feb 18, 2008 at 09:51:15PM +1100, Neil Brown wrote:
 On Monday February 18, [EMAIL PROTECTED] wrote:
  On Mon, Feb 18, 2008 at 03:07:44PM +1100, Neil Brown wrote:
   On Sunday February 17, [EMAIL PROTECTED] wrote:
Hi

   
It seems like a good way to avoid the performance problems of raid-5
/raid-6
   
   I think there are better ways.
  
  Interesting! What do you have in mind?
 
 A Log Structured Filesystem always does large contiguous writes.
 Aligning these to the raid5 stripes wouldn't be too hard and then you
 would never have to do any pre-reading.
 
  
  and what are the problems with zfs?
 
 Recovery after a failed drive would not be an easy operation, and I
 cannot imagine it being even close to the raw speed of the device.

I thought this was a problem with most RAID types: while reconstructing,
performance is quite slow. And as there has been some damage, this is
expected, and there is probably not much to be done about it.

Or is there? Are there any RAID types that perform reasonably well
while one disk is under repair? The performance could be crucial
for some applications.

One could think of clever arrangements so that, say, two disks could go
down and the rest of an array with 10-20 drives could still function
reasonably well, even during the reconstruction. As far as I can tell
from the code, the reconstruction itself does not impede normal
performance much, as normal operation takes precedence over
reconstruction operations.

Hmm, my understanding would then be that, for both random reads and writes,
performance in typical raids would only be reduced by the IO bandwidth
of the failed disk.

Sequential R/W performance for raid10,f would be hurt, though, degrading
to random-IO performance for the drives involved.

Raid5/6 would be hurt much more for reading, as all the remaining drives
need to be read to reconstruct correct data for the failed one.
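
To illustrate for myself what a degraded raid5 read costs, here is a toy C
sketch (plain byte arrays with made-up sizes, not the md code): serving a
chunk that lived on the failed disk means reading every surviving data
chunk plus the parity chunk and XORing them together.

    #include <stdio.h>

    #define CHUNK 4   /* toy chunk size in bytes */

    int main(void)
    {
        unsigned char d0[CHUNK] = "AAA", d1[CHUNK] = "BBB", d2[CHUNK] = "CCC";
        unsigned char parity[CHUNK], rebuilt[CHUNK];
        int i;

        /* parity as written while the array was healthy */
        for (i = 0; i < CHUNK; i++)
            parity[i] = d0[i] ^ d1[i] ^ d2[i];

        /* pretend d1's disk has failed: rebuild its chunk from the rest */
        for (i = 0; i < CHUNK; i++)
            rebuilt[i] = d0[i] ^ d2[i] ^ parity[i];

        printf("rebuilt chunk: %s\n", (char *)rebuilt);
        return 0;
    }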


So it looks like, if performance is important to you during a
reconstruction, then you should avoid raid5/6 and use the mirrored raid
types. Given a big installation with a workload of mostly random reading
and writing, it does not matter much which mirrored raid type you choose,
as they all perform about equally for random IO, even when reconstructing.
Is that correct advice?

  

But does it stripe? One could think that rewriting stripes
other places would damage the striping effects.
   
   I'm not sure what you mean exactly.  But I suspect your concerns here
   are unjustified.
  
  More precisely. I understand that zfs always write the data anew.
  That would mean at other blocks on the partitions, for the logical blocks
  of the file in question. So the blocks on the partitions will not be
  adjacant. And striping will not be possible, generally.
 
 The important part of striping is that a write is spread out over
 multiple devices, isn't it.
 
 If ZFS can choose where to put each block that it writes, it can
 easily choose to write a series of blocks to a collection of different
 devices, thus getting the major benefit of striping.

I see 2 major benefits of striping: one is that many drives are involved,
and the other is that the stripes are allocated adjacent to each other, so
that IO on one drive can just proceed to the next physical blocks when one
stripe has been processed. Depending on the size of the IO operations
involved, first one or more disks in a stripe are processed, and then the
following stripes are processed. ZFS misses the second part of that
optimization, I think.
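
To make the second point concrete, here is a minimal C sketch (my own
illustration with made-up chunk size and device count, not ZFS or md code)
of how classic striping maps consecutive logical chunks to adjacent
physical offsets on each drive:

    #include <stdio.h>

    int main(void)
    {
        const int ndevs = 4;        /* made-up number of drives */
        const long chunk_kib = 256; /* made-up chunk size       */
        long lchunk;

        /* consecutive logical chunks rotate across the drives, and each
         * drive's chunks sit at consecutive physical offsets, so IO on
         * one drive can simply continue at the next physical block */
        for (lchunk = 0; lchunk < 8; lchunk++) {
            int dev = lchunk % ndevs;
            long dev_offset_kib = (lchunk / ndevs) * chunk_kib;
            printf("logical chunk %ld -> drive %d, offset %ld KiB\n",
                   lchunk, dev, dev_offset_kib);
        }
        return 0;
    }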

Best regards
Keld


suns raid-z / zfs

2008-02-17 Thread Keld Jørn Simonsen
Hi

Any opinions on Sun's zfs/raid-z?
It seems like a good way to avoid the performance problems of raid-5/raid-6.

But does it stripe? One could think that rewriting stripes
to other places would damage the striping effect.

Or is the performance only meant to be good for random read/write?

Can the code be lifted to Linux? I understand that it is already in
FreeBSD. Does Sun's license prevent this?

And could something like this be built into existing file systems like
ext3 and xfs? They could have a multipartition layer in their code, and
then the heuristics to optimize block access could also apply to stripe
access.

best regards
keld


patch for raid10,f>1 to operate like raid0

2008-02-12 Thread Keld Jørn Simonsen
This patch changes the disk to be read for layout far > 1 to always be
the disk with the lowest block address.

Thus the chunks to be read will always be (for a fully functioning array)
from the first band of stripes, and the raid will then work as a raid0
consisting of the first band of stripes.
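
To illustrate (this is not part of the patch, just my understanding of the
far=2 layout): copy 0 of each chunk is laid out raid0-like in the first
band, and copy 1 sits in the second band shifted by one device, so reading
only the lowest-address copies is exactly a raid0 over the first band.
A small C sketch with made-up numbers:

    #include <stdio.h>

    int main(void)
    {
        const int ndisks = 2;   /* made-up array size */
        int k;

        for (k = 0; k < 6; k++) {
            int d0 = k % ndisks;            /* copy 0: raid0-like first band */
            int d1 = (k + 1) % ndisks;      /* copy 1: second band, shifted  */
            long off = k / ndisks;          /* chunk offset within each band */
            printf("chunk %d: copy0 on disk %d (band0+%ld), copy1 on disk %d (band1+%ld)\n",
                   k, d0, off, d1, off);
        }
        return 0;
    }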

Some advantages:

The fastest part of the disks involved, the outer sectors, will be used.
The outer blocks of a disk may be as much as 100 % faster than the inner blocks.

Average seek time will be smaller, as seeks will always be confined to the
first part of the disks.

Mixed disks with different performance characteristics will work better,
as they will work as raid0: the sequential read rate will be the number of
disks involved times the IO rate of the slowest disk.

If a disk is malfunctioning, the first working disk that has the lowest
block address for the logical block will be used.

Signed-off-by: Keld Simonsen [EMAIL PROTECTED]

--- raid10.c2008-02-12 00:50:59.0 +0100
+++ raid10-ks.c 2008-02-12 00:51:09.0 +0100
@@ -537,7 +537,8 @@
 	current_distance = abs(r10_bio->devs[slot].addr -
 			       conf->mirrors[disk].head_position);
 
-	/* Find the disk whose head is closest */
+	/* Find the disk whose head is closest,
+	   or for far > 1 the closest to the partition beginning */
 
 	for (nslot = slot; nslot < conf->copies; nslot++) {
 		int ndisk = r10_bio->devs[nslot].devnum;
@@ -557,7 +557,11 @@
 			slot = nslot;
 			break;
 		}
-		new_distance = abs(r10_bio->devs[nslot].addr -
+
+		/* for far > 1 always use the lowest address */
+		if (conf->far_copies > 1)
+			new_distance = r10_bio->devs[nslot].addr;
+		else new_distance = abs(r10_bio->devs[nslot].addr -
 				   conf->mirrors[ndisk].head_position);
 		if (new_distance < current_distance) {
 			current_distance = new_distance;


my io testing scripts

2008-02-11 Thread Keld Jørn Simonsen
Here are my testing scripts used in the performance howto:
http://linux-raid.osdl.org/index.php/Home_grown_testing_methods

=Hard disk performance scripts=
Here are the scripts that I used for my performance measuring. Use at your own risk.
They destroy the contents of the partitions involved. The /dev/md raid needs to
be stopped before initiating the test.

Copyright Keld Simonsen, [EMAIL PROTECTED] 2008. Licensed under the GPL.

iotest:
   
   #!/bin/sh
   # invoked by e.g.
   # iotest "mdadm -R -C /dev/md1 --chunk=256 -l 10 -n 2 -p f2" /dev/md1 /mnt/md1 ext3 "/dev/hdb5 /dev/hdd5"
   echo "\n $1 $5 \n" >> /tmp/results
   echo $1 $5
   $1 $5
   mkfs -t $4 $2
   mkdir $3
   mount $2 $3
   cd $3
   echo "\n makefiles \n" >> /tmp/results
   mkfiles 200
   echo "\n remakefiles \n" >> /tmp/results
   mkfiles 200
   echo "\n catall \n" >> /tmp/results
   cat * > /dev/null
   echo "\n catnull \n" >> /tmp/results
   catnull
   cd
   umount $2
   mdadm -S $2
   echo "\n finish $1 $5 \n" >> /tmp/results
   
Be careful with this script, and remember to change the ordinary-disk test
to use only one partition.
   
iorun:
   #!/bin/sh
   # set up ram disk
   DISKS="/dev/sda2 /dev/sdb2"
   iostat -k 10 >> /tmp/results &
   iotest "" /dev/sda2 /mnt/sda2 ext3 ""
   iotest "mdadm -C /dev/md1 --chunk=256 -R -l  0 -n 2" /dev/md1 /mnt/md1 ext3 "$DISKS"
   iotest "mdadm -C /dev/md1 --chunk=256 -R -l  1 -n 2" /dev/md1 /mnt/md1 ext3 "$DISKS"
   iotest "mdadm -C /dev/md1 --chunk=256 -R -l 10 -n 2" /dev/md1 /mnt/md1 ext3 "$DISKS"
   iotest "mdadm -C /dev/md1 --chunk=256 -R -l 10 -n 2 -p f2" /dev/md1 /mnt/md1 ext3 "$DISKS"
   # iotest "mdadm -C /dev/md1 --chunk=256 -R -l 10 -n 2 -p o2" /dev/md1 /mnt/md1 ext3 "$DISKS"
   
mkfiles:
   #!/bin/bash
   for (( i = 1; i < $1 ; i++ )) ; do dd if=/dev/hda1 of=$i bs=1MB count=40 ; done
   for (( i = 1; i < $1 ; i++ )) ; do dd if=/dev/hda1 of=$i bs=1MB count=40 & done
   
catnull:
   #!/bin/tcsh
   foreach i ( * )
	cat $i > /dev/null &
   end
   wait


howto on performance

2008-02-11 Thread Keld Jørn Simonsen
I have put up a new howto text on performance:
http://linux-raid.osdl.org/index.php/Performance#Performance_of_raids_with_2_disks

Enjoy!
Keld

=Performance of raids with 2 disks=

I have made some testing of performance of different types of RAIDs,
with 2 disks involved. I have used my own home grown testing methods,
which are quite simple, to test sequential and random reading and writing of 
200 files of 40 MB. The tests were meant to see what performance I could 
get out of a system mostly oriented towards file serving, such as a mirror site.

My configuration was

1800 MHz AMD Sempron(tm) Processor 3100+
1500 MB RAM
2 x Hitachi Ultrastar SATA-II 1 TB.
Linux version 2.6.12-26mdk

Figures are in MB/s, and the file system was ext3. Times were measured with iostat,
and an estimate for steady performance was taken. The times varied quite
a lot over the different 10 second intervals; for example, the estimate of 155 MB/s
ranged from 135 MB/s to 163 MB/s. I then looked at the average over the period when
a test was running at full scale (all processes started, and none stopped).

RAID type       sequential read   random read   sequential write   random write
Ordinary disk         82              34               67               56
RAID0                155              80               97               80
RAID1                 80              35               72               55
RAID10                79              56               69               48
RAID10,f2            150              79               70               55

Random reads for RAID1 and RAID10 were quite unbalanced, almost all coming
from one of the disks.

The results are quite as expected:

RAID0 and RAID10,f2 reads are double the speed of an ordinary file system for
sequential reads (155 vs 82) and more than double for random reads (80 vs 35).

Writes (both sequential and random) are roughly the same for an ordinary disk,
RAID1, RAID10 and RAID10,f2: around 70 MB/s for sequential and 55 MB/s for random.

Sequential reads are about the same (80 MB/s) for an ordinary partition, RAID1
and RAID10.

Random reads for an ordinary partition and RAID1 are about the same (35 MB/s) and
about 50 % higher for RAID10. I am puzzled why RAID10 is faster than RAID1 here.

All in all, RAID10,f2 is the fastest mirrored RAID for both sequential and
random reading in this test, while it is about equal to the other mirrored
RAIDs when writing.

My kernel did not allow me to test RAID10,o2 as this is only supported from 
kernel 2.6.18.


Re: howto and faq

2008-02-10 Thread Keld Jørn Simonsen
On Sun, Feb 10, 2008 at 10:05:13AM +, David Greaves wrote:
 Keld Jørn Simonsen wrote:
 
  The list description at
  http://vger.kernel.org/vger-lists.html#linux-raid
  does list af FAQ, http://www.linuxdoc.org/FAQ/
 Yes, that should be amended. Drop them a line about the FAQ too

I will.

  So our FAQ info is pretty out of date. I think it would be nice to have
  a wiki like we have for the Howto. This would mean that we have much
  better means to let new people make their mark, and avoid the problem
  that we have today with really outdated info.
 
 There seems to be no point in having separate wikis for the FAQ and HOWTO
 elements of documentation. Especially since a lot of FAQs are How do I... by
 definition the answer is a HOWTO.
 
  So can we put up a wiki somewhere for this, or should we just extend the
  wiki howto pages to also include a faq section?
 So just extend the existing wiki.

OK, so let's have a combined howto and faq.

I would then like that to be reflected in the main page.
I would rather that this be called Howto and FAQ - Linux raid
than Main Page - Linux Raid. Is that possible?

And then, how do we structure the pages? I think we need a new section
for the FAQ. 

And then I would like a clearer statement on the relation between the
linux-raid mailing list and the pages, right in the top of the main page.

 I set the wiki up at osdl to ensure that if a bus hit me then Neil or others
 would have a rational and responsive organisation to go to to change 
 ownership.
 
 I've been writing to some of the other FAQ/Doc organisations sporadically for
 over a year now and had no response from any of them. It's a very poor aspect 
 of
 OSS...

Looks like a good move.

I have had a look at other search engines, Yahoo and MSN.
Our pages do show up within the first 10 hits for linux raid.
So that is not too bad. Still, Google has the
http://linux-raid.osdl.org/ page as number 127. That is very bad.
Maybe it would help if it were referenced from Wikipedia?

Best regards
Keld


Re: howto and faq

2008-02-10 Thread Keld Jørn Simonsen
On Sun, Feb 10, 2008 at 06:21:08PM +, David Greaves wrote:
 Keld Jørn Simonsen wrote:
  I would then like that to be reflected in the main page.
  I would rather that this be called Howto and FAQ - Linux raid
  than Main Page - Linux Raid. Is that possible?
 
 Just like C has a main() wiki's have a Main Page :)
 
 I guess it could be changed but I think it involves editing the Mediawiki 
 config
 - maybe next time I'm in there...

OK, good.

 
  And then, how do we structure the pages? I think we need a new section
  for the FAQ.
 
 By all means create an FAQ page and link to answers or other relevant sections
 of the wiki. Bear in mind that this is a reference work and whilst it may
 contain tutorials the idea is that it contains (reasonably) authoritative
 information about the linux raid subsystem (linking to the source, kernel docs
 or man pages if that's more appropriate).

Yes, I will be conservative and robust in what I write there.

  And then I would like a clearer statement on the relation between the
  linux-raid mailing list and the pages, right in the top of the main page.
 The relationship is loose - the statement as it stands describes the current
 state of affairs. If Neil feels that he could or would like to help the case 
 by
 declaring a more official relationship then that's his call. To be fair I work
 on these pages on and off as the mood takes me :) if I was Neil I'd be keeping
 an eye on it and waiting for the right level of community involvement.

OK, I will only state something like the usual FAQ thing: please consult
the FAQ before submitting questions to the list.

  I have had a look at other search engines, yahoo and msn.
  Our pages do show up within the 10 first hits for linux raid.
  So that is not that bad. Still, Google has the
  http://linux-raid.osdl.org/ page as number 127. That is very bad.
  Maybe something about it being referenced from wikipedia?
 
 I'm not an expert at gaming the search engines - more than happy to do 
 rational
 things like linking from Wikipedia and other reference sites.
 
 I am sad that I've had such a poor response from the other linux documentation
 sites... maybe a Slashdot article not so much about doc-rot but about the
 difficulty of combating doc-rot would help...
 
 Maybe they'd take more notice if I said the linux raid subsystem maintainer
 says... - dunno.

I think we should just contact some more people...
And then do some linking ourselves.

Best regards
Keld


Re: raid5: two writing algorithms

2008-02-08 Thread Keld Jørn Simonsen
On Fri, Feb 08, 2008 at 12:51:39PM +1100, Neil Brown wrote:
 On Friday February 8, [EMAIL PROTECTED] wrote:
  On Fri, Feb 08, 2008 at 07:25:31AM +1100, Neil Brown wrote:
   On Thursday February 7, [EMAIL PROTECTED] wrote:
  
So I hereby give the idea for inspiration to kernel hackers.
   
   and I hereby invite you to read the code ;-)
  
  I did some reading.  Is there somewhere a description of it, especially
  the raid code, or are the comments and the code the best documentation?
 
 No.  If a description was written (and various people have tried to
 describe various parts) it would be out of date within a few months :-(

OK, I was under the impression that some of the code did not change
much. E.g. you said that there had not been any work on optimizing
raid10 for performance since the 2.6.12 kernel I was using. And in
the raid5 code, the last copyright notice at the top is
Copyright (C) 2002, 2003 H. Peter Anvin. That is 5 years ago,
and your name is not on it. So I did not look that much into that code,
thinking nothing had been done there for ages. Maybe you could add your
name to it; that would only be fair. The same comment goes for other
modules (where it is relevant).

 Look for READ_MODIFY_WRITE and RECONSTRUCT_WRITE  no.  That
 only applied to raid6 code now..
 Look instead for the 'rcw' and 'rmw' counters, and then at
 'handle_write_operations5'  which does different things based on the
 'rcw' variable.
 
 It used to be a lot clearer before we implemented xor-offload.  The
 xor-offload stuff is good, but it does make the code more complex.

OK, I think it is fairly well documented there; I can at least follow the
logic, and I think it is a good approach to have the flow
description/strategy included directly in the code. Given that there are many
changes to the code, separate files for code and description could
easily get badly out of sync.

 
 
  
  Do you say that this is already implemented?
 
 Yes.

That is very good!


Do you know if other implementations of this, e.g. commercial controller
code, have this facility? If not, we could list this as an advantage of
Linux raid. Anyway, it would be implicit in performance documentation.
I do plan to write up something on performance, soonish. The howto
is hopelessly outdated.

IMHO such code should make the performance of raid5 random writes not
that bad - better than the reputation that raid5 is hopelessly slow for
database writing. I think raid5 would be less than twice as slow as
raid1 for random writing.

  Well, I do have a hack in mind, on the raid10,f2.
  I need to investigate some more, and possibly test out
  what really happens. But maybe the code already does what I want it to.
  You are possibly the one that knows the code best, so maybe you can tell
  me if raid10,f2 always does its reading in the first part of the disks?
 
 Yes, I know the code best.
 
 No, raid10,f2 doesn't always use the first part of the disk.  Getting
 it to do that would be a fairly small change in 'read_balance' in
 md/raid10.c.
 
 I'm not at all convinced that the read balancing code in raid10 (or
 raid1) really does the best thing.  So any improvements - backed up
 with broad testing - would be most welcome.

I think I know where to do my proposed changes, and how it could be done.
So maybe in a not too distant future I will have done my first kernel
hack!

Best regards
keld


Re: draft howto on making raids for surviving a disk crash

2008-02-07 Thread Keld Jørn Simonsen
On Thu, Feb 07, 2008 at 09:05:04AM +0100, Luca Berra wrote:
 On Wed, Feb 06, 2008 at 04:45:39PM +0100, Keld Jørn Simonsen wrote:
 On Wed, Feb 06, 2008 at 10:05:58AM +0100, Luca Berra wrote:
 On Sat, Feb 02, 2008 at 08:41:31PM +0100, Keld Jørn Simonsen wrote:
 Make each of the disks bootable by lilo:
 
   lilo -b /dev/sda /etc/lilo.conf1
   lilo -b /dev/sdb /etc/lilo.conf2
 There should be no need for that.
 to achieve the above effect with lilo you use
 raid-extra-boot=mbr-only
 in lilo.conf
 
 Make each of the disks bootable by grub
 install grub with the command
 grub-install /dev/md0
 
 I have already changed the text on the wiki. Still I am not convinced it 
 is the best advice that is described.
 
 lilo -b /dev/md0 (without a raid-extra-boot line in lilo.conf) will
 install lilo on the boot sector of the partitions containing /dev/md0
 (and it will break with 1.1 sb)

I think 1.1 superblocks will break all booting with lilo and grub,
but 1.1 superblocks are not the default in current distributions.

When would 1.1 superblocks be a problem for new users of raid?

 for grub, do you have any doubt about the grub-install script not
 working correctly?

No, I think the grub description is OK. I only meant the lilo
description.

Best regards
keld


Re: recommendations for stripe/chunk size

2008-02-07 Thread Keld Jørn Simonsen
On Thu, Feb 07, 2008 at 06:40:12AM +0100, Iustin Pop wrote:
 On Thu, Feb 07, 2008 at 01:31:16AM +0100, Keld Jørn Simonsen wrote:
  Anyway, why does a SATA-II drive not deliver something like 300 MB/s?
 
 Wait, are you talking about a *single* drive?

Yes, I was talking about a single drive.

 In that case, it seems you are confusing the interface speed (300MB/s)
 with the mechanical read speed (80MB/s).

I thought the 300 MB/s was the transfer rate between the disk and the
controller's memory in its buffers, but you indicate that this is the
speed between the controller's buffers and main RAM.

I am, like Neil, amazed by the speeds that we get on current hardware, but
still I would like to see if we could use the hardware better.
Asynchronous IO could be a way forward. I have written some mainframe
utilities where asynchronous IO was the key to the performance,
so I thought that it could also come in handy in the Linux kernel.

If about 80 MB/s is the maximum we can get out of a current SATA-II
7200 rpm drive, then I think there is not much to be gained from
asynchronous IO.

 If you are asking why is a
 single drive limited to 80 MB/s, I guess it's a problem of mechanics.
 Even with NCQ or big readahead settings, ~80-~100 MB/s is the highest
 I've seen on 7200 RPM drives. And yes, there is no wait until the CPU
 processes the current data until the drive reads the next data; drives
 have a builtin read-ahead mechanism.
 
 Honestly, I have 10x as many problems with the low random I/O throughput
 rather than with the (high, IMHO) sequential I/O speed.

I agree that random IO is the main factor in most server installations.
But on workstations the sequential IO is also important, as the only
user is sometimes waiting for the computer to respond.
And then I think that booting can benefit from faster sequential IO.
And not to forget, I think it is fun to make my hardware run faster!

best regards
Keld


howto and faq

2008-02-07 Thread Keld Jørn Simonsen
Hi 

I am trying to get some order to linux raid info.

I think we should have a faq and a howto for the linux-raid list.

The list description at
http://vger.kernel.org/vger-lists.html#linux-raid
does list a FAQ, http://www.linuxdoc.org/FAQ/
I cannot read it just now - the server
www.linuxdoc.org does not respond.
I then tried the Google cache - which had no info, and then the
Internet Archive, where the latest entry on this had some notes on
Debian and the GFDL - quite irrelevant.

There are other FAQs that claim to be the FAQ for
linux-raid. One is http://www.faqs.org/contrib/linux-raid/,
which is quite extensive, but from 2003 (about 5 years old).

So our FAQ info is pretty out of date. I think it would be nice to have
a wiki like we have for the Howto. This would mean that we have much
better means to let new people make their mark, and avoid the problem
that we have today with really outdated info.

So can we put up a wiki somewhere for this, or should we just extend the
wiki howto pages to also include a faq section?

For the howto, I have asked the VGER people to add info to our list
description that we have a wiki howto at http://linux-raid.osdl.org/

I believe it is the case that this howto is our official howto.
I have added a remark at the top of the text hinting that this wiki
howto is the official howto of the linux-raid list, though I did not state
it as such.

Hope this gives some clarity to the situation.

best regards
keld


raid5: two writing algorithms

2008-02-07 Thread Keld Jørn Simonsen
As I understand it, there are 2 valid algorithms for writing in raid5:

1. Calculate the parity data by XOR'ing all data of the relevant data
chunks.

2. Calculate the parity data by kind of XOR-subtracting the old data to
be changed, and then XOR-adding the new data (XOR-subtract and XOR-add
are actually the same operation).

There are situations where method 1 is the fastest, and situations where
method 2 is the fastest.

My idea is then that the raid5 code in the kernel can calculate which
method is faster.

Method 1 is faster if all the data is already available. I understand that
this method is employed in the current kernel. This would e.g. be the case
with sequential writes.

Method 2 is faster if no data is available in core. It requires
2 reads and 2 writes, which will normally be faster than the n reads and 1
write of method 1, possibly except for n=2. Method 2 is thus normally faster
for random writes.
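
A minimal C sketch of the two calculations on plain byte arrays (this is
just my illustration, not the kernel's stripe handling; chunk size and
data are made up):

    #include <stdio.h>

    #define CHUNK 8   /* toy chunk size in bytes */

    /* Method 1: parity = XOR of all data chunks in the stripe. */
    static void parity_full(unsigned char *parity,
                            unsigned char data[][CHUNK], int ndata)
    {
        int i, d;
        for (i = 0; i < CHUNK; i++) {
            parity[i] = 0;
            for (d = 0; d < ndata; d++)
                parity[i] ^= data[d][i];
        }
    }

    /* Method 2: XOR the old data out of the old parity and XOR the new
     * data in; only the changed chunk and the parity chunk are touched. */
    static void parity_update(unsigned char *parity,
                              const unsigned char *old_data,
                              const unsigned char *new_data)
    {
        int i;
        for (i = 0; i < CHUNK; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }

    int main(void)
    {
        unsigned char data[3][CHUNK] = { "chunk00", "chunk01", "chunk02" };
        unsigned char newd[CHUNK] = "CHUNK01";
        unsigned char p1[CHUNK], p2[CHUNK];
        int i;

        parity_full(p2, data, 3);           /* parity of the original stripe */
        parity_update(p2, data[1], newd);   /* method 2 on the updated chunk */

        for (i = 0; i < CHUNK; i++)
            data[1][i] = newd[i];
        parity_full(p1, data, 3);           /* method 1 on the updated stripe */

        for (i = 0; i < CHUNK; i++)
            if (p1[i] != p2[i]) {
                puts("mismatch");
                return 1;
            }
        puts("both methods give the same parity");
        return 0;
    }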

I think that method 2 is not used in the kernel today. Maybe I am wrong,
but I did have a look in the kernel code.

So I hereby give the idea for inspiration to kernel hackers.

Yoyr kernel hacker wannabe
keld


Re: raid5: two writing algorithms

2008-02-07 Thread Keld Jørn Simonsen
On Fri, Feb 08, 2008 at 07:25:31AM +1100, Neil Brown wrote:
 On Thursday February 7, [EMAIL PROTECTED] wrote:
  As I understand it, there are 2 valid algoritms for writing in raid5.
  
  1. calculate the parity data by XOR'ing all data of the relevant data
  chunks.
  
  2. calculate the parity data by kind of XOR-subtracting the old data to
  be changed, and then XOR-adding the new data. (XOR-subtract and XOR-add
  is actually the same).
  
  There are situations where method 1 is the fastest, and situations where
  method 2 is the fastest.
  
  My idea is then that the raid5 code in the kernel can calculate which
  method is the faster. 
  
  method 1 is faster, if all data is already available. I understand that
  this method is employed in the current kernel. This would eg be the case
  with sequential writes.
  
  Method 2 is faster, if no data is available in core. It would require
  2 reads and two writes, which always will be faster than n reads and 1
  write, possibly except for n=2. method 2 is thus faster normally for
  random writes.
  
  I think that method 2 is not used in the kernel today. Mayby I am wrong,
  but I did have a look in the kernel code.
 
 It is very odd that you would think something about the behaviour of
 the kernel with actually having looked.
 
 It also seems a little arrogant to have a clever idea and assume that
 no one else has thought of it before.

Oh well, I have to admit that I do not understand the code fully.
I am not a seasoned kernel hacker, as I also indicated in my ad hoc
signature.

  So I hereby give the idea for inspiration to kernel hackers.
 
 and I hereby invite you to read the code ;-)

I did some reading.  Is there somewhere a description of it, especially
the raid code, or are the comments and the code the best documentation?

Do you say that this is already implemented?

I am sorry if you think I am mailing too much on the list.
But I happen to think it is fun.
And I do try to give something back.

 Code reading is a good first step to being a
  
  Yoyr kernel hacker wannabe
^
 
 NeilBrown

Well, I do have a hack in mind, on the raid10,f2.
I need to investigate some more, and possibly test out
what really happens. But maybe the code already does what I want it to.
You are possibly the one that knows the code best, so maybe you can tell
me if raid10,f2 always does its reading in the first part of the disks?

best regards
keld


Re: Purpose of Document? (was Re: draft howto on making raids for surviving a disk crash)

2008-02-06 Thread Keld Jørn Simonsen
On Wed, Feb 06, 2008 at 08:24:37AM -0600, Moshe Yudkowsky wrote:
 I read through the document, and I've signed up for a Wiki account so I 
 can edit it.
 
 One of the things I wanted to do was correct the title. I see that there 
 are *three* different Wiki pages about how to build a system that boots 
 from RAID. None of them are complete yet.
 
 So, what is the purpose of this page? I think the purpose is a complete 
 description of how to use RAID to build a system that not only boots 
 from RAID but is robust against other hazards such as file system 
 corruption.

You are right that there is more than one wiki page addressing
closely related issues. I also considered whether there was a need for the
new page, and discussed it with David.

And yes, my idea was to make a howto on building a system that can
survive a disk crash - a simple system that can also work for a
workstation. In fact the main audience is possibly there.

So my focus is: survive a failing disk, and keep it simple.

Best regards
Keld


Re: draft howto on making raids for surviving a disk crash

2008-02-06 Thread Keld Jørn Simonsen
On Wed, Feb 06, 2008 at 10:05:58AM +0100, Luca Berra wrote:
 On Sat, Feb 02, 2008 at 08:41:31PM +0100, Keld Jørn Simonsen wrote:
 Make each of the disks bootable by lilo:
 
   lilo -b /dev/sda /etc/lilo.conf1
   lilo -b /dev/sdb /etc/lilo.conf2
 There should be no need for that.
 to achieve the above effect with lilo you use
 raid-extra-boot=mbr-only
 in lilo.conf
 
 Make each of the disks bootable by grub
 install grub with the command
 grub-install /dev/md0

I have already changed the text on the wiki. Still, I am not convinced
that the advice described there is the best.

best regards
keld


Re: raid1 or raid10 for /boot

2008-02-06 Thread Keld Jørn Simonsen
On Wed, Feb 06, 2008 at 01:52:11PM -0500, Bill Davidsen wrote:
 Keld Jørn Simonsen wrote:
 I understand that lilo and grub only can boot partitions that look like
 a normal single-drive partition. And then I understand that a plain
 raid10 has a layout which is equivalent to raid1. Can such a raid10
 partition be used with grub or lilo for booting?
 And would there be any advantages in this, for example better disk
 utilization in the raid10 driver compared with raid?
   
 
 I don't know about you, but my /boot goes with zero use between boots, 
 efficiency and performance improvements strike as a distinction without 
 a difference, while adding complexity without benefit is always a bad idea.
 
 I suggest that you avoid having a learning experience and stick with 
 raid1.

I agree with you, it was only a theoretical question.

Best regards
keld


Re: recommendations for stripe/chunk size

2008-02-06 Thread Keld Jørn Simonsen
On Wed, Feb 06, 2008 at 09:25:36PM +0100, Wolfgang Denk wrote:
 In message [EMAIL PROTECTED] you wrote:
 
   I actually  think the kernel should operate with block sizes
   like this and not wth 4 kiB blocks. It is the readahead and the elevator
   algorithms that save us from randomly reading 4 kb a time.
  
 
  Exactly, and nothing save a R-A-RW cycle if the write is a partial chunk.
 
 Indeed kernel page size is an important factor in such optimizations.
 But you have to keep in mind that this is mostly efficient for (very)
 large strictly sequential I/O operations only -  actual  file  system
 traffic may be *very* different.
 
 We implemented the option to select kernel page sizes of  4,  16,  64
 and  256  kB for some PowerPC systems (440SPe, to be precise). A nice
 graphics of the effect can be found here:
 
 https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

Yes, that is also what I would expect for sequential reads.
Random writes of small data blocks, the kind done in big databases,
should show another picture, as others have also described.

If you look at a single disk, would you get improved performance with
asynchronous IO?

I am a bit puzzled about my SATA-II performance: nominally I could get
300 MB/s on SATA-II, but I only get about 80 MB/s. Why is that?
I thought it was because of latency with synchronous reads.
I.e., when a chunk is read, you need to complete the IO operation and then
issue a new one. In the meantime, while the CPU is doing these
calculations, the disk has spun a little, and to get the next data chunk
we need to wait for the disk to spin around until the head is positioned
over the right data place on the disk surface. Is that so? Or does the
controller take care of this, reading the rest of the not-yet-requested
track into a buffer, which can then be delivered next time? Modern disks
often have buffers of about 8 or 16 MB. I wonder why they don't have
bigger buffers.

Anyway, why does a SATA-II drive not deliver something like 300 MB/s?

best regards
keld


Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-02-05 Thread Keld Jørn Simonsen
On Thu, Jan 31, 2008 at 02:55:07AM +0100, Keld Jørn Simonsen wrote:
 On Wed, Jan 30, 2008 at 11:36:39PM +0100, Janek Kozicki wrote:
  Keld Jørn Simonsen said: (by the date of Wed, 30 Jan 2008 23:00:07 
  +0100)
  
 
 All the raid10's will have double time for writing, and raid5 and raid6
 will also have double or triple writing times, given that you can do
 striped writes on the raid0. 

For raid5 and raid6 I think this is even worse. My take is that for
raid5, when you write something, you first read the chunk data involved,
then you read the parity data, then you xor-subtract the data to be
changed and xor-add the new data, and then write the new data chunk
and the new parity chunk. In total 2 reads and 2 writes. The reads/writes
happen on the same chunks, so latency is minimized. But in essence it is
still 4 IO operations, where it is only 2 writes on raid1/raid10;
that is only half the speed for writing on raid5 compared to raid1/10.

On raid6 this amounts to 6 IO operations, resulting in 1/3 of the
writing speed of raid1/10.

I note in passing that there is no difference between xor-subtract and
xor-add.

Also, I assume that you can calculate the parities of both raid5 and
raid6 given the old parity chunks and the old and new data chunk.
If you have to calculate the new parities by reading all the component
data chunks, this is going to be really expensive, both in IO and CPU.
For a 10-drive raid5 this would involve reading the 8 other data chunks,
making writes 5 times as expensive as on raid1/10 (10 IOs vs 2).
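
Just to spell out the arithmetic, a tiny C sketch of the rough IO count
per small random write under read-modify-write (back-of-the-envelope
numbers following the reasoning above, not a benchmark):

    #include <stdio.h>

    int main(void)
    {
        const int raid1_ios = 2;       /* write both mirror copies            */
        const int raid5_rmw = 2 + 2;   /* read data+parity, write data+parity */
        const int raid6_rmw = 3 + 3;   /* read data+P+Q, write data+P+Q       */

        printf("random-write IOs: raid1/10=%d raid5=%d raid6=%d\n",
               raid1_ios, raid5_rmw, raid6_rmw);
        printf("cost vs raid1/10: raid5=%.1fx raid6=%.1fx\n",
               (double)raid5_rmw / raid1_ios, (double)raid6_rmw / raid1_ios);
        return 0;
    }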

best regards
keld


recommendations for stripe/chunk size

2008-02-05 Thread Keld Jørn Simonsen
Hi

I am looking at revising our howto. I see a number of places where a
chunk size of 32 kiB is recommended, and even recommendations on
maybe using sizes of 4 kiB. 

My own take on that is that this really hurts performance.
Normal disks have a rotation speed of between 5400 (laptop),
7200 (ide/sata) and 10000 (SCSI) rounds per minute, giving an average
spinning time for one round of 6 to 12 ms, and an average latency of half
of this, that is 3 to 6 ms. Then you need to add head movement, which
is something like 2 to 20 ms - in total an average seek time of 5 to 26 ms,
averaging around 13-17 ms.

In about 15 ms you can read, on current SATA-II (300 MB/s) or ATA/133,
something like 600 to 1200 kB, at actual transfer rates of
80 MB/s on SATA-II and 40 MB/s on ATA/133. So to get some bang for the buck
and actually transfer some data, you should have something like 256/512 kiB
chunks. With a transfer rate of 50 MB/s and chunk sizes of 256 kiB,
giving a time of about 20 ms per transaction,
you should be able with random reads to transfer 12 MB/s - my
actual figure is about 30 MB/s, which is possibly because of the
elevator effect of the file system driver. With a size of 4 kiB per chunk
you would have a time of 15 ms per transaction, or 66 transactions per
second, or a transfer rate of 250 kB/s. So 256 kiB vs 4 kiB speeds up
the transfer by a factor of about 50.
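
The same estimate as a small C sketch (using the rough 15 ms access time
and 50 MB/s media rate from the paragraph above; these are assumed round
numbers, not measurements):

    #include <stdio.h>

    int main(void)
    {
        const double access_ms = 15.0;    /* average seek + rotational latency */
        const double media_mb_s = 50.0;   /* sustained media transfer rate     */
        const double chunks_kib[] = { 4, 32, 256, 512, 1024 };
        unsigned i;

        for (i = 0; i < sizeof(chunks_kib) / sizeof(chunks_kib[0]); i++) {
            double kib = chunks_kib[i];
            double ms_per_io = access_ms + kib / (media_mb_s * 1024.0) * 1000.0;
            double mb_s = (kib / 1024.0) / (ms_per_io / 1000.0);
            printf("%5.0f KiB chunk: %5.1f ms per IO, ~%6.2f MB/s random read\n",
                   kib, ms_per_io, mb_s);
        }
        return 0;
    }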

I actually think the kernel should operate with block sizes
like this and not with 4 kiB blocks. It is the readahead and the elevator
algorithms that save us from randomly reading 4 kiB at a time.

I also see that there are some memory constraints on this.
With maybe 1000 processes reading, as for my mirror service,
256 kiB buffers would be acceptable, occupying 256 MB RAM.
That is reasonable, and I could even tolerate 512 MB of RAM used.
But going to 1 MiB buffers would be overdoing it for my configuration.

What would be the recommended chunk size for today's equipment?

Best regards
Keld


Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-02-05 Thread Keld Jørn Simonsen
On Tue, Feb 05, 2008 at 11:54:27AM -0500, Justin Piszcz wrote:
 
 
 On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:
 
 On Thu, Jan 31, 2008 at 02:55:07AM +0100, Keld Jørn Simonsen wrote:
 On Wed, Jan 30, 2008 at 11:36:39PM +0100, Janek Kozicki wrote:
 Keld Jørn Simonsen said: (by the date of Wed, 30 Jan 2008 23:00:07 
 +0100)
 
 
 All the raid10's will have double time for writing, and raid5 and raid6
 will also have double or triple writing times, given that you can do
 striped writes on the raid0.
 
 For raid5 and raid6 I think this is even worse. My take is that for
 raid5 when you write something, you first read the chunk data involved,
 then you read the parity data, then you xor-subtract the data to be
 changed, and you xor-add the new data, and then write the new data chunk
 and the new parity chunk. In total 2 reads and 2 writes. The read/writes
 happen on the same chunks, so latency is minimized. But in essence it is
 still 4 IO operations, where it is only 2 writes on raid1/raid10,
 that is only half the speed for writing on raid5 compared to raid1/10.
 
 On raid6 this amounts to 6 IO operations, resulting in 1/3 of the
 writing speed of raid1/10.
 
 I note in passing that there is no difference between xor-subtract and
 xor-add.
 
 Also I assume that you can calculate the parities of both raid5 and
 raid6 given the old parities chunks and the old and new data chunk.
 If you have to calculate the new parities by reading all the component
 data chunks this is going to be really expensive, both in IO and CPU.
 For a 10 drive raid5 this would involve reading 9 data chunks, and
 making writes 5 times as expensive as raid1/10.
 
 best regards
 keld
 
 
 On my benchmarks RAID5 gave the best overall speed with 10 raptors, 
 although I did not play with the various offsets/etc as much as I have 
 tweaked the RAID5.

Could you give some figures? 

best regards
keld


Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-02-05 Thread Keld Jørn Simonsen
On Tue, Feb 05, 2008 at 05:28:27PM -0500, Justin Piszcz wrote:
 
 
 Could you give some figures?
 
 I remember testing with bonnie++ and raid10 was about half the speed 
 (200-265 MiB/s) as RAID5 (400-420 MiB/s) for sequential output, but input 
 was closer to RAID5 speeds/did not seem affected (~550MiB/s).

Impressive. What level of raid10 was involved, and what type of
equipment - how many disks? Maybe the better output for raid5 could be
due to some striping - AFAIK raid5 stripes quite well, and
writes being almost equal to reads indicates that the writes are
striping too.

best regards
keld


Re: raid1 or raid10 for /boot

2008-02-04 Thread Keld Jørn Simonsen
On Mon, Feb 04, 2008 at 09:17:35AM +, Robin Hill wrote:
 On Mon Feb 04, 2008 at 07:34:54AM +0100, Keld Jørn Simonsen wrote:
 
  I understand that lilo and grub only can boot partitions that look like
  a normal single-drive partition. And then I understand that a plain
  raid10 has a layout which is equivalent to raid1. Can such a raid10
  partition be used with grub or lilo for booting?
  And would there be any advantages in this, for example better disk
  utilization in the raid10 driver compared with raid?
  
 A plain RAID-10 does _not_ have a layout equivalent to RAID-1 and
 _cannot_ be used for booting (well, possibly a 2-disk RAID-10 could -
 I'm not sure how that'd be layed out).  RAID-10 uses striping as well as
 mirroring, and the striping breaks both grub and lilo (and, AFAIK, every
 other boot manager currently out there).

Yes, it is understood that raid10,f2 uses striping, but a raid10 with near=2,
far=1 does not use striping, and this is what you get if you just run
mdadm --create /dev/md0 -l 10 -n 2 /dev/sda1 /dev/sdb1

best regards
keld


Re: draft howto on making raids for surviving a disk crash

2008-02-03 Thread Keld Jørn Simonsen
On Sun, Feb 03, 2008 at 10:53:51AM -0500, Bill Davidsen wrote:
 Keld Jørn Simonsen wrote:
 This is intended for the linux raid howto. Please give comments.
 It is not fully ready /keld
 
 Howto prepare for a failing disk
 
 6. /etc/mdadm.conf
 
 Something here on /etc/mdadm.conf. What would be safe, allowing
 a system to boot even if a disk has crashed?
   
 
 Recommend PARTITIONS by used

Thanks Bill for your suggestions, which I have incorporated in the text.

However, I do not understand what to do with the remark above.
Please explain.

Best regards
keld


Re: raid1 and raid 10 always writes all data to all disks?

2008-02-03 Thread Keld Jørn Simonsen
On Sun, Feb 03, 2008 at 10:56:01AM -0500, Bill Davidsen wrote:
 Keld Jørn Simonsen wrote:
  I found a sentence in the HOWTO:
 
 raid1 and raid 10 always writes all data to all disks
 
 I think this is wrong for raid10.
 
 eg
 
 a raid10,f2 of 4 disks only writes to two of the disks -
 not all 4 disks. Is that true?
   
 
 I suspect that really should have read all mirror copies, in the 
 raid10 case.

OK, I changed the text to:

raid1 always writes all data to all disks.

raid10 always writes as many copies of the data as the raid holds.
For example, on a raid10,f2 or raid10,o2 of 6 disks, the data will only
be written 2 times.

Best regards
Keld


raid1 or raid10 for /boot

2008-02-03 Thread Keld Jørn Simonsen
I understand that lilo and grub can only boot partitions that look like
a normal single-drive partition. And then I understand that a plain
raid10 has a layout which is equivalent to raid1. Can such a raid10
partition be used with grub or lilo for booting?
And would there be any advantages in this, for example better disk
utilization in the raid10 driver compared with raid1?

best regards
keld


draft howto on making raids for surviving a disk crash

2008-02-02 Thread Keld Jørn Simonsen
This is intended for the linux raid howto. Please give comments.
It is not fully ready /keld

Howto prepare for a failing disk

The following describes how to prepare a system to survive
if one disk fails. This can be important for a server which is
intended to run at all times. The description is mostly aimed at
small servers, but it can also be used for
workstations, to protect them from losing data and to keep them running
even if a disk fails. Some recommendations on larger server setups are given
at the end of the howto.

This requires some extra hardware, especially disks, and the description
will also touch on how to make the most of the disks, be it in terms of
available disk space or input/output speed.

1. Creating partitions

We recommend creating partitions for /boot, root, swap and other file systems.
This can be done with fdisk, parted or maybe a graphical interface
like the Mandriva/PCLinuxOS harddrake2.  It is recommended to use drives
with equal sizes and performance characteristics.

If we are using the 2 drives sda and sdb, then sfdisk
may be used to make all the partitions into raid partitions:

   sfdisk -c /dev/sda 1 fd
   sfdisk -c /dev/sda 2 fd
   sfdisk -c /dev/sda 3 fd
   sfdisk -c /dev/sda 5 fd
   sfdisk -c /dev/sdb 1 fd
   sfdisk -c /dev/sdb 2 fd
   sfdisk -c /dev/sdb 3 fd
   sfdisk -c /dev/sdb 5 fd

Using:

   fdisk -l /dev/sda /dev/sdb

The partition layout could then look like this:

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1          37      297171   fd  Linux raid autodetect
/dev/sda2              38        1132     8795587+  fd  Linux raid autodetect
/dev/sda3            1133        1619     3911827+  fd  Linux raid autodetect
/dev/sda4            1620      121601   963755415    5  Extended
/dev/sda5            1620      121601   963755383+  fd  Linux raid autodetect

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1          37      297171   fd  Linux raid autodetect
/dev/sdb2              38        1132     8795587+  fd  Linux raid autodetect
/dev/sdb3            1133        1619     3911827+  fd  Linux raid autodetect
/dev/sdb4            1620      121601   963755415    5  Extended
/dev/sdb5            1620      121601   963755383+  fd  Linux raid autodetect



2. Prepare for boot

The system should be set up to boot from multiple devices, so that
if one disk fails, the system can boot from another disk.

On Intel hardware, there are two common boot loaders, grub and lilo.
Both grub and lilo can only boot off a raid1; they cannot boot off
any other software raid device type. The reason they can boot off
a raid1 is that they see the raid1 as a normal disk; they then just use
one of the disks when booting. The boot stage only involves loading the kernel
with an initrd image, so not much data is needed for this. The kernel,
the initrd and other boot files can be put in a small /boot partition.
We recommend something like 200 MB on an ext3 raid1.

Make the raid1 and ext3 filesystem:

   mdadm --create /dev/md0 --chunk=256 -R -l 1 -n 2 /dev/sda1 /dev/sdb1
   mkfs -t ext3 /dev/md0

Make each of the disks bootable by lilo:

   lilo -b /dev/sda /etc/lilo.conf1
   lilo -b /dev/sdb /etc/lilo.conf2

Make each of the disks bootable by grub

(to be described)

3. The root file system

The root file system can be on another raid than the /boot partition.
We recommend a raid10,f2, as the root file system will mostly be read, and
the raid10,f2 raid type is the fastest for reads, while also being sufficiently
fast for writes. Other relevant raid types would be raid10,o2 or raid1.

It is recommended to use udev for /dev, as this runs in RAM, and you
thus avoid a number of reads and writes to disk.

It is recommended that all file systems are mounted with the noatime option;
this avoids writing to the file system inodes every time a file has been
read.

Make the raid10,f2 and ext3 filesystem:

   mdadm --create /dev/md1 --chunk=256 -R -l 10 -n 2 -p f2 /dev/sda2 /dev/sdb2
   mkfs -t ext3 /dev/md1


4. The swap file system

If a disk fails that processes are swapped to, then all these processes fail.
These may be vital processes for the system, or vital jobs on the system. You
can prevent the failing of the processes by having the swap partitions on a
raid. The swap area needed is normally relatively small compared to the
overall disk space available, so we recommend the faster raid types over the
more space-economic ones. The raid10,f2 type seems to be the fastest here;
other relevant raid types could be raid10,o2 or raid1.

Given that you have created a raid array, you can just make the swap partition

Re: draft howto on making raids for surviving a disk crash

2008-02-02 Thread Keld Jørn Simonsen
On Sat, Feb 02, 2008 at 09:32:54PM +0100, Janek Kozicki wrote:
 Keld Jørn Simonsen said: (by the date of Sat, 2 Feb 2008 20:41:31 +0100)
 
  This is intended for the linux raid howto. Please give comments.
  It is not fully ready /keld
 
 very nice. do you intend to put it on http://linux-raid.osdl.org/ 

Yes, that is the intention.

 As wiki, it will be much easier for our community to fix errors and
 add updates.

Agreed. But I will not put it up before I am sure it is reasonably
flawless, i.e. it will at least work. I have already found a few errors myself.

best regards
keld


Re: RAID 1 and grub

2008-02-02 Thread Keld Jørn Simonsen
On Wed, Jan 30, 2008 at 06:47:19PM -0800, David Rees wrote:
 On Jan 30, 2008 6:33 PM, Richard Scobie [EMAIL PROTECTED] wrote:
 
 FWIW, this step is clearly marked in the Software-RAID HOWTO under
 Booting on RAID:
 http://tldp.org/HOWTO/Software-RAID-HOWTO-7.html#ss7.3

A good and extensive reference, but somewhat outdated.

 BTW, I suspect you are missing the command setup from your 3rd
 command above, it should be:
 
 # grub
  grub> device (hd0) /dev/hdc
  grub> root (hd0,0)
  grub> setup (hd0)

I do not grasp this. How and where is it said that two disks are
involved? hda and hdc should both be involved.

Best regards
keld


raid1 and raid 10 always writes all data to all disks?

2008-02-02 Thread Keld Jørn Simonsen
 I found a sentence in the HOWTO:

raid1 and raid 10 always writes all data to all disks

I think this is wrong for raid10.

eg

a raid10,f2 of 4 disks only writes to two of the disks -
not all 4 disks. Is that true?

best regards
keld


Re: In this partition scheme, grub does not find md information?

2008-01-30 Thread Keld Jørn Simonsen
On Wed, Jan 30, 2008 at 03:47:30PM +0100, Peter Rabbitson wrote:
 Michael Tokarev wrote:
 
 With 5-drive linux raid10:
 
A  B  C  D  E
0  0  1  1  2
2  3  3  4  4
5  5  6  6  7
7  8  8  9  9
   10 10 11 11 12
...
 
 AB can't be removed - 0, 5.  AC CAN be removed, as
 are AD.  But not AE - losing 2 and 7.  And so on.

I see. Does the kernel code allow this? And mdadm?

And can B+E be removed safely, and C+E and B+D? 
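
To answer my own question I tried to write it down; a quick C sketch
assuming the rotation shown in the quoted layout (copy c of chunk k on
disk (2k + c) mod 5), listing which disk pairs never hold both copies of
any chunk:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const int ndisks = 5, nchunks = 20;
        char shared[5][5];
        int k, i, j;

        memset(shared, 0, sizeof(shared));

        /* mark disk pairs that hold both copies of some chunk */
        for (k = 0; k < nchunks; k++) {
            int d0 = (2 * k) % ndisks;
            int d1 = (2 * k + 1) % ndisks;
            shared[d0][d1] = shared[d1][d0] = 1;
        }

        printf("disk pairs that could fail together without data loss:\n");
        for (i = 0; i < ndisks; i++)
            for (j = i + 1; j < ndisks; j++)
                if (!shared[i][j])
                    printf("  %c + %c\n", 'A' + i, 'A' + j);
        return 0;
    }

With that assumption it prints A+C, A+D, B+D, B+E and C+E, i.e. any two
non-adjacent disks - but please correct me if the placement formula is wrong.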

best regards
keld


Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-01-30 Thread Keld Jørn Simonsen
On Wed, Jan 30, 2008 at 07:21:33PM +0100, Janek Kozicki wrote:
 Hello,
 
 Yes, I know that some levels give faster reading and slower writing, etc.
 
 I want to talk here about a typical workstation usage: compiling
 stuff (like kernel), editing openoffice docs, browsing web, reading
 email (email: I have a webdir format, and in boost mailing list
 directory I have 14000 files (posts), opening this directory takes
 circa 10 seconds in sylpheed). Moreover, opening .pdf files, more
 compiling of C++ stuff, etc...
 
 I have a remote backup system configured (with rsnapshot), which does
 backups two times a day. So I'm not afraid to lose all my data due to
 disc failure. I want absolute speed.
 
 Currently I have Raid-0, because I was thinking that this one is
 fastest. But I also don't need twice the capacity. I could use Raid-1
 as well, if it was faster.
 
 Due to recent discussion about Raid-10,f2 I'm getting worried that
 Raid-0 is not the fastest solution, but instead a Raid-10,f2 is
 faster.
 
 So how really is it, which level gives maximum overall speed?
 
 
 I would like to make a benchmark, but currently, technically, I'm not
 able to. I'll be able to do it next month, and then - as a result of
 this discussion - I will switch to other level and post here
 benchmark results.
 
 How does overall performance change with the number of available drives?
 
 Perhaps Raid-0 is best for 2 drives, while Raid-10 is best for 3, 4
 and more drives?

Theoretically, raid0 and raid10,f2 should be the same for reading, given the
same size of the md partition, etc. For writing, raid10,f2 should be half the
speed of raid0. This should hold for both sequential and random reads/writes.
But I would like to have real test numbers.
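
A crude way to get such numbers (assuming a test file on the array that is larger than RAM, and a kernel new enough to have /proc/sys/vm/drop_caches):

  sync; echo 3 > /proc/sys/vm/drop_caches
  time dd if=/mnt/raid/testfile of=/dev/null bs=1M                              # sequential read
  time dd if=/dev/zero of=/mnt/raid/testfile bs=1M count=8192 conv=fdatasync    # sequential write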

best regards
keld


Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-01-30 Thread Keld Jørn Simonsen
On Wed, Jan 30, 2008 at 11:36:39PM +0100, Janek Kozicki wrote:
 Keld Jørn Simonsen said: (by the date of Wed, 30 Jan 2008 23:00:07 +0100)
 
  Theoretically, raid0 and raid10,f2 should be the same for reading, given the
  same size of the md partition, etc. For writing, raid10,f2 should be half the
  speed of raid0. This should hold for both sequential and random reads/writes.
  But I would like to have real test numbers.
 
 Me too. Thanks. Are there any other raid levels that may count here?
 Raid-10 with some other options?

Given that you want maximum throughput for both reading and writing, I
think there is only one way to go, and that is raid0.

All the raid10s will take about twice as long for writing, and raid5 and raid6
will also have double or triple write times, compared with the striped
writes you can do on raid0.

For random and sequential writing in the normal case (no faulty disks) I would
guess that all of the raid10s, raid1 and raid5 are about equally fast, given
the same amount of hardware (raid5 and raid6 a little slower because of the
inactive parity chunks).

For random reading, raid0, raid1 and raid10 should be equally fast, with
raid5 a little slower, because one of the disks is virtually out of
operation, as it is used for the XOR parity chunks. raid6 should be
somewhat slower due to 2 non-operational disks. raid10,f2 may have a
slight edge because it virtually uses only half of each disk, giving better
average seek times, and it uses the faster outer disk halves.

For sequential reading, raid0 and raid10,f2 should be equally fast.
Possibly raid10,o2 comes quite close. My guess is that raid5 then is
next, achieving striping rates, but with the loss of one parity drive,
and then raid1 and raid10,n2 with equal performance.

In degraded mode, I guess for random reads/writes the difference is not
big between any of the raid1, raid5 and raid10 layouts, while sequential
reads will be especially bad for raid10,f2, approaching the random read
rate; the others will keep the normal speed of the filesystem on top
(ext3, reiserfs, xfs etc).

Theory, theory, theory. Show me some real figures.

Best regards
Keld


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 06:13:41PM +0300, Michael Tokarev wrote:
 
 Linux raid10 MODULE (which implements that standard raid10
 LEVEL in full) adds some quite.. unusual extensions to that
 standard raid10 LEVEL.  The resulting layout is also called
 raid10 in linux (ie, not giving new names), but it's not that
 raid10 (which is again the same as raid1+0) as commonly known
 in various literature and on the internet.  Yet raid10 module
 fully implements STANDARD raid10 LEVEL.

My understanding is that you can have a linux raid10 of only 2
drives, while the standard RAID 1+0 requires 4 drives, so this is a huge
difference.

I am not sure what properties vanilla linux raid10 (near=2, far=1)
has. I think it can run with only 1 disk, but I think it
does not have striping capabilities. It would be nice to have more
info on this, e.g. in the man page.

Is there an official web page for mdadm?
And maybe the raid faq could be updated?

best regards
keld 


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 05:07:27PM +0300, Michael Tokarev wrote:
 Peter Rabbitson wrote:
  Moshe Yudkowsky wrote:
 
 
  It is exactly what the names implies - a new kind of RAID :) The setup
  you describe is not RAID10 it is RAID1+0.
 
 Raid10 IS RAID1+0 ;)
 It's just that linux raid10 driver can utilize more.. interesting ways
 to lay out the data.

My understanding is that raid10 is different from RAID1+0.

Traditional RAID1+0 is composed of two RAID1s combined into one RAID0.
It takes 4 drives to make it work. Linux raid10 only takes 2 drives to
work.

Traditional RAID1+0 has only one way of laying out the blocks.

raid10 has a number of ways to do the layout, namely the near, far and
offset ways, layout=n2, f2 and o2 respectively.

Traditional RAID1+0 can only stripe across half of the disks involved,
while raid10 can stripe across all disks in the far and offset layouts.
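
For example, the layout is chosen at creation time with the --layout (-p) option (device names here are only placeholders):

  mdadm --create /dev/md0 -l 10 -n 4 -p n2 /dev/sd[abcd]1   # near layout, like classic raid1+0
  mdadm --create /dev/md0 -l 10 -n 4 -p f2 /dev/sd[abcd]1   # far layout
  mdadm --create /dev/md0 -l 10 -n 4 -p o2 /dev/sd[abcd]1   # offset layout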



I looked around on the net for documentation of this. The first hits (on
Google) for mdadm did not have descriptions of raid10. Wikipedia
describes raid 10 as a synonym for raid1+0. I think there is too much
confusion around the raid10 term, and the marvellous linux raid10
layouts are a little-known secret beyond maybe the circles of this
linux-raid list. We should tell others more about the wonders of raid10.

And then I would like a good reference describing how raid10,o2
works and why bigger chunk sizes help it.

Best regards
keld


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 05:02:57AM -0600, Moshe Yudkowsky wrote:
 Neil, thanks for writing. A couple of follow-up questions to you and the 
 group:
 
 If the answers above don't lead to a resolution, I can create two RAID1 
 pairs and join them using LVM. I would take a hit by using LVM to tie 
 the pairs instead of RAID0, I suppose, but I would avoid the performance 
 hit of multiple md drives on a single physical drive, and I could even 
 run a hot spare through a sparing group. Any comments on the performance 
 hit -- is raid1L a really bad idea for some reason?

You can of course construct a traditional raid-1+0 in Linux as you describe here,
but this is different from linux raid10 (with its different layout
possibilities). And installing grub/lilo on both disks of a raid1
for /boot seems to be the right way to get a reasonably secure system.
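
As a sketch (partition names are only placeholders), the classic nested construction versus the single-level linux raid10 would look like:

  # classic RAID1+0: two RAID1 pairs tied together with RAID0
  mdadm --create /dev/md0 -l 1 -n 2 /dev/sda2 /dev/sdb2
  mdadm --create /dev/md1 -l 1 -n 2 /dev/sdc2 /dev/sdd2
  mdadm --create /dev/md2 -l 0 -n 2 /dev/md0 /dev/md1
  # single-level linux raid10 over the same four partitions
  mdadm --create /dev/md3 -l 10 -n 4 -p f2 /dev/sd[abcd]2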

best regards
keld


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 09:57:48AM -0600, Moshe Yudkowsky wrote:
 
 In my 4 drive system, I'm clearly not getting 1+0's ability to use grub 
 out of the RAID10.  I expect it's because I used 1.2 superblocks (why 
 not use the latest, I said, foolishly...) and therefore the RAID10 -- 
 with even number of drives -- can't be read by grub. If you'd patch that 
 information into the man pages that'd be very useful indeed.

If you have 4 drives, I think the right thing is to use a raid1 with 4
drives for your /boot partition. Then you can survive 3 disks
crashing!


If you want the extra performance, then I think you should not bother
too much about the kernel and initrd load time - which of course does not
stripe across the disks, but some performance improvement can be expected.
Then you can have the rest of /root on a raid10,f2 with 4 disks.
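
A sketch of that setup (partition names assumed; the old 0.90 superblock sits at the end of each member, so the boot loader can read a member as if it were a plain filesystem):

  mdadm --create /dev/md0 -e 0.90 -l 1 -n 4 /dev/sd[abcd]1   # /boot, survives 3 failed disks
  mdadm --create /dev/md1 -l 10 -n 4 -p f2 /dev/sd[abcd]2    # the rest of the system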

best regards
keld


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 07:51:07PM +0300, Michael Tokarev wrote:
 Peter Rabbitson wrote:
 []
  However if you want to be so anal about names and specifications: md
  raid 10 is not a _full_ 1+0 implementation. Consider the textbook
  scenario with 4 drives:
  
  (A mirroring B) striped with (C mirroring D)
  
  When only drives A and C are present, md raid 10 with near offset will
  not start, whereas standard RAID 1+0 is expected to keep clunking away.
 
 Ugh.  Yes. offset is linux extension.
 
 But md raid 10 with default, n2 (without offset), configuration will behave
 exactly like in classic docs.

I would like to understand this fully. What Peter described for md raid10,
"md raid 10 with near offset", I believe is vanilla raid10 without any
options (or near=2, far=1). Will that not start if we are unlucky enough to
have 2 drives failing, but lucky enough that the two
remaining drives actually hold all the data?

Same question for a raid10,f2 array. I think it would be easy to
investigate, when the number of drives is even, whether all data is present,
and then happily run an array with some failed disks.

Say for a 4 drive raid10,f2 that disks A and D are failing; then all data
should be present on drives B and C, given that A and C have the even
chunks, and B and D have the odd chunks. Likewise for a 6 drive array,
etc., for all multiples of 2 with f2.

best regards
keld


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 07:46:58PM +0300, Michael Tokarev wrote:
 Keld Jørn Simonsen wrote:
  On Tue, Jan 29, 2008 at 06:13:41PM +0300, Michael Tokarev wrote:
  Linux raid10 MODULE (which implements that standard raid10
  LEVEL in full) adds some quite.. unusual extensions to that
  standard raid10 LEVEL.  The resulting layout is also called
  raid10 in linux (ie, not giving new names), but it's not that
  raid10 (which is again the same as raid1+0) as commonly known
  in various literature and on the internet.  Yet raid10 module
  fully implements STANDARD raid10 LEVEL.
  
  My understanding is that you can have a linux raid10 of only 2
  drives, while the standard RAID 1+0 requires 4 drives, so this is a huge
  difference.
 
 Ugh.  2-drive raid10 is effectively just a raid1.  I.e, mirroring
 without any striping. (Or, backwards, striping without mirroring).

OK.  

Uhm, well, I did not understand "(or, backwards, striping without
mirroring)". I don't think a 2 drive vanilla raid10 will do striping.
Please explain.

 Pretty much like with raid5 of 2 disks - it's the same as raid1.

I think in a raid5 of 2 disks, half of the chunks are parity chunks which
are evenly distributed over the two disks, and each parity chunk is the
XOR of its data chunk. But maybe I am wrong. Also, the behaviour of such
a raid5 is different from a raid1, as the parity chunks are not used as
data.
 
  I am not sure what vanilla linux raid10 (near=2, far=1)
  has of properties. I think it can run with only 1 disk, but I think it
 
 number of copies should be = number of disks, so no.

I have a clear understanding that in a vanilla linux raid10 (near=2, far=1)
you can run with one failing disk, that is with only one working disk.
Am I wrong?

  does not have striping capabilities. It would be nice to have more 
  info on this, eg in the man page. 
 
 It's all in there really.  See md(4).  Maybe it's not that
 verbose, but it's not a user's guide (as in: a large book),
 after all.

Some man pages have examples. Or info could be written in the faq or in
wikipedia.

Best regards
keld


linux raid faq

2008-01-29 Thread Keld Jørn Simonsen
Hmm, I read the Linux raid faq on
http://www.faqs.org/contrib/linux-raid/x37.html

It looks pretty outdated, referring to how to patch 2.2 kernels and
not mentioning new mdadm, nor raid10. It was not dated. 
It seemed to be related to the linux-raid list, telling where to find
archives of the list.

Maybe it is time for an update? Or is this not the right place to write stuff?

When I searched on Google for "raid faq", the first, say, 5-7 hits did not
mention raid10.

Maybe wikipedia is the way to go? I did contribute myself a little
there.

The software raid howto is dated v. 1.1 3rd of June 2004,
http://unthought.net/Software-RAID.HOWTO/Software-RAID.HOWTO.html
also pretty old.

best regards
keld


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 01:34:37PM -0600, Moshe Yudkowsky wrote:
 
 I'm going to convert back to the RAID 1 setup I had before for /boot, 2 
 hot and 2 spare across four drives. No, that's wrong: 4 hot makes the 
 most sense.
 
 And given that RAID 10 doesn't seem to confer (for me, as far as I can 
 tell) advantages in speed or reliability -- or the ability to mount just 
 one surviving disk of a mirrored pair -- over RAID 5, I think I'll 
 convert back to RAID 5, put in a hot spare, and do regular backups (as 
 always). Oh, and use reiserfs with data=journal.

Hmm, my idea was to use a 4 disk raid10,f2 for /root, or an o2
layout. I think it would offer quite a speed advantage over raid5.
At least, on a 4 disk raid5 I only got a random performance of about 130
MB/s, while the raid10 gave 180-200 MB/s. Also, sequential read was
significantly faster on raid10. I do think I can get about 320 MB/s
on the raid10,f2, but I need a bigger power supply to support my
disks before I can go on testing. The key here is a bigger readahead.
I only got 150 MB/s for raid5 sequential reads.
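
For reference, the readahead mentioned here is a per-block-device setting, e.g. (sector counts and device names are only an example):

  blockdev --setra 16384 /dev/md3                          # large readahead on the array
  for d in /dev/sd[abcd]; do blockdev --setra 0 $d; done   # little or none on the member disks
  blockdev --getra /dev/md3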

I think the sequential read speed could be significant for the boot time,
and also for the single user running on the system, namely the system
administrator (=me), even under reasonable load.

I would be interested if you would experiment with this wrt boot time,
for example the difference between /root on a raid5, raid10,f2 and raid10,o2.



 Comments back:
 
 Mr. Tokarev wrote:
 
 By the way, on all our systems I use small (256Mb for small-software 
 systems,
 sometimes 512M, but 1G should be sufficient) partition for a root 
 filesystem
 (/etc, /bin, /sbin, /lib, and /boot), and put it on a raid1 on all...
 ... doing [it]
 this way, you always have all the tools necessary to repair a damaged 
 system
 even in case your raid didn't start, or you forgot where your root disk is
 etc etc.
 
 An excellent idea. I was going to put just /boot on the RAID 1, but 
 there's no reason why I can't add a bit more room and put them all 
 there. (Because I was having so much fun on the install, I'm using 4GB 
 that I was going to use for swap space to mount base install and I'm 
 working from there to build the RAID. Same idea.)

If you put more than /boot on the raid1, then you will not get the added
performance of raid10 for all your system utilities. 

I am not sure about redundancy, but a raid1 and a raid10 should be
equally vulnerable to a 1 disk failure. If you use a 4 disk raid1 for
/root, then of course you can survive 3 disk crashes.

I am not sure that 4 disks in a raid1 for /root gives added performance,
as grub only sees the /root raid1 as a normal disk, but maybe some kind of
remounting makes it get its raid behaviour.


 Also, placing /dev on a tmpfs helps alot to minimize number of writes
 necessary for root fs.

I thought of using the noatime mount option for /root.
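
E.g. in /etc/fstab (device name and filesystem type are only an example):

  /dev/md1   /    xfs   defaults,noatime   0  1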

best regards
Keld


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 04:14:24PM -0600, Moshe Yudkowsky wrote:
 Keld Jørn Simonsen wrote:
 
 Based on your reports of better performance on RAID10 -- which are more 
 significant than I'd expected -- I'll just go with RAID10. The only 
 question now is if LVM is worth the performance hit or not.

Hmm, LVM for what purpose? For the root system, I think it is not
an issue. Just have a large enough partition; it is not more than 10-20
GB anyway, which is around 1% of the disk sizes that we talk about
today with new disks in raids.

 I would be interested if you would experiment with this wrt boot time,
 for example the difference between /root on a raid5, raid10,f2 and 
 raid10,o2.
 
 According to man md(4), the o2 is likely to offer the best combination 
 of read and write performance. Why would you consider f2 instead?

I have no experience with o2, and little experience with f2.
But I kind of designed f2. I have not fully grasped o2 yet.

But my take is that writes would mostly be random writes, and those are
almost the same for all layouts. However, when/if a disk is faulty, then
f2 has considerably worse performance for sequential reads,
approximating the performance of random reads, which in some cases is
about half the speed of sequential reads. For sequential reads and
random reads I think f2 would be faster than o2, due to the smaller
average seek times and the use of the faster part of the disk.

I am still wondering how o2 gets to do striping; I don't understand it,
given the layout schemes I have seen. f2, OTOH, is designed for striping.

I would like to see some figures, though. My testing environment is, as
said, not operational right now, but will possibly be OK later this
week.

 I'm unlikely to do any testing beyond running bonnie++ or something 
 similar once it's installed.

I do some crude testing, like concurrently reading 1000 files of 20 MB,
and then just doing a "cat file > /dev/null" of a 4 GB file. The RAM cache must
not be big enough to hold the files.
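
Roughly like this (paths are placeholders; the total amount of data must be well beyond what fits in RAM):

  for i in $(seq 1 1000); do cat /raid/test/file$i > /dev/null & done; wait
  time cat /raid/test/big4G > /dev/null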

Looking at boot times could also be interesting. I would like as little
downtime as possible.

But it depends on your purpose and thus pattern of use. Many systems
tend to be read oriented, and for that I think f2 is the better
alternative.

best regards
keld


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 06:32:54PM -0600, Moshe Yudkowsky wrote:
 
 Hmm, why would you put swap on a raid10? I would in a production
 environment always put it on separate swap partitions, possibly a number,
 given that a number of drives are available.
 
 In a production server, however, I'd use swap on RAID in order to 
 prevent server downtime if a disk fails -- a suddenly bad swap can 
 easily (will absolutely?) cause the server to crash (even though you can 
 boot the server up again afterwards on the surviving swap partitions).

I see. Which file system type would be good for this?
I normally use XFS, but maybe another FS is better, given that swap is used
very randomly (read/write).

Will a bad swap crash the system?
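
For what it is worth, swap needs no filesystem at all, just mkswap; a mirrored swap could be set up roughly like this (device names assumed), while the alternative of plain partitions with equal priority lets the kernel stripe pages across them without any redundancy:

  mdadm --create /dev/md4 -l 1 -n 2 /dev/sde2 /dev/sdf2
  mkswap /dev/md4
  swapon /dev/md4
  # alternative in /etc/fstab: equal-priority plain swap partitions
  # /dev/sde2  none  swap  sw,pri=1  0 0
  # /dev/sdf2  none  swap  sw,pri=1  0 0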

best regards
keld


Re: striping of a 4 drive raid10

2008-01-28 Thread Keld Jørn Simonsen
On Mon, Jan 28, 2008 at 01:32:48PM -0500, Bill Davidsen wrote:
 Neil Brown wrote:
 On Sunday January 27, [EMAIL PROTECTED] wrote:
   
 Hi
 
 I have tried to make a striping raid out of my new 4 x 1 TB
 SATA-2 disks. I tried raid10,f2 in several ways:
 
 1: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, md2 = raid0
 of md0+md1
 
 2: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, md2 = raid01,f2
 of md0+md1
 
 3: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, chunksize 
 of md0 =md1 =128 KB,  md2 = raid0 of md0+md1 chunksize = 256 KB
 
 4: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, chunksize
 of md0 = md1 = 128 KB, md2 = raid01,f2 of md0+md1 chunksize = 256 KB
 
 5: md0= raid10,f4 of sda1+sdb1+sdc1+sdd1
 
 
 Try
   6: md0 = raid10,f2 of sda1+sdb1+sdc1+sdd1
 
 Also try raid10,o2 with a largeish chunksize (256KB is probably big
 enough).
   
 
 Looking at the issues raised, there might be some benefit from having 
 the mirror chunks on the slower inner tracks of a raid10, and to read 
 from the outer tracks if the drives with the data on the outer tracks 
 are idle. This would appear to offer a transfer rate benefit overall.

Hmm, how do I do this? I think this is normal behaviour of a raid10,f2.
Is that so?

So you mean I should rather use f2 than o2? Or should I configure the f2
in some way?

My hdparm -t gives:

/dev/sda5:
 Timing buffered beginning disk reads:   82 MB in  1.00 seconds = 81.686 MB/sec
 Timing buffered ending disk reads:   42 MB in  1.03 seconds = 40.625 MB/sec
 Average seek time 13.714 msec, min=4.641, max=23.921
 Average track-to-track time 28.151 msec, min=26.729, max=28.730

So, yes, there is a reason to use the faster outer tracks - and to get the
faster access time that f2 gives. How does o2 behave here? Does it read
and seek over the whole disk?


As to your other comments in another mail, I could of course install
a newer kernel and mdadm, but then I would lose the support of my
supported and paid-for system. And Neil said that there have been no
performance fixes for f2 since the kernel I use (2.6.12).
I thought that o2 support had been included since 2.6.10 - but apparently not
so.

Best regards
keld


striping of a 4 drive raid10

2008-01-27 Thread Keld Jørn Simonsen
Hi

I have tried to make a striping raid out of my new 4 x 1 TB
SATA-2 disks. I tried raid10,f2 in several ways:

1: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, md2 = raid0
of md0+md1

2: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, md2 = raid01,f2
of md0+md1

3: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, chunksize of 
md0 =md1 =128 KB,  md2 = raid0 of md0+md1 chunksize = 256 KB

4: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, chunksize
of md0 = md1 = 128 KB, md2 = raid01,f2 of md0+md1 chunksize = 256 KB

5: md0= raid10,f4 of sda1+sdb1+sdc1+sdd1

My new disks give a transfer rate of about 80 MB/s, so I expected
to have something like 320 MB/s for the whole raid, but I did not get
more than about 180 MB/s.

I think it may be something with the layout, that in effect 
the drives should be something like:

  sda1  sdb1  sdc1  sdd1
   0     1     2     3
   4     5     6     7

And this was not really achievable with the combinations of raids above,
because those combinations give different block layouts.

How can it be done? Do we need a new raid type?

Best regards
keld


Re: striping of a 4 drive raid10

2008-01-27 Thread Keld Jørn Simonsen
On Mon, Jan 28, 2008 at 07:13:30AM +1100, Neil Brown wrote:
 On Sunday January 27, [EMAIL PROTECTED] wrote:
  Hi
  
  I have tried to make a striping raid out of my new 4 x 1 TB
  SATA-2 disks. I tried raid10,f2 in several ways:
  
  1: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, md2 = raid0
  of md0+md1
  
  2: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, md2 = raid01,f2
  of md0+md1
  
  3: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, chunksize of 
  md0 =md1 =128 KB,  md2 = raid0 of md0+md1 chunksize = 256 KB
  
  4: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, chunksize
  of md0 = md1 = 128 KB, md2 = raid01,f2 of md0+md1 chunksize = 256 KB
  
  5: md0= raid10,f4 of sda1+sdb1+sdc1+sdd1
 
 Try
   6: md0 = raid10,f2 of sda1+sdb1+sdc1+sdd1

That I already tried (and I wrongly stated that I used f4 instead of
f2). I twice had a throughput of about 300 MB/s, but since then I could
not reproduce the behaviour. Are there bugs in this that have been
corrected in newer kernels?


 Also try raid10,o2 with a largeish chunksize (256KB is probably big
 enough).

I tried that too, but my mdadm did not allow me to use the o flag.

My kernel is 2.6.12 and mdadm is v1.12.0 - 14 June 2005.
Can I upgrade mdadm alone to a newer version, and if so, which version is
recommended?

best regards
keld


Re: striping of a 4 drive raid10

2008-01-27 Thread Keld Jørn Simonsen
On Sun, Jan 27, 2008 at 08:11:35PM +, Peter Grandi wrote:
  On Sun, 27 Jan 2008 20:33:45 +0100, Keld Jørn Simonsen
  [EMAIL PROTECTED] said:
 
 keld Hi I have tried to make a striping raid out of my new 4 x
 keld 1 TB SATA-2 disks. I tried raid10,f2 in several ways:
 
 keld 1: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, md2 = 
 raid0
 keldof md0+md1
 keld 2: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, md2 = raid01,f2
 keldof md0+md1
 keld 3: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, chunksize 
 of 
 keldmd0 =md1 =128 KB,  md2 = raid0 of md0+md1 chunksize = 256 KB
 keld 4: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, chunksize
 keldof md0 = md1 = 128 KB, md2 = raid01,f2 of md0+md1 chunksize = 256 KB
 
 These stacked RAID levels don't make a lot of sense.
 
 keld 5: md0= raid10,f4 of sda1+sdb1+sdc1+sdd1
 
 This also does not make a lot of sense. Why have four mirrors
 instead of two?

My error, I did mean f2.

Anyway, 4 mirrors would make the array about 2 times faster than 2 disks, and given
disk prices these days this could make a lot of sense.

 Instead, try 'md0 = raid10,f2' for example. The first mirror of
 will be striped across the outer half of all four drives, and
 the second mirrors will be rotated in the inner half of each
 drive.
 
 Which of course means that reads will be quite quick, but writes
 and degraded operation will be slower.
 
 Consider this post for more details:
 
   http://www.spinics.net/lists/raid/msg18130.html

Thanks for the reference.

There is also more in the original article on possible layouts of what
is now known as raid10,f2

http://marc.info/?l=linux-raidm=107427614604701w=2

including performance enhancements due to use of the faster outer
sectors, and smaller average seek times because you can seek on only
half the disk.

best regards
keld


hdparm patch with min/max transfer rate, and min/avg/max access times

2008-01-25 Thread Keld Jørn Simonsen
Hi

I have made some patches to hdparm to report min/max transfer rates,
and min/avg/max access times. Enjoy!

http://std.dkuug.dk/keld/hdparm-7.7-ks.tar.gz

Best regards
keld


performance of raid10,f2 on 4 disks

2008-01-23 Thread Keld Jørn Simonsen
Hi!

I have played around with raid10,f2 on a 2 disk array,
and I really liked the performance of the sequential reads.
It looked like double the speed of a single disk, about 173 MB/s
for two SATA-2 disks.

I then went on to look at my 4 new SATA-2 disks. To get
the same kind of performance I created the array with:

mdadm --create /dev/md3 --chunk=256 -R -l 10 -n 4 -p f2 /dev/sd[abcd]1

And my first tests showed a sequential read rate of 320 MB/s.
Impressive! I then tried it a few more times, but then I could not
get more than around 160 MB/s, which is less than what I got on 2 disks.

Any ideas of what is going on?

Best regards
keld