Re: How many drives are bad?

2008-02-21 Thread Peter Rabbitson

Peter Grandi wrote:

In general, I'd use RAID10 (http://WWW.BAARF.com/), RAID5 in


Interesting movement. What do you think is their stance on Raid Fix? :)


Re: LVM performance

2008-02-19 Thread Peter Rabbitson

Oliver Martin wrote:
Interesting. I'm seeing a 20% performance drop too, with default RAID 
and LVM chunk sizes of 64K and 4M, respectively. Since 64K divides 4M 
evenly, I'd think there shouldn't be such a big performance penalty.


I am no expert, but as far as I have read it is not enough to have compatible 
chunk sizes (which is easy and most often the case). You also must stripe 
align the LVM chunks, so every chunk spans a whole number of raid stripes (not 
raid chunks). Check the output of `dmsetup table`. The last number is the 
offset of the underlying block device at which the LVM data portion starts. It 
must be divisible by the raid stripe length (the length varies for different 
raid types).


Currently LVM does not offer an easy way to do such alignment, you have to do 
it manually upon executing pvcreate. By using the option --metadatasize one 
can specify the size of the area between the LVM header (64KiB) and the start 
of the data area. So one would supply STRIPE_SIZE - 64 for metadatasize[*], 
and the result will be a stripe aligned LVM.
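For illustration, a minimal sketch of the alignment procedure described above, assuming a 4-disk raid5 with 64KiB chunks (stripe = 3 x 64 = 192KiB) and hypothetical device/VG names:

# leave STRIPE_SIZE - 64 = 128KiB between the 64KiB LVM header and the data
# area, so the data area starts on a 192KiB stripe boundary
pvcreate --metadatasize 128k /dev/md0
vgcreate vg_raid /dev/md0
# verify: the last number printed per mapping (start offset in 512-byte
# sectors) should be divisible by the stripe length (192KiB = 384 sectors)
dmsetup table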


This information is unverified, I just compiled it from different list threads 
and whatnot. I did this to my own arrays/volumes and I get near 100% raw 
speed. If someone else can confirm the validity of this - it would be great.


Peter

* The supplied number is always rounded up to be divisible by 64KiB, so the 
smallest total LVM header is at least 128KiB



Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-02-06 Thread Peter Rabbitson

Janek Kozicki wrote:

writing on raid10 is supposed to be half the speed of reading. That's
because it must write to both mirrors.



I am not 100% certain about the following rules, but afaik any raid 
configuration has a theoretical[1] maximum read speed of the combined speed of 
all disks in the array and a maximum write speed equal to the combined speed 
of a disk-length of a stripe. By disk-length I mean how many disks are needed 
to reconstruct a single stripe - the rest of the writes are redundancy and are 
essentially non-accountable work. For raid5 it is N-1. For raid6 - N-2. For 
linux raid 10 it is N-C+1 where C is the number of chunk copies. So for -p n3 
-n 5 we would get a maximum write speed of 3 x single drive speed. For raid1 
the disk-length of a stripe is always 1.
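As a purely illustrative sketch of the rule of thumb above (the per-drive speed is a made-up number, and this ignores all the external factors mentioned in [1]):

drive=60                                                # hypothetical MB/s per drive
raid5_write()  { echo $(( ($1 - 1) * drive )); }        # N-1 data disks per stripe
raid6_write()  { echo $(( ($1 - 2) * drive )); }        # N-2 data disks per stripe
raid10_write() { echo $(( ($1 - $2 + 1) * drive )); }   # N-C+1, C = chunk copies
raid10_write 5 3    # -p n3 -n 5  ->  180, i.e. 3 x single drive speed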


So the statement

IMHO raid5 could perform well here, because in *continuous* write
operation the blocks from other HDDs have just been written,
they stay in cache and can be used to calculate xor. So you could get
close to almost raid-0 performance here.
is quite incorrect. You will get close to raid-0 if you have many disks, but 
will never beat raid0, since one disk is always busy writing parity, which is 
not part of the write request submitted to the mdX device in the first place.


[1] Theoretical since any external factors (busy CPU, unsuitable elevator, 
random disk access, multiple raid levels on one physical device) would all 
contribute to take you further away from the maximums.



Re: Deleting mdadm RAID arrays

2008-02-06 Thread Peter Rabbitson

Marcin Krol wrote:

Tuesday 05 February 2008 21:12:32 Neil Brown napisał(a):


% mdadm --zero-superblock /dev/sdb1
mdadm: Couldn't open /dev/sdb1 for write - not zeroing

That's weird.
Why can't it open it?


Hell if I know. First time I see such a thing. 


Maybe you aren't running as root (The '%' prompt is suspicious).


I am running as root, the % prompt is the obfuscation part (I have
configured bash to display IP as part of prompt).


Maybe the kernel has  been told to forget about the partitions of
/dev/sdb.


But fdisk/cfdisk has no problem whatsoever finding the partitions .


mdadm will sometimes tell it to do that, but only if you try to
assemble arrays out of whole components.



If that is the problem, then
   blockdev --rereadpt /dev/sdb


I deleted LVM devices that were sitting on top of RAID and reinstalled mdadm.

% blockdev --rereadpt /dev/sdf
BLKRRPART: Device or resource busy

% mdadm /dev/md2 --fail /dev/sdf1
mdadm: set /dev/sdf1 faulty in /dev/md2

% blockdev --rereadpt /dev/sdf
BLKRRPART: Device or resource busy

% mdadm /dev/md2 --remove /dev/sdf1
mdadm: hot remove failed for /dev/sdf1: Device or resource busy

lsof /dev/sdf1 gives ZERO results.



What does this say:

dmsetup table
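For the archives, the usual culprit in this situation is a leftover device-mapper (LVM) mapping still holding the partition open; lsof only shows userspace openers, not kernel users. A sketch of how one might chase it down (the mapping name below is hypothetical):

dmsetup table             # look for mappings still referencing sdf1
dmsetup info -c           # shows the open count of each mapping
dmsetup remove vg0-oldlv  # remove the hypothetical stale mapping
blockdev --rereadpt /dev/sdf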



Re: RAID 1 and grub

2008-02-03 Thread Peter Rabbitson

Bill Davidsen wrote:

Richard Scobie wrote:

A followup for the archives:

I found this document very useful:

http://lists.us.dell.com/pipermail/linux-poweredge/2003-July/008898.html

After modifying my grub.conf to refer to (hd0,0), reinstalling grub on 
hdc with:


grub device (hd0) /dev/hdc

grub root (hd0,0)

grub (hd0)

and rebooting with the bios set to boot off hdc, everything burst back 
into life.


I shall now be checking all my Fedora/Centos RAID1 installs for grub 
installed on both drives.


Have you actually tested this by removing the first hd and booting? 
Depending on the BIOS I believe that the fallback drive will be called 
hdc by the BIOS but will be hdd in the system. That was with RHEL3, but 
worth testing.




The line:

grub device (hd0) /dev/hdc

simply means treat /dev/hdc as the first _bios_ hard disk in the system. 
This way when grub writes to the MBR of hd0, it will in fact be writing to 
/dev/hdc. The reason the drive must be referenced as hd0 (and not hd2) is 
that grub enumerates drives according to the bios, and therefore the drive 
from which the bios is currently booting is _always_ hd0.
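A sketch of the whole sequence as a non-interactive grub (legacy) session, using the device names from this thread (adjust to your own layout):

grub --batch <<EOF
device (hd0) /dev/hdc
root (hd0,0)
setup (hd0)
quit
EOF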




WRONG INFO (was Re: In this partition scheme, grub does not find md information?)

2008-01-30 Thread Peter Rabbitson

Moshe Yudkowsky wrote:
over the other. For example, I've now learned that if I want to set up a 
RAID1 /boot, it must actually be 1.2 or grub won't be able to read it. 
(I would therefore argue that if the new version ever becomes default, 
then the default sub-version ought to be 1.2.)


In the discussion yesterday I myself made a serious typo, that should not 
spread. The only superblock version that will work with current GRUB is 1.0 
_not_ 1.2.



Re: In this partition scheme, grub does not find md information?

2008-01-30 Thread Peter Rabbitson

Peter Rabbitson wrote:

Moshe Yudkowsky wrote:
Here's a baseline question: if I create a RAID10 array using default 
settings, what do I get? I thought I was getting RAID1+0; am I really?


Maybe you are, depending on your settings, but this is beside the point. 
No matter what 1+0 you have (linux, classic, or otherwise) you can not 
boot from it, as there is no way to see the underlying filesystem 
without the RAID layer.


With the current state of affairs (available mainstream bootloaders) the 
rule is:

Block devices containing the kernel/initrd image _must_ be either:
* a regular block device (/sda1, /hda, /fd0, etc.)
* or a linux RAID 1 with the superblock at the end of the device 
(0.9 or 1.2)





If any poor soul finds this in the mailing list archives, the above should read:

...
	* or a linux RAID 1 with the superblock at the end of the device (either 
version 0.9 or _1.0_)
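For completeness, a hedged example of creating such a boot mirror (partition names are hypothetical):

# version 1.0 puts the superblock at the end of each member, so GRUB can read
# the /boot filesystem straight off /dev/sda1 or /dev/sdb1
mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=1.0 /dev/sda1 /dev/sdb1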




Re: Help, big error, dd first GB of a raid:-/

2008-01-30 Thread Peter Rabbitson

Lars Schimmer wrote:


Hi!

Due to a very bad idea/error, I zeroed my first GB of /dev/md0.
Now fdisk doesn't find any disk on /dev/md0.
Any idea on how to recover?



It largely depends on what is /dev/md0, and what was on /dev/md0. Provide very 
detailed info:


* Was the MD device partitioned?
* What filesystem(s) were residing on the array, what sizes, what order
* What was each filesystem used for (mounted as what)

Someone might be able to help at this point, however if you do not have a 
backup - you are in very very deep trouble already.



Re: Help, big error, dd first GB of a raid:-/

2008-01-30 Thread Peter Rabbitson

Lars Schimmer wrote:


I activate the backup right now - was OpenAFS with some RW volumes -
fairly easy to backup, but...
If it's hard to recover raid data, I recreate the raid and forget the
old data on it.


It is not that hard to recover the raid itself, however the ext3 on top is 
most likely FUBAR (especially after 1GB was overwritten). Since it seems the 
data is not that important to you, just roll back to a backup and move on.



Re: In this partition scheme, grub does not find md information?

2008-01-30 Thread Peter Rabbitson

Michael Tokarev wrote:


With 5-drive linux raid10:

   A  B  C  D  E
   0  0  1  1  2
   2  3  3  4  4
   5  5  6  6  7
   7  8  8  9  9
  10 10 11 11 12
   ...

AB can't be removed - 0, 5.  AC CAN be removed, as
are AD.  But not AE - losing 2 and 7.  And so on.


I stand corrected by Michael, this is indeed the case with the current state 
of md raid 10. Either my observations were incorrect when I made them a year 
and a half ago, or some fixes went into the kernel since then.


In any case - linux md10 does behave exactly like a classic raid 1+0 when created 
with -n D -p nS, where D and S are both even and D = 2S.
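For reference, a minimal example of that configuration (D = 4, S = 2; device names hypothetical):

mdadm --create /dev/md0 --level=10 --raid-devices=4 -p n2 /dev/sd[abcd]1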



Re: In this partition scheme, grub does not find md information?

2008-01-30 Thread Peter Rabbitson

Keld Jørn Simonsen wrote:

On Wed, Jan 30, 2008 at 03:47:30PM +0100, Peter Rabbitson wrote:

Michael Tokarev wrote:


With 5-drive linux raid10:

  A  B  C  D  E
  0  0  1  1  2
  2  3  3  4  4
  5  5  6  6  7
  7  8  8  9  9
 10 10 11 11 12
  ...

AB can't be removed - 0, 5.  AC CAN be removed, as
are AD.  But not AE - losing 2 and 7.  And so on.


I see. Does the kernel code allow this? And mdadm?

And can B+E be removed safely, and C+E and B+D? 



It seems like it. I just created the above raid configuration with 5 loop 
devices. Everything behaved just like Michael described. When the wrong drives 
disappeared - I started getting IO errors.



Re: [PATCH] Use new sb type

2008-01-29 Thread Peter Rabbitson

Tim Southerwood wrote:

David Greaves wrote:


IIRC Doug Ledford did some digging wrt lilo + grub and found that 1.1 and 1.2
wouldn't work with them. I'd have to review the thread though...

David
-


For what it's worth, that was my finding too. -e 0.9+1.0 are fine with
GRUB, but 1.1 and 1.2 won't work under the filesystem that contains
/boot, at least with GRUB 1.x (I haven't used LILO for some time nor
have I tried the development GRUB 2).

The reason IIRC boils down to the fact that GRUB 1 isn't MD aware, and
the only reason one can get away with using it on a RAID 1 setup at
all is that the constituent devices present the same data as the
composite MD device, from the start.

Putting an MD SB at/near the beginning of the device breaks this case
and GRUB 1 doesn't know how to deal with it.

Cheers
Tim
-


I read the entire thread, and it seems that the discussion drifted away from 
the issue at hand. I hate flogging a dead horse, but here are my 2 cents:


First the summary:

* Currently LILO and GRUB are the only booting mechanisms widely available 
(GRUB2 is nowhere to be seen, and seems to be badly designed anyway)


* Neither of these boot mechanisms understands RAID at all, so they can 
boot only off a block device containing a hackishly-readable filesystem (lilo: 
files are mappable, grub: a stage1.5 exists)


* The only raid level providing unfettered access to the underlying filesystem 
is RAID1 with a superblock at its end, and it has been common wisdom for years 
that you need a RAID1 boot partition in order to boot anything at all.


The problem is that these three points do not affect any other raid level (as 
you can not boot from any of them in a reliable fashion anyway). I saw a 
number of voices insisting that backward compatibility must be preserved. I don't 
see any need for that because:


* The distro managers will definitely RTM and will adjust their flashy GUIs to 
do the right thing by explicitly supplying -e 1.0 for boot devices


* A clueless user might burn himself by making a single root on a single raid1 
device. But wait - he can burn himself the same way by making the root a raid5 
device and rebooting.


Why do we sacrifice the right thing to do? To eliminate the possibility of 
someone shooting himself in the foot by not reading the manual?


Cheers
Peter


Re: write-intent bitmaps

2008-01-29 Thread Peter Rabbitson

Russell Coker wrote:
Are there plans for supporting a NVRAM write-back cache with Linux software 
RAID?




AFAIK even today you can place the bitmap in an external file residing on a 
file system which in turn can reside on the nvram...
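A hedged sketch of what that could look like (the mount point is an assumption; the bitmap file must live on a filesystem that is not itself on the array):

mdadm --grow /dev/md0 --bitmap=none                # drop the internal bitmap, if any
mdadm --grow /dev/md0 --bitmap=/nvram/md0-bitmap   # external bitmap on the NVRAM-backed fs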


Peter



Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Peter Rabbitson

Moshe Yudkowsky wrote:


One of the puzzling things about this is that I conceive of RAID10 as 
two RAID1 pairs, with RAID0 on top of to join them into a large drive. 
However, when I use --level=10  to create my md drive, I cannot find out 
which two pairs are the RAID1's: the --detail doesn't give that 
information. Re-reading the md(4) man page, I think I'm badly mistaken 
about RAID10.


Furthermore, since grub cannot find the /boot on the md drive, I deduce 
that RAID10 isn't what the 'net descriptions say it is.




It is exactly what the name implies - a new kind of RAID :) The setup you 
describe is not RAID10, it is RAID1+0. As far as how linux RAID10 works - here 
is an excellent article: 
http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10


Peter


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Peter Rabbitson

Michael Tokarev wrote:
  Raid10 IS RAID1+0 ;)

It's just that linux raid10 driver can utilize more.. interesting ways
to lay out the data.


This is misleading, and adds to the confusion existing even before linux 
raid10. When you say raid10 in the hardware raid world, what do you mean? 
Stripes of mirrors? Mirrors of stripes? Some proprietary extension?


What Neil did was generalize the concept of N drives - M copies, and called it 
10 because it could exactly mimic the layout of conventional 1+0 [*]. However 
thinking about md level 10 in terms of RAID 1+0 is wrong. Two examples 
(there are many more):


* mdadm -C -l 10 -n 3 -p f2 /dev/md10 /dev/sda1 /dev/sdb1 /dev/sdc1
Odd number of drives, no parity calculation overhead, yet the setup can still 
survive the loss of a single drive


* mdadm -C -l 10 -n 2 -p f2 /dev/md10 /dev/sda1 /dev/sdb1
This seems useless at first, as it effectively creates a RAID1 setup, without 
preserving the FS format on disk. However md10 has read balancing code, so one 
could get a single-thread sustained read at twice the speed one could 
possibly get with md1 in the current implementation


I guess I will sit down tonight and craft some patches to the existing md* man 
pages. Some things are indeed left unsaid.


Peter

[*] The layout is the same but the functionality is different. If you have 1+0 
on 4 drives, you can survive a loss of 2 drives as long as they are part of 
different mirrors. mdadm -C -l 10 -n 4 -p n2 <drives> however will _NOT_ 
survive a loss of 2 drives.



Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Peter Rabbitson

Moshe Yudkowsky wrote:
Here's a baseline question: if I create a RAID10 array using default 
settings, what do I get? I thought I was getting RAID1+0; am I really?


Maybe you are, depending on your settings, but this is beside the point. No 
matter what 1+0 you have (linux, classic, or otherwise) you can not boot from 
it, as there is no way to see the underlying filesystem without the RAID layer.


With the current state of affairs (available mainstream bootloaders) the rule 
is:
Block devices containing the kernel/initrd image _must_ be either:
* a regular block device (/sda1, /hda, /fd0, etc.)
* or a linux RAID 1 with the superblock at the end of the device (0.9 
or 1.2)


My superblocks, by the way, are marked version 01; my metadata in 
mdadm.conf asked for 1.2. I wonder what I really got.


This is how you find the actual raid version:

mdadm -D /dev/md[X] | grep Version

This will return a string of the form XX.YY.ZZ. Your superblock version is 
XX.YY.


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Peter Rabbitson

Moshe Yudkowsky wrote:

Keld Jørn Simonsen wrote:


raid10 have a number of ways to do layout, namely the near, far and
offset ways, layout=n2, f2, o2 respectively.


The default layout, according to --detail, is near=2, far=1. If I 
understand what's been written so far on the topic, that's automatically 
incompatible with 1+0.




Unfortunately you are interpreting this wrong as well. far=1 is just a way of 
saying 'no copies of type far'.



BUG: possible array corruption when adding a component to a degraded raid5 (possibly other levels too)

2008-01-28 Thread Peter Rabbitson

Hello,

It seems that mdadm/md do not perform proper sanity checks before adding a 
component to a degraded array. If the size of the new component is just right, 
the superblock information will overlap with the data area. This will happen 
without any error indications in the syslog or otherwise.


I came up with a reproducible scenario which I am attaching to this email 
alongside with the entire test script. I have not tested it for other raid 
levels, or other types of superblocks, but I suspect the same problem will 
occur for many other configurations.


I am willing to test patches, however the attached script is non-intrusive 
enough to be executed anywhere.


The output of the script follows below.

Peter

==
==
==

[EMAIL PROTECTED]:/media/space/testmd# ./md_overlap_test
Creating component 1 (1056768 bytes)... done.
Creating component 2 (1056768 bytes)... done.
Creating component 3 (1056768 bytes)... done.


===
Creating 3 disk raid5 array with v1.1 superblock
mdadm: array /dev/md9 started.
Waiting for resync to finish... done.

md9 : active raid5 loop3[3] loop2[1] loop1[0]
  2048 blocks super 1.1 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

Initial checksum of raw raid5 device: 4df1921524a3b717a956fceaed0ae691  /dev/md9


===
Failing first component
mdadm: set /dev/loop1 faulty in /dev/md9
mdadm: hot removed /dev/loop1

md9 : active raid5 loop3[3] loop2[1]
  2048 blocks super 1.1 level 5, 64k chunk, algorithm 2 [3/2] [_UU]

Checksum of raw raid5 device after failing component: 
4df1921524a3b717a956fceaed0ae691  /dev/md9



===
Re-creating block device with size 1048576 bytes, so both the superblock and 
data start at the same spot

Adding back to array
mdadm: added /dev/loop1
Waiting for resync to finish... done.

md9 : active raid5 loop1[4] loop3[3] loop2[1]
  2048 blocks super 1.1 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

Checksum of raw raid5 device after adding back smaller component: 
bb854f77ad222d224fcdd8c8f96b51f0  /dev/md9



===
Attempting recovery
Waiting for recovery to finish... done.
Performing check
Waiting for check to finish... done.

Current value of mismatch_cnt: 0

Checksum of raw raid5 device after repair/check: 
146f5c37305c42cda64538782c8c3794  /dev/md9

[EMAIL PROTECTED]:/media/space/testmd#
#!/bin/bash

echo "Please read the script first, and comment the exit line at the top."
echo "This script will require about 3MB of free space, it will free (and use)"
echo "loop devices 1 2 and 3, and will use the md device number specified in MD_DEV."
exit 0

MD_DEV=md9     # make sure this is not an array you use
COMP_NUM=3
COMP_SIZE=$((1 * 1024 * 1024 + 8192))   # 1MiB comp sizes with room for 8k (16 sect) of metadata

mdadm -S /dev/$MD_DEV > /dev/null 2>&1

DEVS=""
for i in $(seq $COMP_NUM); do
    echo -n "Creating component $i ($COMP_SIZE bytes)... "
    losetup -d /dev/loop${i} > /dev/null 2>&1

    set -e
    PCMD="print \"\\x${i}${i}\" x $COMP_SIZE"   # fill entire image with the component number (0xiii...)
    perl -e "$PCMD" > dummy${i}.img
    losetup /dev/loop${i} dummy${i}.img
    DEVS="$DEVS /dev/loop${i}"
    set +e
    echo done.
done

echo
echo
echo ===
echo "Creating $COMP_NUM disk raid5 array with v1.1 superblock"
# superblock at beginning of blockdev guarantees that it will overlap with real data, not with parity
mdadm -C /dev/$MD_DEV -l 5 -n $COMP_NUM -e 1.1 $DEVS

echo -n "Waiting for resync to finish..."
while [ "$(cat /sys/block/$MD_DEV/md/sync_action)" != "idle" ] ; do
    echo -n .
    sleep 1
done
echo " done."
echo
grep -A1 $MD_DEV /proc/mdstat 

echo
echo -n "Initial checksum of raw raid5 device: "
md5sum /dev/$MD_DEV

echo
echo
echo ===
echo "Failing first component"
mdadm -f /dev/$MD_DEV /dev/loop1
mdadm -r /dev/$MD_DEV /dev/loop1

echo
grep -A1 $MD_DEV /proc/mdstat 

echo
echo -n "Checksum of raw raid5 device after failing component: "
md5sum /dev/$MD_DEV

echo
echo
echo ===
NEWSIZE=$(( $COMP_SIZE - $(cat /sys/block/$MD_DEV/md/rd1/offset) * 512 ))
echo "Re-creating block device with size $NEWSIZE bytes, so both the superblock and data start at the same spot"
losetup -d /dev/loop1 > /dev/null 2>&1
PCMD="print \"\\x11\" x $NEWSIZE"
perl -e "$PCMD" > dummy1.img
losetup /dev/loop1 dummy1.img

echo Adding back to array
mdadm -a /dev/$MD_DEV /dev/loop1

echo -n "Waiting for resync to finish..."
while [ "$(cat /sys/block/$MD_DEV/md/sync_action)" != "idle" ] ; do
echo -n .

Re: BUG: possible array corruption when adding a component to a degraded raid5 (possibly other levels too)

2008-01-28 Thread Peter Rabbitson

Neil Brown wrote:

On Monday January 28, [EMAIL PROTECTED] wrote:

Hello,

It seems that mdadm/md do not perform proper sanity checks before adding a 
component to a degraded array. If the size of the new component is just right, 
the superblock information will overlap with the data area. This will happen 
without any error indications in the syslog or otherwise.


I thought I fixed that What versions of Linux kernel and mdadm are
you using for your tests?



Linux is 2.6.23.14 with everything md related compiled in (no modules)
mdadm - v2.6.4 - 19th October 2007 (latest in debian/sid)

Peter


Re: [PATCH] Use new sb type

2008-01-28 Thread Peter Rabbitson

David Greaves wrote:

Jan Engelhardt wrote:

This makes 1.0 the default sb type for new arrays.



IIRC there was a discussion a while back on renaming mdadm options (google Time
to  deprecate old RAID formats?) and the superblocks to emphasise the location
and data structure. Would it be good to introduce the new names at the same time
as changing the default format/on-disk-location?

David


Also wasn't the concession to make 1.1 default instead of 1.0 ?

Peter


Problem with raid5 grow/resize (not restripe)

2008-01-22 Thread Peter Rabbitson

Hello,

I cannot seem to slightly extend a raid volume of mine. I issue 
the command:


mdadm --grow --size=max /dev/md5

it completes and nothing happens. The kernel log is empty, however the event 
counter on the drive is incremented by +3.


Here is what I have (yes I know that I am resizing only by about 200MB). Why 
am I not able to reach 824.8GiB?


Thank you for your help.



[EMAIL PROTECTED]:~# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md5 : active raid5 sda3[4] sdd3[3] sdc3[2] sdb3[1]
  864276480 blocks super 1.1 level 5, 2048k chunk, algorithm 2 [4/4] [UUUU]

md10 : active raid10 sdd2[3] sdc2[2] sdb2[1] sda2[0]
  5353472 blocks 1024K chunks 3 far-copies [4/4] [UUUU]

md1 : active raid1 sdd1[1] sdc1[0] sdb1[3] sda1[2]
  56128 blocks [4/4] [UUUU]

unused devices: <none>
[EMAIL PROTECTED]:~#



[EMAIL PROTECTED]:~# mdadm -D /dev/md5
/dev/md5:
Version : 01.01.03
  Creation Time : Tue Jan 22 03:52:42 2008
 Raid Level : raid5
 Array Size : 864276480 (824.24 GiB 885.02 GB)
  Used Dev Size : 576184320 (274.75 GiB 295.01 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 5
Persistence : Superblock is persistent

Update Time : Wed Jan 23 02:21:47 2008
  State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

 Layout : left-symmetric
 Chunk Size : 2048K

   Name : Thesaurus:Crypta  (local to host Thesaurus)
   UUID : 1decb2d1:ebf16128:a240938a:669b0999
 Events : 5632

Number   Major   Minor   RaidDevice State
   4       8        3        0      active sync   /dev/sda3
   1       8       19        1      active sync   /dev/sdb3
   2       8       35        2      active sync   /dev/sdc3
   3       8       51        3      active sync   /dev/sdd3
[EMAIL PROTECTED]:~#



[EMAIL PROTECTED]:~# fdisk -l /dev/sd[abcd]

Disk /dev/sda: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x

   Device Boot  Start End  Blocks   Id  System
/dev/sda1   1   7   56196   fd  Linux raid autodetect
/dev/sda2   8 507 4016250   fd  Linux raid autodetect
/dev/sda3 508   36385   288190035   83  Linux
/dev/sda4   36386   48641    98446320   83  Linux

Disk /dev/sdb: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x

   Device Boot  Start End  Blocks   Id  System
/dev/sdb1   1   7   56196   fd  Linux raid autodetect
/dev/sdb2   8 507 4016250   fd  Linux raid autodetect
/dev/sdb3 508   36385   288190035   83  Linux
/dev/sdb4   36386   38913    20306160   83  Linux

Disk /dev/sdc: 300.0 GB, 300090728448 bytes
255 heads, 63 sectors/track, 36483 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x

   Device Boot  Start End  Blocks   Id  System
/dev/sdc1   1   7   56196   fd  Linux raid autodetect
/dev/sdc2   8 507 4016250   fd  Linux raid autodetect
/dev/sdc3 508   36385   288190035   83  Linux
/dev/sdc4   36386   36483  787185   83  Linux

Disk /dev/sdd: 300.0 GB, 300090728448 bytes
255 heads, 63 sectors/track, 36483 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x

   Device Boot  Start End  Blocks   Id  System
/dev/sdd1   1   7   56196   fd  Linux raid autodetect
/dev/sdd2   8 507 4016250   fd  Linux raid autodetect
/dev/sdd3 508   36385   288190035   83  Linux
/dev/sdd4   36386   36483  787185   83  Linux
[EMAIL PROTECTED]:~#



Re: Fastest Chunk Size w/XFS For MD Software RAID = 1024k

2007-06-28 Thread Peter Rabbitson

Justin Piszcz wrote:

mdadm --create \
  --verbose /dev/md3 \
  --level=5 \
  --raid-devices=10 \
  --chunk=1024 \
  --force \
  --run
  /dev/sd[cdefghijkl]1

Justin.


Interesting, I came up with the same results (1M chunk being superior) 
with a completely different raid set with XFS on top:


mdadm   --create \
--level=10 \
--chunk=1024 \
--raid-devices=4 \
--layout=f3 \
...

Could it be attributed to XFS itself?

Peter



Re: Fastest Chunk Size w/XFS For MD Software RAID = 1024k

2007-06-28 Thread Peter Rabbitson

Justin Piszcz wrote:


On Thu, 28 Jun 2007, Peter Rabbitson wrote:

Interesting, I came up with the same results (1M chunk being superior) 
with a completely different raid set with XFS on top:


...

Could it be attributed to XFS itself?

Peter



Good question, by the way how much cache do the drives have that you are 
testing with?




I believe 8MB, but I am not sure I am looking at the right number:

[EMAIL PROTECTED]:~# hdparm -i /dev/sda

/dev/sda:

 Model=aMtxro7 2Y050M  , FwRev=AY5RH10W, 
SerialNo=6YB6Z7E4

 Config={ Fixed }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=DualPortCache, BuffSize=7936kB, MaxMultSect=16, MultSect=?0?
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=268435455
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5
 AdvancedPM=yes: disabled (255) WriteCache=enabled
 Drive conforms to: ATA/ATAPI-7 T13 1532D revision 0:  ATA/ATAPI-1 
ATA/ATAPI-2 ATA/ATAPI-3 ATA/ATAPI-4 ATA/ATAPI-5 ATA/ATAPI-6 ATA/ATAPI-7


 * signifies the current active mode

[EMAIL PROTECTED]:~#

1M chunk consistently delivered best performance with:

o A plain dumb dd run
o bonnie
o two bonnie threads
o iozone with 4 threads

My RA is set at 256 for the drives and 16384 for the array (128k and 8M 
respectively)



Re: LVM on raid10 - severe performance drop

2007-06-11 Thread Peter Rabbitson

Bernd Schubert wrote:


Try to increase the read-ahead size of your lvm devices:

blockdev --setra 8192 /dev/raid10/space

or increase it at least to the same size as of your raid (blockdev
--getra /dev/mdX).


This did the trick, although I am still lagging behind the raw md device 
by about 3 - 4%. Thanks for pointing this out!



question about --assume-clean

2007-06-09 Thread Peter Rabbitson

Hi,

I am about to create a large raid10 array, and I know for a fact that 
all the components are identical (dd if=/dev/zero of=/dev/sdXY). Is it 
safe to pass --assume-clean and spare 6 hours of reconstruction, or are 
there some hidden dangers in doing so?


Thanks

Peter


LVM on raid10 - severe performance drop

2007-06-09 Thread Peter Rabbitson

Hi,

This question might be better suited for the lvm mailing list, but 
raid10 being rather new, I decided to ask here first. Feel free to 
direct me elsewhere.


I want to use lvm on top of a raid10 array, as I need the snapshot 
capability for backup purposes. The tuning and creation of the array 
went fine, I am getting the read performance I am looking for. However 
as soon as I create a VG using the array as the only PV, the raw read 
performance drops to the ground. I suspect it has to do with some minimal 
tuning of LVM parameters, but I am at a loss on what to tweak (and 
Google is certainly evil to me today). Below I am including my 
configuration and test results, please let me know if you spot anything 
wrong, or have any suggestions.


Thank you!

Peter



[EMAIL PROTECTED]:~# mdadm -D /dev/md1
/dev/md1:
Version : 00.90.03
  Creation Time : Sat Jun  9 15:28:01 2007
 Raid Level : raid10
 Array Size : 317444096 (302.74 GiB 325.06 GB)
  Used Dev Size : 238083072 (227.05 GiB 243.80 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1
Persistence : Superblock is persistent

Update Time : Sat Jun  9 19:33:29 2007
  State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

 Layout : near=1, far=3
 Chunk Size : 1024K

   UUID : c16dbfd8:8a139e54:6e26228f:2ab99bd0 (local to host 
Arzamas)

 Events : 0.4

Number   Major   Minor   RaidDevice State
   0       8        2        0      active sync   /dev/sda2
   1       8       18        1      active sync   /dev/sdb2
   2       8       34        2      active sync   /dev/sdc2
   3       8       50        3      active sync   /dev/sdd2
[EMAIL PROTECTED]:~#


[EMAIL PROTECTED]:~# pvs -v
Scanning for physical volume names
  PV VG Fmt  Attr PSize   PFree   DevSize PV UUID 

  /dev/md1   raid10 lvm2 a-   302.73G 300.73G 302.74G 
vS7gT1-WTeh-kXng-Iw7y-gzQc-1KSH-mQ1PQk

[EMAIL PROTECTED]:~#


[EMAIL PROTECTED]:~# vgs -v
Finding all volume groups
Finding volume group raid10
  VG Attr   Ext   #PV #LV #SN VSize   VFree   VG UUID 

  raid10 wz--n- 4.00M   1   1   0 302.73G 300.73G 
ZosHXa-B1Iu-bax1-zMDk-FUbp-37Ff-k01aOK

[EMAIL PROTECTED]:~#


[EMAIL PROTECTED]:~# lvs -v
Finding all logical volumes
  LVVG #Seg Attr   LSize Maj Min KMaj KMin Origin Snap%  Move 
Copy%  Log LV UUID
  space raid101 -wi-a- 2.00G  -1  -1 253  0 
  i0p99S-tWFz-ELpl-bGXt-4CWz-Elr4-a1ao8f

[EMAIL PROTECTED]:~#


[EMAIL PROTECTED]:~# dd if=/dev/md1 of=/dev/null bs=1M count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 11.4846 seconds, 183 MB/s
[EMAIL PROTECTED]:~#


[EMAIL PROTECTED]:~# dd if=/dev/md1 of=/dev/null bs=512 count=4000000
4000000+0 records in
4000000+0 records out
2048000000 bytes (2.0 GB) copied, 11.4032 seconds, 180 MB/s
[EMAIL PROTECTED]:~#


[EMAIL PROTECTED]:~# dd if=/dev/raid10/space of=/dev/null bs=1M count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 25.7089 seconds, 81.6 MB/s
[EMAIL PROTECTED]:~#


[EMAIL PROTECTED]:~# dd if=/dev/raid10/space of=/dev/null bs=512 count=4000000
4000000+0 records in
4000000+0 records out
2048000000 bytes (2.0 GB) copied, 26.1776 seconds, 78.2 MB/s
[EMAIL PROTECTED]:~#


P.S. I know that dd is not the best benchmarking tool, but the 
difference is so big, that even this non-scientific approach works.




Re: Customize the error emails of `mdadm --monitor`

2007-06-06 Thread Peter Rabbitson

Iustin Pop wrote:

On Wed, Jun 06, 2007 at 01:31:44PM +0200, Peter Rabbitson wrote:

Peter Rabbitson wrote:

Hi,

Is there a way to list the _number_ in addition to the name of a 
problematic component? The kernel trend to move all block devices into 
the sdX namespace combined with the dynamic name allocation renders 
messages like /dev/sdc1 has problems meaningless. It would make remote 
server support so much easier, by allowing the administrator to label 
drive trays Component0 Component1 Component2... etc, and be sure that 
the local tech support person will not pull out the wrong drive from the 
system.


Any takers? Or is it a RTFM question (in which case I certainly 
overlooked the relevant doc)?


If you use udev, have you looked in /dev/disk? I think it solves the
problem you need by allowing one to see either the disks by id or by
path. Making the reverse map is then trivial (for a reasonable number of
disks).



This would not work as arrays are assembled by the kernel at boot time, 
at which point there is no udev or anything else for that matter other 
than /dev/sdX. And I am pretty sure my OS (debian) does not support udev 
in initrd as of yet.


Pete


Re: Customize the error emails of `mdadm --monitor`

2007-06-06 Thread Peter Rabbitson

Gabor Gombas wrote:

On Wed, Jun 06, 2007 at 02:23:31PM +0200, Peter Rabbitson wrote:

This would not work as arrays are assembled by the kernel at boot time, at 
which point there is no udev or anything else for that matter other than 
/dev/sdX. And I am pretty sure my OS (debian) does not support udev in 
initrd as of yet.


But I think sending mails from the initrd isn't supported either, so if
you already hack the initrd, you can get the path information from
sysfs. udev is nothing magical, it just walks the sysfs tree and calls
some little helper programs when collecting the information for building
/dev/disk; you can do that yourself if you want.

Gabor



I think I did not make my problem clear enough. The _device name_ 
reported in the emails is the one with which the array was initially 
assembled. For this I have two choices:


* Kernel auto-assembly - the parts are properly detected and assembled, 
but there is no strong relationship between component number and sdX, 
especially if asynchronous scsi scanning takes place.


* Assembly by mdadm.conf - I can put whatever block devices I want in 
there, and they will be preserved in the email, but it is very 
cumbersome to do it for root and other system partitions.


So I was asking if the component _number_, which is unique to a specific 
device regardless of the assembly mechanism, can be reported in case of 
a failure.




Re: Customize the error emails of `mdadm --monitor`

2007-06-06 Thread Peter Rabbitson

Gabor Gombas wrote:

On Wed, Jun 06, 2007 at 04:24:31PM +0200, Peter Rabbitson wrote:

So I was asking if the component _number_, which is unique to a specific 
device regardless of the assembly mechanism, can be reported in case of a 
failure.


So you need to write an event-handling script and pass it to mdadm
(--program). In the script you can walk sysfs and/or call the
appropriate helper programs to extract all the information you need and
format it in the way you want. For example, if you want the slot number
of a failed disk, you can get it from /sys/block/$2/md/dev-$3/slot
(according to the manpage, not tested).



Now that's some real advice. I have not thought of that. Thank you!

Peter
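A minimal, untested sketch of such a handler, following the calling convention of mdadm --monitor --program ($1 = event, $2 = md device, $3 = component device when applicable); the mail command and install path are assumptions:

#!/bin/sh
EVENT=$1
MD=${2#/dev/}
DEV=${3#/dev/}

SLOT=unknown
if [ -n "$DEV" ] && [ -r "/sys/block/$MD/md/dev-$DEV/slot" ]; then
    SLOT=$(cat "/sys/block/$MD/md/dev-$DEV/slot")
fi

echo "md event '$EVENT' on $2, component ${3:-none} (slot $SLOT)" \
    | mail -s "mdadm: $EVENT on $2" root

It would be hooked up with something along the lines of `mdadm --monitor --scan --program=/usr/local/sbin/md-event`.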


Customize the error emails of `mdadm --monitor`

2007-06-02 Thread Peter Rabbitson

Hi,

Is there a way to list the _number_ in addition to the name of a 
problematic component? The kernel trend to move all block devices into 
the sdX namespace combined with the dynamic name allocation renders 
messages like /dev/sdc1 has problems meaningless. It would make remote 
server support so much easier, by allowing the administrator to label 
drive trays Component0 Component1 Component2... etc, and be sure that 
the local tech support person will not pull out the wrong drive from the 
system.


Thanks

Peter


Re: how to synchronize two devices (RAID-1, but not really?)

2007-05-15 Thread Peter Rabbitson
Tomasz Chmielewski wrote:
 I have a RAID-10 setup of four 400 GB HDDs. As the data grows by several
 GBs a day, I want to migrate it somehow to RAID-5 on separate disks in a
 separate machine.
 
 Which would be easy, if I didn't have to do it online, without stopping
 any services.
 
 

Your /dev/md10 - what is directly on top of it? LVM? XFS? EXT3?


Re: Raid1 replaced with raid10?

2007-05-07 Thread Peter Rabbitson
Neil Brown wrote:
 On Friday May 4, [EMAIL PROTECTED] wrote:
 Peter Rabbitson wrote:
 Hi,

 I asked this question back in march but received no answers, so here it
 goes again. Is it safe to replace raid1 with raid10 where the amount of
 disks is equal to the amount of far/near/offset copies? I understand it
 has the downside of not being a bit-by-bit mirror of a plain filesystem.
 Are there any other caveats?
   
 
 To answer the original question, I assume you mean replace as in
 backup, create new array, then restore.
 You will get different performance characteristics.  Whether they
 better suit your needs or not will depend largely on your needs.

Hi Neil,
Yes I meant take an existing 2 drive raid1 array (non bootable data) and
put a raid10 array in its place. All my testing indicates that I get the
same write performance but nearly double the read speed (due to
interleaving I guess). It seemed to good to be true, thus I am asking
the question. Could you elaborate on your last sentence? Are there
downsides I could not think of? Thank you!

Peter


Re: Raid1 replaced with raid10?

2007-05-07 Thread Peter Rabbitson
Neil Brown wrote:
 On Monday May 7, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
 On Friday May 4, [EMAIL PROTECTED] wrote:
 Peter Rabbitson wrote:
 Hi,

 I asked this question back in march but received no answers, so here it
 goes again. Is it safe to replace raid1 with raid10 where the amount of
 disks is equal to the amount of far/near/offset copies? I understand it
 has the downside of not being a bit-by-bit mirror of a plain filesystem.
 Are there any other caveats?
   
 To answer the original question, I assume you mean replace as in
 backup, create new array, then restore.
 You will get different performance characteristics.  Whether they
 better suit your needs or not will depend largely on your needs.
 Hi Neil,
 Yes I meant take an existing 2 drive raid1 array (non bootable data) and
 put a raid10 array in its place. All my testing indicates that I get the
 same write performance but nearly double the read speed (due to
 interleaving I guess). It seemed too good to be true, thus I am asking
 the question. Could you elaborate on your last sentence? Are there
 downsides I could not think of? Thank you!
 
 I would have thought that you need far or offset to improve read
 performance, and they tend to hurt write performance (though I haven't
 really measured offset much).
 
 What layout are you using?
 

Correct, I am using 'far' layout. The interleaving of the 'offset'
layout does not work too well for sequential reads, but far really
shines. Yes, write performance is hurt by about 10%; compared to the 190%
gain in reads I can live with it.


Re: Raid1 replaced with raid10?

2007-05-07 Thread Peter Rabbitson
Bill Davidsen wrote:
 Not worth a repost, since I was way over answering his question...

Erm... and now you made me curious :) Please share your thoughts if it is
not too much trouble. Thank you for your time.

Peter


Speed variation depending on disk position (was: Linux SW RAID: HW Raid Controller/JBOD vs. Multiple PCI-e Cards?)

2007-05-05 Thread Peter Rabbitson
Chris Wedgwood wrote:
 snip
 
 Also, 'dd performance' varies between the start of a disk and the end.
 Typically you get better performance at the start of the disk so dd
 might not be a very good benchmark here.
 

Hi,
Sorry for hijacking this thread, but I was actually planning to ask this
very same question. Is the behavior you are describing above
manufacturer dependent, or is it pretty much dictated by the general
design of modern drives? I have an array of 4 Maxtor sata drives, and
raw read performance at the end of the disk is 38mb/s compared to 62mb/s
at the beginning.

Thanks


Raid1 replaced with raid10?

2007-05-04 Thread Peter Rabbitson
Hi,

I asked this question back in march but received no answers, so here it
goes again. Is it safe to replace raid1 with raid10 where the amount of
disks is equal to the amount of far/near/offset copies? I understand it
has the downside of not being a bit-by-bit mirror of a plain filesystem.
Are there any other caveats?

Thanks

Peter


Re: XFS sunit/swidth for raid10

2007-03-23 Thread Peter Rabbitson

dean gaudet wrote:

On Thu, 22 Mar 2007, Peter Rabbitson wrote:


dean gaudet wrote:

On Thu, 22 Mar 2007, Peter Rabbitson wrote:


Hi,
How does one determine the XFS sunit and swidth sizes for a software
raid10
with 3 copies?

mkfs.xfs uses the GET_ARRAY_INFO ioctl to get the data it needs from
software raid and select an appropriate sunit/swidth...

although i'm not sure i agree entirely with its choice for raid10:

So do I, especially as it makes no checks for the amount of copies (3 in my
case, not 2).


it probably doesn't matter.

This was essentially my question. For an array -pf3 -c1024 I get swidth = 4 *
sunit = 4MiB. Is it about right and does it matter at all?


how many drives?



Sorry. 4 drives, 3 far copies (so any 2 drives can fail), 1M chunk.
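For reference, the explicit mkfs.xfs invocation for this particular array would be roughly as follows (sunit/swidth are given in 512-byte sectors, so a 1M chunk is 2048 and 4 x sunit is 8192; the md device name is hypothetical):

mkfs.xfs -d sunit=2048,swidth=8192 /dev/md0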


XFS sunit/swidth for raid10

2007-03-21 Thread Peter Rabbitson

Hi,
How does one determine the XFS sunit and swidth sizes for a software 
raid10 with 3 copies?


Thanks

Peter


raid10 far layout outperforms offset at writing? (was: Help with chunksize on raid10 -p o3 array)

2007-03-19 Thread Peter Rabbitson

Peter Rabbitson wrote:
I have been trying to figure out the best chunk size for raid10 before 
migrating my server to it (currently raid1). I am looking at 3 offset 
stripes, as I want to have two drive failure redundancy, and offset 
striping is said to have the best write performance, with read 
performance equal to far.


Incorporating suggestions from previous posts (thank you everyone), I 
used this modified script at http://rabbit.us/pool/misc/raid_test2.txt 
To negate effects of caching, memory was jammed below 200mb free by using 
a full tmpfs mount with no swap. Here is what I got with far layout (-p 
f3): http://rabbit.us/pool/misc/raid_far.html The clear winner is 1M 
chunks, and is very consistent at any block size. I was surprised even 
more to see that my read speed was identical to that of a raid0 getting 
near the _maximum_ physical speed of 4 drives (roughly 55MB sustained 
across 1.2G). Unlike offset layout, far really shines at reading stuff 
back. The write speed did not suffer noticeably compared to offset 
striping. Here are the results (-p o3) for comparison: 
http://rabbit.us/pool/misc/raid_offset.html, and they roughly seem to 
correlate with my earlier testing using dd.


So I guess the way to go for this system will be f3, although the md(4) 
says that offset layout should be more beneficial. Is there anything I 
missed while setting my o3 array, so that I got worse performance for 
both read and write compared to f3?


Once again thanks everyone for the help.
Peter


Raid1 replaced with raid10?

2007-03-19 Thread Peter Rabbitson

Hi,

I just tried an idea I got after fiddling with raid10 and to my dismay 
it worked as I thought it will. I used two small partitions on separate 
disks to create a raid1 array. Then I did dd if=/dev/md2 of=/dev/null. I 
got only one of the disks reading. Nothing unexpected. Then I created a 
raid10 array on the same two partitions with the options -l10 -n2 -pf2. 
The same dd executed at twice the speed, reading _simultaneously_ from 
both drives. I did some bonnie++ benchmarking - same result - raid1 
reads only from a single disk raid10 from both. Write performance is 
worse (about 10% slower) with raid10, but you get twice the read speed.
In this light the obvious question is: can raid10 be used as a drop-in 
replacement for raid1 or there is a caveat with having the amount of 
disks equal the amount of chunk copies?
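A sketch of the comparison described above (partition names hypothetical; the thread reused the same two partitions, recreating the array in between):

# first pass: plain mirror - dd reads from only one drive
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdX1 /dev/sdY1
dd if=/dev/md2 of=/dev/null bs=1M count=2000

# second pass: same partitions as raid10 with two far copies - reads interleave
mdadm --stop /dev/md2
mdadm --create /dev/md2 --level=10 --raid-devices=2 -p f2 /dev/sdX1 /dev/sdY1
dd if=/dev/md2 of=/dev/null bs=1M count=2000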


Peter


Re: Help with chunksize on raid10 -p o3 array

2007-03-12 Thread Peter Rabbitson

Neil Brown wrote:

The different block sizes in the reads will make very little
difference to the results as the kernel will be doing read-ahead for
you.  If you want to really test throughput at different block sizes
you need to insert random seeks.



Neil, thank you for the time and effort to answer my previous email. 
Excellent insights. I thought that read-ahead is filesystem specific, 
and subsequently I would be safe to use the raw device. I will 
definitely test with bonnie again.


* Why although I have 3 identical chunks of data at any time, dstat 
never showed simultaneous reading from more than 2 drives. Every dd run 
was accompanied by maxing out one of the drives at 58MB/s and another 
one was trying to catch up to various degrees depending on the chunk 
size. Then on the next dd run two other drives would be (seemingly 
random) selected and the process would repeat.


Poor read-balancing code.  It really needs more thought.
Possibly for raid10 we shouldn't try to balance at all.  Just read
from the 'first' copy in each case


Is this anywhere near the top of the todo list, or for now raid10 users 
are bound to a maximum read speed of a two drive combination?


And a last question - earlier in this thread Bill Davidsen suggested to 
play with the stripe_cache_size. I tried to increase it (did just two 
tests though) with no apparent effect. Does this setting apply to 
raid1/10 at all or it is strictly in the raid5/6 domain? If so, are 
there any tweaks apart from the chunk size and the layout that can 
affect raid10 performance?


Once again thank you for the help.

Peter


Re: Help with chunksize on raid10 -p o3 array

2007-03-12 Thread Peter Rabbitson

Richard Scobie wrote:

Peter Rabbitson wrote:

Is this anywhere near the top of the todo list, or for now raid10 
users are bound to a maximum read speed of a two drive combination?


I have not done any testing with the md native RAID10 implementations, 
so perhaps there are some other advantages, but have you tried setting 
up your 4 drives as a RAID 0 made up of a pair of RAID1s?


The advantage is higher redundancy: I can have any two drives fail 
in a x3 layout, unlike the raid1/0 setup, although I sacrifice available 
disk space. But yes, I agree that if I was after pure throughput raid1/0 
would be more beneficial, with the downside of 1.5 disk failure redundancy.



Re: Help with chunksize on raid10 -p o3 array

2007-03-07 Thread Peter Rabbitson

Bill Davidsen wrote:

Peter Rabbitson wrote:

Hi,
I have been trying to figure out the best chunk size for raid10 before 


By any chance did you remember to increase stripe_cache_size to match 
the chunk size? If not, there you go.


At the end of /usr/src/linux/Documentation/md.txt it specifically says 
that stripe_cache_size is raid5 specific, and it made sense to me, as 
caching stuff to avoid re-doing parity is beneficial. I will test later 
today trying to set the cache higher. Are there any guidelines on how 
large should it be in relation to the chunk size/number of drives for 
raid10?



Re: mismatch_cnt questions - how about raid10?

2007-03-06 Thread Peter Rabbitson

Neil Brown wrote:

When we write to a raid1, the data is DMAed from memory out to each
device independently, so if the memory changes between the two (or
more) DMA operations, you will get inconsistency between the devices.


Does this apply to raid 10 devices too? And in case of LVM if swap is on 
top of a LV which is a part of a VG which has a single PV as the raid 
array - will this happen as well? Or will the LVM layer take the data 
once and distribute exact copies of it to the PVs (in this case just the 
raid) effectively giving the raid array invariable data?



Re: mismatch_cnt questions - how about raid10?

2007-03-06 Thread Peter Rabbitson

Neil Brown wrote:

On Tuesday March 6, [EMAIL PROTECTED] wrote:

Neil Brown wrote:

When we write to a raid1, the data is DMAed from memory out to each
device independently, so if the memory changes between the two (or
more) DMA operations, you will get inconsistency between the devices.
Does this apply to raid 10 devices too? And in case of LVM if swap is on 
top of a LV which is a part of a VG which has a single PV as the raid 
array - will this happen as well? Or will the LVM layer take the data 
once and distribute exact copies of it to the PVs (in this case just the 
raid) effectively giving the raid array invariable data?


Yes, it applies to raid10 too.

I don't know the details of the inner workings of LVM, but I doubt it
will make a difference.  Copying the data in memory is just too costly
to do if it can be avoided.  With LVM and raid1/10 it can be avoided
with no significant cost.
With raid4/5/6, not copying into the cache can cause data corruption.
So we always copy.



I see. So basically for those of us who want to run swap on raid 1 or 
10, and at the same time want to rely on mismatch_cnt for early problem 
detection, the only option is to create a separate md device just for 
the swap. Is this about right?



swap on raid

2007-03-01 Thread Peter Rabbitson
Hi,

I need to use a raid volume for swap, utilizing partitions from 4 
physical drives I have available. From my experience I have three 
options - raid5, raid10 with 2 offset chunks, and two raid 1 volumes 
that are swapon-ed with equal priority. However I have a hard time 
figuring out what to use as I am not really sure how can I detect the 
usage patterns of swap, left alone benchmark it. Has anyone done 
anything like this, or is there information on what kind of reads/writes 
the kernel performs when paging in and out?

Before you answer my question - yes, I am painfully aware of the 
paradigm swap on raid is bad, and I know there are other ways to solve 
it, but my situation requires me to have swap. Several weeks ago a drive 
failed and took a full partition away bringing the system to its knees 
and caused massive data corruption. I am also aware that I can use a 
file that will reside alongside my other data, but fragmentation makes 
this approach inefficient. So I am looking into placing the swap 
directly on a raid voulme.
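For the archives, the third option mentioned above (two raid1 volumes swapon-ed with equal priority) would look roughly like this in /etc/fstab; the md device names are hypothetical:

/dev/md3  none  swap  sw,pri=5  0  0
/dev/md4  none  swap  sw,pri=5  0  0

Equal pri= values make the kernel stripe page-out across both swap areas.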


Thanks 

Peter



Re: swap on raid

2007-03-01 Thread Peter Rabbitson
 The fact that you mention you are using partitions on disks that 
 possibly have other partions doing other things, means raw performance 
 will be compromised anyway.
 
 Regards,
 
 Richard

You know I never thought about it, but you are absolutely right. The 
times at which my memory usage peaks coincide with high disk activity 
(mostly reads). In this light it actually might be better to keep the 
swap in a file on my raid10 (-p n3) which occupies most of these 4 
drives, and hope that the md code will be able to distribute the io 
across idle drives. Does this sound about right?


Re: RAID0 to RAID5 upgrade

2007-03-01 Thread Peter Rabbitson
On Thu, Mar 01, 2007 at 06:12:32PM -0500, Bill Davidsen wrote:
 I have three drives, with some various partitions, currently set up like 
 this.
 
  drive0drive1drive2
 
   hdb1  hdi1  hdk1
   \_RAID1/
 
   hdb2  hdi2  hdk2
  unused \___RAID0/
   200GB   100GB x 2
 
 hdi3  hdk3
 \___unused___/
100GB x 2
 
 What I want to have is 3 x 200 = 400GB RAID5.
 
 I would like to avoid copying 200GB of data to another machine and back 

Can't you do the following:

* copy the data from raid0 to hdb2 ( raid0 <= hdb2, you can even do a dd)
* degrade raid1 to only contain drive0
* since you have all your data on drive0, wipe drive1 and drive2 clean, 
create a degraded raid5
* copy stuff from drive0 to the new array (new fs as well I presume)
* resync the raid5 with drive0
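A hedged sketch of the degraded-raid5 step (partition names hypothetical; in the layout above, 200GB components would first have to be carved out of hdi and hdk):

# create the raid5 with one slot deliberately left missing
mdadm --create /dev/md3 --level=5 --raid-devices=3 missing /dev/hdi2 /dev/hdk2
# ... copy the data onto it, then complete it with drive0's partition
mdadm --add /dev/md3 /dev/hdb2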




RAID10 Resync fails with specific chunk size and drive sizes (reproducible)

2007-02-20 Thread Peter Rabbitson
Hi,

I think I've hit a reproducible bug in the raid 10 driver, tried on two 
different machines with kernels 2.6.20 and 2.6.18. This is a script to 
simulate the problem:

==
#!/bin/bash

modprobe loop

for ID in 1 2 3 ; do
echo -n Creating loopback device $ID... 
dd if=/dev/zero of=dsk${ID}.img bs=512 count=995967
losetup /dev/loop${ID} dsk${ID}.img
echo done.
done

mdadm -C /dev/md2 -l 10 -n 3 -p o2 -c 2048 /dev/loop1 /dev/loop2 /dev/loop3
echo "Raid device assembled, check /proc/mdstat's output when resync is finished"
==

This is the output I get in /proc/mdstat after the resync settles:

==
md2 : active raid10 loop3[2] loop2[3](F) loop1[0]
  746496 blocks 2048K chunks 2 offset-copies [3/2] [U_U]
==


Re: RAID10 Resync fails with specific chunk size and drive sizes (reproducible)

2007-02-20 Thread Peter Rabbitson
After I sent the message I received the 6 patches from Neil Brown. I 
applied the first one (Fix Raid10 recovery problem) and it seems to be 
taking care of the issue I am describing. Probably due to the rounding 
fixes.

Thanks 
