Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-06 Thread Luca Berra

On Mon, Feb 04, 2008 at 07:38:40PM +0300, Michael Tokarev wrote:

Eric Sandeen wrote:
[]

http://oss.sgi.com/projects/xfs/faq.html#nulls

and note that recent fixes have been made in this area (also noted in
the faq)

Also - the above all assumes that when a drive says it's written/flushed
data, that it truly has.  Modern write-caching drives can wreak havoc
with any journaling filesystem, so that's one good reason for a UPS.  If


Unfortunately a UPS does not *really* help here.  Unless it has a
control program which properly shuts the system down on loss of
input power, and the battery really has the capacity to power the
system while it's shutting down (has anyone tested this?  With a new
UPS?  And after a year of use, when the battery is no longer new?)
-- unless the UPS actually has the capacity to shut the system down,
it will cut the power at an unexpected time, while the disk(s) still
have dirty caches...


If the UPS is supported by nut (http://www.networkupstools.org) you can
do this easily.
Obviously you should tune the timeout to give your systems enough time
to shut down in case of a power outage, periodically check your
battery duration (that means real tests), and re-tune the nut software
accordingly (and when you discover your battery is dead, change it).
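
For illustration, the relevant pieces of a nut setup look roughly like this
(a sketch only; the UPS name, driver, port and password are examples, and the
driver must match your hardware):

  # /etc/nut/ups.conf
  [myups]
      driver = usbhid-ups
      port = auto

  # /etc/nut/upsmon.conf
  MONITOR myups@localhost 1 upsmon somepassword master
  SHUTDOWNCMD "/sbin/shutdown -h +0"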

L.

--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media  Services S.r.l.
/"\
\ /  ASCII RIBBON CAMPAIGN
 X   AGAINST HTML MAIL
/ \


Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-02-06 Thread Peter Rabbitson

Janek Kozicki wrote:

writing on raid10 is supposed to be half the speed of reading. That's
because it must write to both mirrors.



I am not 100% certain about the following rules, but afaik any raid 
configuration has a theoretical[1] maximum read speed of the combined speed of 
all disks in the array and a maximum write speed equal to the combined speed 
of a disk-length of a stripe. By disk-length I mean how many disks are needed 
to reconstruct a single stripe - the rest of the writes are redundancy and are 
essentially non-accountable work. For raid5 it is N-1. For raid6 - N-2. For 
linux raid 10 it is N-C+1 where C is the number of chunk copies. So for -p n3 
-n 5 we would get a maximum write speed of 3 x single drive speed. For raid1 
the disk-length of a stripe is always 1.


So the statement

IMHO raid5 could perform well here, because in *continuous* write
operation the blocks from the other HDDs have just been written,
they stay in cache and can be used to calculate xor. So you could get
close to almost raid-0 performance here.

is quite incorrect. You will get close to raid-0 if you have many disks, but
will never beat raid0, since one disk is always busy writing parity, which is
not part of the write request submitted to the mdX device in the first place.


[1] Theoretical since any external factors (busy CPU, unsuitable elevator, 
random disk access, multiple raid levels on one physical device) would all 
contribute to take you further away from the maximums.
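
(As a rough worked example of the rules above, assuming 80 MB/s drives: a
6-disk raid5 would have a theoretical read ceiling of about 6 x 80 = 480 MB/s,
but a write ceiling of about (6-1) x 80 = 400 MB/s, since one disk's worth of
every stripe is parity rather than submitted data.)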



Re: draft howto on making raids for surviving a disk crash

2008-02-06 Thread Luca Berra

On Sat, Feb 02, 2008 at 08:41:31PM +0100, Keld Jørn Simonsen wrote:

Make each of the disks bootable by lilo:

  lilo -b /dev/sda /etc/lilo.conf1
  lilo -b /dev/sdb /etc/lilo.conf2

There should be no need for that.
To achieve the above effect with lilo you use
raid-extra-boot=mbr-only
in lilo.conf.
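
For illustration, a minimal lilo.conf along those lines might look like this
(a sketch only; kernel path, label, initrd and root device are examples):

  boot = /dev/md0
  raid-extra-boot = mbr-only
  root = /dev/md0
  image = /boot/vmlinuz
      label = linux
      initrd = /boot/initrd.img
      read-only

  # then just run "lilo" once; the MBR of every disk in the array gets written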


Make each of the disks bootable by grub

install grub with the command
grub-install /dev/md0

L.

--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media  Services S.r.l.
/"\
\ /  ASCII RIBBON CAMPAIGN
 X   AGAINST HTML MAIL
/ \


Re: Deleting mdadm RAID arrays

2008-02-06 Thread Marcin Krol
Tuesday 05 February 2008 21:12:32 Neil Brown wrote:

  % mdadm --zero-superblock /dev/sdb1
  mdadm: Couldn't open /dev/sdb1 for write - not zeroing
 
 That's weird.
 Why can't it open it?

Hell if I know. First time I've seen such a thing.

 Maybe you aren't running as root (The '%' prompt is suspicious).

I am running as root; the % prompt is part of the obfuscation (I have
configured bash to display the IP as part of the prompt).

 Maybe the kernel has  been told to forget about the partitions of
 /dev/sdb.

But fdisk/cfdisk has no problem whatsoever finding the partitions .

 mdadm will sometimes tell it to do that, but only if you try to
 assemble arrays out of whole components.

 If that is the problem, then
blockdev --rereadpt /dev/sdb

I deleted LVM devices that were sitting on top of RAID and reinstalled mdadm.

% blockdev --rereadpt /dev/sdf
BLKRRPART: Device or resource busy

% mdadm /dev/md2 --fail /dev/sdf1
mdadm: set /dev/sdf1 faulty in /dev/md2

% blockdev --rereadpt /dev/sdf
BLKRRPART: Device or resource busy

% mdadm /dev/md2 --remove /dev/sdf1
mdadm: hot remove failed for /dev/sdf1: Device or resource busy

lsof /dev/sdf1 gives ZERO results.

arrrRRRGH

Regards,
Marcin Krol


Re: Deleting mdadm RAID arrays

2008-02-06 Thread Marcin Krol
Tuesday 05 February 2008 12:43:31 Moshe Yudkowsky wrote:

  1. Where does this info on the array reside?! I have deleted /etc/mdadm/mdadm.conf
  and the /dev/md devices and yet it comes seemingly out of nowhere.

 /boot has a copy of mdadm.conf so that / and other drives can be started 
 and then mounted. update-initramfs will update /boot's copy of mdadm.conf.

Yeah, I found that while deleting mdadm package...

Thanks for answers everyone anyway.
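
(For reference, a rough sketch of refreshing and inspecting that copy on a
Debian-style system, assuming update-initramfs and a gzip-compressed cpio
initrd:)

  update-initramfs -u -k $(uname -r)
  zcat /boot/initrd.img-$(uname -r) | cpio -t 2>/dev/null | grep mdadm.conf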

Regards,
Marcin Krol




Re: Deleting mdadm RAID arrays

2008-02-06 Thread Peter Rabbitson

Marcin Krol wrote:

Tuesday 05 February 2008 21:12:32 Neil Brown wrote:


% mdadm --zero-superblock /dev/sdb1
mdadm: Couldn't open /dev/sdb1 for write - not zeroing

That's weird.
Why can't it open it?


Hell if I know. First time I've seen such a thing.


Maybe you aren't running as root (The '%' prompt is suspicious).


I am running as root; the % prompt is part of the obfuscation (I have
configured bash to display the IP as part of the prompt).


Maybe the kernel has  been told to forget about the partitions of
/dev/sdb.


But fdisk/cfdisk has no problem whatsoever finding the partitions .


mdadm will sometimes tell it to do that, but only if you try to
assemble arrays out of whole components.



If that is the problem, then
   blockdev --rereadpt /dev/sdb


I deleted LVM devices that were sitting on top of RAID and reinstalled mdadm.

% blockdev --rereadpt /dev/sdf
BLKRRPART: Device or resource busy

% mdadm /dev/md2 --fail /dev/sdf1
mdadm: set /dev/sdf1 faulty in /dev/md2

% blockdev --rereadpt /dev/sdf
BLKRRPART: Device or resource busy

% mdadm /dev/md2 --remove /dev/sdf1
mdadm: hot remove failed for /dev/sdf1: Device or resource busy

lsof /dev/sdf1 gives ZERO results.



What does this say:

dmsetup table



Re: Deleting mdadm RAID arrays

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
 
  Maybe the kernel has  been told to forget about the partitions of
  /dev/sdb.
 
 But fdisk/cfdisk has no problem whatsoever finding the partitions .

It is looking at the partition table on disk.  Not at the kernel's
idea of partitions, which is initialised from that table...

What does

  cat /proc/partitions

say?

 
  mdadm will sometimes tell it to do that, but only if you try to
  assemble arrays out of whole components.
 
  If that is the problem, then
 blockdev --rereadpt /dev/sdb
 
 I deleted LVM devices that were sitting on top of RAID and reinstalled mdadm.
 
 % blockdev --rereadpt /dev/sdf
 BLKRRPART: Device or resource busy
 

Implies that some partition is in use.

 % mdadm /dev/md2 --fail /dev/sdf1
 mdadm: set /dev/sdf1 faulty in /dev/md2
 
 % blockdev --rereadpt /dev/sdf
 BLKRRPART: Device or resource busy
 
 % mdadm /dev/md2 --remove /dev/sdf1
 mdadm: hot remove failed for /dev/sdf1: Device or resource busy

OK, that's weird.  If sdf1 is faulty, then you should be able to
remove it.  What does
  cat /proc/mdstat
  dmesg | tail

say at this point?

NeilBrown


Re: Deleting mdadm RAID arrays

2008-02-06 Thread Marcin Krol
Wednesday 06 February 2008 11:11:51 Peter Rabbitson wrote:
  lsof /dev/sdf1 gives ZERO results.
  
 
 What does this say:
 
   dmsetup table


% dmsetup table
vg-home: 0 61440 linear 9:2 384

Regards,
Marcin Krol


Re: Deleting mdadm RAID arrays

2008-02-06 Thread David Greaves
Marcin Krol wrote:
 Hello everyone,
 
 I have had a problem with RAID array (udev messed up disk names, I've had 
 RAID on
 disks only, without raid partitions)

Do you mean that you originally used /dev/sdb for the RAID array? And now you
are using /dev/sdb1?

Given the system seems confused I wonder if this may be relevant?

David



Re: Deleting mdadm RAID arrays

2008-02-06 Thread Marcin Krol
Wednesday 06 February 2008 12:22:00:
  I have had a problem with RAID array (udev messed up disk names, I've had 
  RAID on
  disks only, without raid partitions)
 
 Do you mean that you originally used /dev/sdb for the RAID array? And now you
 are using /dev/sdb1?

That's reconfigured now, so it doesn't matter (I started the host up in single
user mode and created partitions, as opposed to previously running RAID on
whole disks).
 
 Given the system seems confused I wonder if this may be relevant?

I don't think so, I tried most mdadm operations (fail, remove, etc) on disks 
(like sdb) and 
partitions (like sdb1) and get identical messages for either.


-- 
Marcin Krol



Re: Deleting mdadm RAID arrays

2008-02-06 Thread Marcin Krol
Wednesday 06 February 2008 11:43:12:
 On Wednesday February 6, [EMAIL PROTECTED] wrote:
  
   Maybe the kernel has  been told to forget about the partitions of
   /dev/sdb.
  
  But fdisk/cfdisk has no problem whatsoever finding the partitions .
 
 It is looking at the partition table on disk.  Not at the kernel's
 idea of partitions, which is initialised from that table...

Aha! Thanks for this bit. I get it now.

 What does
 
   cat /proc/partitions
 
 say?

Note: I have reconfigured udev now to associate device names with serial
numbers (below)

% cat /proc/partitions
major minor  #blocks  name

   8     0  390711384 sda
   8     1  390708801 sda1
   8    16  390711384 sdb
   8    17  390708801 sdb1
   8    32  390711384 sdc
   8    33  390708801 sdc1
   8    48  390710327 sdd
   8    49  390708801 sdd1
   8    64  390711384 sde
   8    65  390708801 sde1
   8    80  390711384 sdf
   8    81  390708801 sdf1
   3    64   78150744 hdb
   3    65    1951866 hdb1
   3    66    7815622 hdb2
   3    67    4883760 hdb3
   3    68          1 hdb4
   3    69     979933 hdb5
   3    70     979933 hdb6
   3    71   61536951 hdb7
   9     1  781417472 md1
   9     0  781417472 md0



/dev/disk/by-id % ls -l

total 0
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-ST380023A_3KB0MV22 -> ../../hdb
lrwxrwxrwx 1 root root 10 2008-02-06 13:34 ata-ST380023A_3KB0MV22-part1 -> ../../hdb1
lrwxrwxrwx 1 root root 10 2008-02-06 13:34 ata-ST380023A_3KB0MV22-part2 -> ../../hdb2
lrwxrwxrwx 1 root root 10 2008-02-06 13:34 ata-ST380023A_3KB0MV22-part3 -> ../../hdb3
lrwxrwxrwx 1 root root 10 2008-02-06 13:34 ata-ST380023A_3KB0MV22-part4 -> ../../hdb4
lrwxrwxrwx 1 root root 10 2008-02-06 13:34 ata-ST380023A_3KB0MV22-part5 -> ../../hdb5
lrwxrwxrwx 1 root root 10 2008-02-06 13:34 ata-ST380023A_3KB0MV22-part6 -> ../../hdb6
lrwxrwxrwx 1 root root 10 2008-02-06 13:34 ata-ST380023A_3KB0MV22-part7 -> ../../hdb7
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1696130 -> ../../d_6
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1696130-part1 -> ../../d_6
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1707974 -> ../../d_5
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1707974-part1 -> ../../d_5
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1795228 -> ../../d_1
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1795228-part1 -> ../../d_1
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1795364 -> ../../d_3
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1795364-part1 -> ../../d_3
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1798692 -> ../../d_2
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1798692-part1 -> ../../d_2
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1800255 -> ../../d_4
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 ata-WDC_WD4000KD-00N-WD-WMAMY1800255-part1 -> ../../d_4
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1696130 -> ../../d_6
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1696130-part1 -> ../../d_6
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1707974 -> ../../d_5
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1707974-part1 -> ../../d_5
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1795228 -> ../../d_1
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1795228-part1 -> ../../d_1
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1795364 -> ../../d_3
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1795364-part1 -> ../../d_3
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1798692 -> ../../d_2
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1798692-part1 -> ../../d_2
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1800255 -> ../../d_4
lrwxrwxrwx 1 root root  9 2008-02-06 13:34 scsi-S_WD-WMAMY1800255-part1 -> ../../d_4

I have no idea why udev can't allocate /dev/d_1p1 to partition 1 on disk d_1. I 
have
explicitly asked it to do that:

/etc/udev/rules.d % cat z24_disks_domeny.rules


KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1795228", NAME="d_1"
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1795228-part1", NAME="d_1p1"

KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1798692", NAME="d_2"
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1798692-part1", NAME="d_2p1"

KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1795364", NAME="d_3"
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1795364-part1", NAME="d_3p1"

KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1800255", NAME="d_4"
KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1800255-part1", NAME="d_4p1"

KERNEL=="sd*", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="WD-WMAMY1707974", NAME="d_5"
KERNEL=="sd*", SUBSYSTEM=="block",

Disk failure during grow, what is the current state.

2008-02-06 Thread Steve Fairbairn

Hi All,

I was wondering if someone might be willing to confirm what the current
state of my RAID array is, given the following sequence of events (sorry,
it's pretty long).

I had a clean, running /dev/md0 using 5 disks in RAID 5 (sda1, sdb1,
sdc1, sdd1, hdd1).  It had been clean like that for a while.  So last
night I decided it was safe to grow the array onto a sixth disk:

[EMAIL PROTECTED] ~]# mdadm /dev/md0 --add /dev/hdi1
mdadm: added /dev/hdi1
[EMAIL PROTECTED] ~]# mdadm -D /dev/md0
/dev/md0:
Version : 00.90.03
  Creation Time : Wed Jan  9 18:57:53 2008
 Raid Level : raid5
 Array Size : 1953535744 (1863.04 GiB 2000.42 GB)
  Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 5
  Total Devices : 6
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Tue Feb  5 23:55:59 2008
  State : clean
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1

 Layout : left-symmetric
 Chunk Size : 64K

   UUID : 382c157a:405e0640:c30f9e9e:888a5e63
 Events : 0.429616

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3      22       65        3      active sync   /dev/hdd1
       4       8       49        4      active sync   /dev/sdd1

       5      56        1        -      spare   /dev/hdi1
[EMAIL PROTECTED] ~]# mdadm --grow /dev/md0 --raid-devices=6
mdadm: Need to backup 1280K of critical section..
mdadm: ... critical section passed.
[EMAIL PROTECTED] ~]# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 hdi1[5] sdd1[4] sdc1[2] sdb1[1] sda1[0] hdd1[3]
  1953535744 blocks super 0.91 level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
      [>....................]  reshape =  0.0% (29184/488383936) finish=2787.4min speed=2918K/sec
  
unused devices: none
[EMAIL PROTECTED] ~]# 

OK, so that would take nearly 2 days to complete, so I went to bed happy
about 10 hours ago.

I come to the machine this morning, and I have the following

[EMAIL PROTECTED] ~]# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 hdi1[5] sdd1[6](F) sdc1[2] sdb1[1] sda1[0] hdd1[3]
  1953535744 blocks super 0.91 level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U]
  
unused devices: none
You have new mail in /var/spool/mail/root
[EMAIL PROTECTED] ~]# mdadm -D /dev/md0
/dev/md0:
Version : 00.91.03
  Creation Time : Wed Jan  9 18:57:53 2008
 Raid Level : raid5
 Array Size : 1953535744 (1863.04 GiB 2000.42 GB)
  Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Wed Feb  6 05:28:09 2008
  State : clean, degraded
 Active Devices : 5
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 0

 Layout : left-symmetric
 Chunk Size : 64K

  Delta Devices : 1, (5-6)

   UUID : 382c157a:405e0640:c30f9e9e:888a5e63
 Events : 0.470964

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3      22       65        3      active sync   /dev/hdd1
       4       0        0        4      removed
       5      56        1        5      active sync   /dev/hdi1

       6       8       49        -      faulty spare
[EMAIL PROTECTED] ~]# df -k
Filesystem   1K-blocks  Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
  56086828  11219432  41972344  22% /
/dev/hda1   101086 18281 77586  20% /boot
/dev/md0 1922882096 1775670344  69070324  97% /Downloads
tmpfs   513556 0513556   0% /dev/shm
[EMAIL PROTECTED] ~]# mdadm /dev/md0 --remove /dev/sdd1
mdadm: cannot find /dev/sdd1: No such file or directory
[EMAIL PROTECTED] ~]#

As you can see, one of the original 5 devices (sdd1) has failed and been
automatically removed.  The reshape has stopped, but the new disk seems
to be in and clean, which is the bit I don't understand.  The new disk
hasn't been added to the size, so it would seem that md has switched it
to being used as a spare instead (possibly because the grow hadn't
completed?).

How come it seems to have recovered so nicely?
Is there something I can do to check its integrity?
Was it just so much quicker than 2 days because it switched to only
having to sort out the 1 disk?  Would it be safe to run an fsck to check
the integrity of the fs?  I don't want to inadvertently blat the raid
array by 'using' it when it's in a dodgy state.

I have unmounted the drive for the time being, so that it doesn't get
any writes until I know what state it is really in.

Re: draft howto on making raids for surviving a disk crash

2008-02-06 Thread Michal Soltys

Keld Jørn Simonsen wrote:


Make each of the disks bootable by grub

(to be described)



It would probably be good to show how to use the grub shell's install
command. It's the most flexible way and gives the most (or rather, total)
control. I could write some examples.
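
For example, something along these lines in the grub shell puts the loader
into the MBR of both mirror members (a sketch only; the device names and
partition numbers are assumptions, and grub's setup command is a front end
to the lower-level install command mentioned above):

  grub> device (hd0) /dev/sda
  grub> root (hd0,0)
  grub> setup (hd0)
  grub> device (hd0) /dev/sdb
  grub> root (hd0,0)
  grub> setup (hd0)
  grub> quit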



Purpose of Document? (was Re: draft howto on making raids for surviving a disk crash)

2008-02-06 Thread Moshe Yudkowsky
I read through the document, and I've signed up for a Wiki account so I 
can edit it.


One of the things I wanted to do was correct the title. I see that there 
are *three* different Wiki pages about how to build a system that boots 
from RAID. None of them are complete yet.


So, what is the purpose of this page? I think the purpose is a complete 
description of how to use RAID to build a system that not only boots 
from RAID but is robust against other hazards such as file system 
corruption.


--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 If you pay peanuts, you get monkeys.
  Edward Yourdon, _The Decline and Fall of the American Programmer_


Re: Disk failure during grow, what is the current state.

2008-02-06 Thread Nagilum

- Message from [EMAIL PROTECTED] -
Date: Wed, 6 Feb 2008 12:58:55 -
From: Steve Fairbairn [EMAIL PROTECTED]
Reply-To: Steve Fairbairn [EMAIL PROTECTED]
 Subject: Disk failure during grow, what is the current state.
  To: linux-raid@vger.kernel.org



As you can see, one of the original 5 devices has failed (sdd1) and
automatically removed.  The reshape has stopped, but the new disk seems
to be in and clean which is the bit I don't understand.  The new disk
hasn't been added to the size, so it would seem that md has switched it
to being used as a spare instead (possibly as the grow hadn't
completed?).

How come it seems to have recovered so nicely?
Is there something I can do to check it's integrity?
Was it just so much quicker than 2 days because it switched to only
having to sort out the 1 disk? Would it be safe to run an fsck to check
the integrity of the fs?  I don't want to inadvertently blat the raid
array by 'using' it when it's in a dodgy state.

I have unmounted the drive for the time being, so that it doesn't get
any writes until I know what state it is really in.



- End message from [EMAIL PROTECTED] -

If a drive fails during reshape, the reshape will just continue.
The blocks which were on the failed drive are calculated from the
other disks, and writes to the failed disk are simply omitted.

The result is a raid5 with a failed drive.
You should get a new drive asap to restore the redundancy.
Also it's kinda important that you don't run 2.6.23 because it has a  
nasty bug which would be triggered in this scenario.
The reshape probably increased in speed after the system was no longer  
actively used and io bandwidth freed up.
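
(A sketch of what that looks like in practice once the replacement disk is
in; the device names are assumptions:)

  cat /proc/mdstat                             # let the reshape run to completion first
  mdadm --detail /dev/md0                      # should show a clean but degraded 6-device raid5
  mdadm /dev/md0 --add /dev/sdd1               # add the new partition; recovery starts by itself
  echo check > /sys/block/md0/md/sync_action   # optional parity scrub once recovery has finished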

Kind regards,
Alex.



#_  __  _ __ http://www.nagilum.org/ \n icq://69646724 #
#   / |/ /__  _(_) /_  _  [EMAIL PROTECTED] \n +491776461165 #
#  // _ `/ _ `/ / / // /  ' \  Amiga (68k/PPC): AOS/NetBSD/Linux   #
# /_/|_/\_,_/\_, /_/_/\_,_/_/_/_/   Mac (PPC): MacOS-X / NetBSD /Linux #
#   /___/ x86: FreeBSD/Linux/Solaris/Win2k  ARM9: EPOC EV6 #




cakebox.homeunix.net - all the machine one needs..





FW: Disk failure during grow, what is the current state.

2008-02-06 Thread Steve Fairbairn

I'm having a nightmare with emails today.  I can't get a single one
right first time.  Apologies to Alex for sending it directly to him and
not to the list on first attempt.

Steve

 -Original Message-
 From: Steve Fairbairn [mailto:[EMAIL PROTECTED] 
 Sent: 06 February 2008 15:02
 To: 'Nagilum'
 Subject: RE: Disk failure during grow, what is the current state.
 
 
 
 
  -Original Message-
  From: [EMAIL PROTECTED]
  [mailto:[EMAIL PROTECTED] On Behalf Of Nagilum
  Sent: 06 February 2008 14:34
  To: Steve Fairbairn
  Cc: linux-raid@vger.kernel.org
  Subject: Re: Disk failure during grow, what is the current state.
  
  
  
  If a drive failes during reshape the reshape will just
  continue. The blocks which were on the failed drive are 
  calculated from the the  
  other disks and writes to the failed disk are simply omitted. 
  The result is a raid5 with a failed drive. You should get a 
  new drive asap to restore the redundancy. Also it's kinda 
  important that you don't run 2.6.23 because it has a  
  nasty bug which would be triggered in this scenario.
  The reshape probably increased in speed after the system was 
  no longer  
  actively used and io bandwidth freed up.
  Kind regards,
  Alex.
  
 
 Thanks for the response Alex, but
 
  Array Size : 1953535744 (1863.04 GiB 2000.42 GB)
   Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
 
 Surely the added disk should now have been added to the Array
 Size?  5 * 500GB is 2500GB, not 2000GB.  This is why I don't
 think the reshape has continued.  As for speeding up because of
 no IO bandwidth, that doesn't really hold true either, because
 the system was at a point of not being used anyway before I
 added the disk, and I didn't unmount the drive until this
 morning, after it claimed it had finished doing anything.
 
 It's because the size doesn't match up to all 5 disks being 
 used that I still wonder at the state of the array.
 
 Steve.
 
 
 
 
 




Re: Purpose of Document? (was Re: draft howto on making raids for surviving a disk crash)

2008-02-06 Thread Keld Jørn Simonsen
On Wed, Feb 06, 2008 at 08:24:37AM -0600, Moshe Yudkowsky wrote:
 I read through the document, and I've signed up for a Wiki account so I 
 can edit it.
 
 One of the things I wanted to do was correct the title. I see that there 
 are *three* different Wiki pages about how to build a system that boots 
 from RAID. None of them are complete yet.
 
 So, what is the purpose of this page? I think the purpose is a complete 
 description of how to use RAID to build a system that not only boots 
 from RAID but is robust against other hazards such as file system 
 corruption.

You are right that there is more than one wiki page addressing
closely related issues. I also considered whether there was a need for the
new page, and discussed it with David.

And yes, my idea was to make a howto on building a system that can
survive a disk crash. A simple system that can also work for a
workstation. In fact that is possibly the main audience.

so my focus is: survive a failing disk, and keep it simple.

Best regards
Keld


Re: draft howto on making raids for surviving a disk crash

2008-02-06 Thread Keld Jørn Simonsen
On Wed, Feb 06, 2008 at 10:05:58AM +0100, Luca Berra wrote:
 On Sat, Feb 02, 2008 at 08:41:31PM +0100, Keld Jørn Simonsen wrote:
 Make each of the disks bootable by lilo:
 
   lilo -b /dev/sda /etc/lilo.conf1
   lilo -b /dev/sdb /etc/lilo.conf2
 There should be no need for that.
 to achieve the above effect with lilo you use
 raid-extra-boot=mbr-only
 in lilo.conf
 
 Make each of the disks bootable by grub
 install grub with the command
 grub-install /dev/md0

I have already changed the text on the wiki. Still, I am not convinced
that what is described there is the best advice.

best regards
keld


RE: Disk failure during grow, what is the current state.

2008-02-06 Thread Steve Fairbairn
  -Original Message-
  From: Steve Fairbairn [mailto:[EMAIL PROTECTED]
  Sent: 06 February 2008 15:02
  To: 'Nagilum'
  Subject: RE: Disk failure during grow, what is the current state.
  
  
   Array Size : 1953535744 (1863.04 GiB 2000.42 GB)
Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
  
  Surely the added disk should now have been added to the Array
  Size?  5 * 500GB is 2500GB, not 2000GB.  This is why I don't
  think the reshape has continued.  As for speeding up because of
  no IO bandwidth, that doesn't really hold true either, because
  the system was at a point of not being used anyway before I
  added the disk, and I didn't unmount the drive until this
  morning, after it claimed it had finished doing anything.
  

Thanks again to Alex for his comments.  I've just rebooted the box, and
the reshape has continued on the degraded array and an RMA has been
raised for the faulty disk.

Thanks,

Steve.




Re: raid10 on three discs - few questions.

2008-02-06 Thread Bill Davidsen

Neil Brown wrote:

On Sunday February 3, [EMAIL PROTECTED] wrote:
  

Hi,

Maybe I'll buy three HDDs to put a raid10 on them. And get the total
capacity of 1.5 of a disc. 'man 4 md' indicates that this is possible
and should work.

I'm wondering - how a single disc failure is handled in such configuration?

1. does the array continue to work in a degraded state?



Yes.

  

2. after the failure I can disconnect faulty drive, connect a new one,
   start the computer, add disc to array and it will sync automatically?




Yes.
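
(As an illustrative sketch of those replacement steps; the device names are
assumptions, adjust to your own setup:)

  mdadm /dev/md0 --remove /dev/sdb1   # drop the failed member if it is still listed
  # swap the physical drive, partition it like the others, then:
  mdadm /dev/md0 --add /dev/sdb1      # md starts the resync automatically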

  

Question seems a bit obvious, but the configuration is, at least for
me, a bit unusual. This is why I'm asking. Anybody here tested such
configuration, has some experience?


3. Another thing - would raid10,far=2 work when three drives are used?
   Would it increase the read performance?



Yes.

  

4. Would it be possible to later '--grow' the array to use 4 discs in
   raid10 ? Even with far=2 ?




No.

Well if by later you mean in five years, then maybe.  But the
code doesn't currently exist.
  


That's a reason to avoid raid10 for certain applications, then, and go 
with a more manual 1+0 or similar.


Can you create a raid10 with one drive missing and add it later? I 
know, I should try it when I get a machine free... but I'm being lazy today.


--
Bill Davidsen [EMAIL PROTECTED]
 Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over... Otto von Bismark 





Re: raid1 or raid10 for /boot

2008-02-06 Thread Bill Davidsen

Keld Jørn Simonsen wrote:

I understand that lilo and grub only can boot partitions that look like
a normal single-drive partition. And then I understand that a plain
raid10 has a layout which is equivalent to raid1. Can such a raid10
partition be used with grub or lilo for booting?
And would there be any advantages in this, for example better disk
utilization in the raid10 driver compared with raid?
  


I don't know about you, but my /boot sees zero use between boots;
efficiency and performance improvements strike me as a distinction without
a difference, while adding complexity without benefit is always a bad idea.


I suggest that you avoid having a learning experience and stick with 
raid1.


--
Bill Davidsen [EMAIL PROTECTED]
 Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over... Otto von Bismark 






Re: raid10 on three discs - few questions.

2008-02-06 Thread Jon Nelson
On Feb 6, 2008 12:43 PM, Bill Davidsen [EMAIL PROTECTED] wrote:

 Can you create a raid10 with one drive missing and add it later? I
 know, I should try it when I get a machine free... but I'm being lazy today.

Yes you can. With 3 drives, however, performance will be awful (at
least with layout far, 2 copies).

IMO raid10,f2 is a great balance of speed and redundancy.
It's faster than raid5 for reading, about the same for writing. It's
even potentially faster than raid0 for reading, actually.
With 3 disks one should be able to get 3.0 times the speed of one
disk, or slightly more, and each stripe involves only *one* disk
instead of 2 as it does with raid5.
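
(For reference, such an array can be created with something like the
following; the partition names are only an example:)

  mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=3 \
        /dev/sda1 /dev/sdb1 /dev/sdc1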

-- 
Jon


Re: recommendations for stripe/chunk size

2008-02-06 Thread Bill Davidsen

Keld Jørn Simonsen wrote:

Hi

I am looking at revising our howto. I see a number of places where a
chunk size of 32 kiB is recommended, and even recommendations on
maybe using sizes of 4 kiB. 

  
Depending on the raid level, a write smaller than the chunk size causes 
the chunk to be read, altered, and rewritten, vs. just written if the 
write is a multiple of chunk size. Many filesystems by default use a 4k 
page size and writes. I believe this is the reasoning behind the 
suggestion of small chunk sizes. Sequential vs. random and raid level 
are important here, there's no one size to work best in all cases.
My own take on that is that this really hurts performance. 
Normal disks have a rotation speed of between 5400 (laptop),
7200 (ide/sata) and 10000 (SCSI) rounds per minute, giving an average
spinning time for one round of 6 to 12 ms, and an average latency of half
this, that is 3 to 6 ms. Then you need to add head movement, which
is something like 2 to 20 ms - in total an average seek time of 5 to 26 ms,
averaging around 13-17 ms.

  
Having a write that is not some multiple of chunk size would seem to require a
read-alter-wait_for_disk_rotation-write, and for large sustained
sequential i/o using multiple drives helps transfer. For small random
i/o small chunks are good; I find little benefit to chunks over 256 or
maybe 1024k.
In about 15 ms you can read on current SATA-II (300 MB/s) or ATA/133
something like 600 to 1200 kB, at actual transfer rates of
80 MB/s on SATA-II and 40 MB/s on ATA/133. So to get some bang for the buck
and transfer some data you should have something like 256/512 kiB
chunks. With a transfer rate of 50 MB/s and a chunk size of 256 kiB,
giving a time of about 20 ms per transaction,
you should be able with random reads to transfer 12 MB/s - my
actual figure is about 30 MB/s, which is possibly because of the
elevator effect of the file system driver. With a size of 4 kB per chunk
you should have a time of 15 ms per transaction, or 66 transactions per
second, or a transfer rate of 250 kB/s. So 256 kB vs 4 kB speeds up
the transfer by a factor of 50.

  
If you actually see anything like this your write caching and readahead 
aren't doing what they should!



I actually think the kernel should operate with block sizes
like this and not with 4 kiB blocks. It is the readahead and the elevator
algorithms that save us from randomly reading 4 kB at a time.

  

Exactly, and nothing save a R-A-RW cycle if the write is a partial chunk.

I also see that there are some memory constraints on this.
Having maybe 1000 processes reading, as for my mirror service,
256 kiB buffers would be acceptable, occupying 256 MB RAM.
That is reasonable, and I could even tolerate 512 MB RAM used.
But going to 1 MiB buffers would be overdoing it for my configuration.

What would be the recommended chunk size for todays equipment?

  

I think usage is more important than hardware. My opinion only.


Best regards
Keld



--
Bill Davidsen [EMAIL PROTECTED]
 Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over... Otto von Bismark 






Re[4]: mdadm 2.6.4 : How i can check out current status of reshaping ?

2008-02-06 Thread Andreas-Sokov
Hello, Neil.

.
 Possible you have bad memory, or a bad CPU, or you are overclocking
 the CPU, or it is getting hot, or something.

It seems to me that all my problems started after I updated MDADM.
This server worked normally (though not as soft-raid) for more than 2-3 years.
For the last 6 months it has worked as soft-raid. All was normal; I even
successfully added a 4th hdd into the raid5 (when it started there were 3 hdds),
and that reshape passed fine.

Yesterday I ran memtest86 on this server and 10 passes completed WITHOUT any
errors.
The temperature of the server is about 25 degrees celsius.
No overclocking, everything set to default.

I really do not know what to do, because we need to grow our storage and we
cannot. Unfortunately, at this moment mdadm does not help us with this,
though we very much want it to.

 But you clearly have a hardware error.

 NeilBrown



-- 
Best regards,
Andreas-Sokov



Re: recommendations for stripe/chunk size

2008-02-06 Thread Wolfgang Denk
In message [EMAIL PROTECTED] you wrote:

  I actually  think the kernel should operate with block sizes
  like this and not with 4 kiB blocks. It is the readahead and the elevator
  algorithms that save us from randomly reading 4 kb a time.
 

 Exactly, and nothing save a R-A-RW cycle if the write is a partial chunk.

Indeed kernel page size is an important factor in such optimizations.
But you have to keep in mind that this is mostly efficient for (very)
large strictly sequential I/O operations only -  actual  file  system
traffic may be *very* different.

We implemented the option to select kernel page sizes of  4,  16,  64
and  256  kB for some PowerPC systems (440SPe, to be precise). A nice
graphics of the effect can be found here:

https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH, MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: [EMAIL PROTECTED]
You got to learn three things. What's  real,  what's  not  real,  and
what's the difference.   - Terry Pratchett, _Witches Abroad_


Re: raid10 on three discs - few questions.

2008-02-06 Thread Bill Davidsen

Jon Nelson wrote:
On Feb 6, 2008 12:43 PM, Bill Davidsen [EMAIL PROTECTED] wrote:


Can you create a raid10 with one drive missing and add it later? I
know, I should try it when I get a machine free... but I'm being
lazy today.


Yes you can. With 3 drives, however, performance will be awful (at 
least with layout far, 2 copies).



Well, the question didn't include being fast. ;-)

But if he really wants to create the array now and be able to add to it
later, it might still be useful, particularly if later is a short time
away, like when my other drive ships. Thanks for the input, I thought that
was possible, but reading code isn't the same as testing.

IMO raid10,f2 is a great balance of speed and redundancy.
it''s faster than raid5 for reading, about the same for writing. it's 
even potentially faster than raid0 for reading, actually.
With 3 disks one should be able to get 3.0 times the speed of one 
disk, or slightly more, and each stripe involves only *one* disk 
instead of 2 as it does with raid5.


I have used raid10 swap on 3 or more drives fairly often. Other than the 
Fedora rescue CD not using the space until I start it manually, I find 
it really fast, and helpful for huge image work.
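
(For example, a sketch of setting up such a swap area; the partitions and md
device name are assumptions:)

  mdadm --create /dev/md3 --level=10 --raid-devices=3 /dev/sda2 /dev/sdb2 /dev/sdc2
  mkswap /dev/md3
  swapon /dev/md3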


--
Bill Davidsen [EMAIL PROTECTED]
 Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over... Otto von Bismark 





Re: recommendations for stripe/chunk size

2008-02-06 Thread Bill Davidsen

Wolfgang Denk wrote:

In message [EMAIL PROTECTED] you wrote:
  

I actually  think the kernel should operate with block sizes
like this and not with 4 kiB blocks. It is the readahead and the elevator
algorithms that save us from randomly reading 4 kb a time.

  
  

Exactly, and nothing save a R-A-RW cycle if the write is a partial chunk.



Indeed kernel page size is an important factor in such optimizations.
But you have to keep in mind that this is mostly efficient for (very)
large strictly sequential I/O operations only -  actual  file  system
traffic may be *very* different.

  
That was actually what I meant by page size, that of the file system 
rather than the memory, ie. the block size typically used for writes. 
Or multiples thereof, obviously.

We implemented the option to select kernel page sizes of  4,  16,  64
and  256  kB for some PowerPC systems (440SPe, to be precise). A nice
graphics of the effect can be found here:

https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

  
I started reading that online and pulled down a copy to print; very neat
stuff. Thanks for the link.

Best regards,

Wolfgang Denk

  



--
Bill Davidsen [EMAIL PROTECTED]
 Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over... Otto von Bismark 





Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-02-06 Thread Janek Kozicki
Bill Davidsen said: (by the date of Wed, 06 Feb 2008 13:16:14 -0500)

 Janek Kozicki wrote:
  Justin Piszcz said: (by the date of Tue, 5 Feb 2008 17:28:27 -0500 
  (EST))
  writing on raid10 is supposed to be half the speed of reading. That's
  because it must write to both mirrors.
 

 ??? Are you assuming that writes to mirrored copies are done sequentially 
 rather than in parallel? Unless you have enough writes to saturate 
 something the effective speed approaches the speed of a single drive. I 
 just checked raid1 and raid5, writing 100MB with an fsync at the end. 
 raid1 leveled off at 85% of a single drive after ~30MB.

Hi,

In above context I'm talking about raid10 (not about raid1, raid0,
raid0+1, raid1+0, raid5 or raid6).

Of course writes are done in parallel. When each chunk has two
copies raid10 reads twice as fast as it writes. 

If each chunk has three copies, then writes are 1/3 the speed of reading.
If each chunk has a number of copies equal to the number of drives, then the
write speed drops down to that of a single drive - 1/Nth of the read speed.

But it's all just a theory. I'd like to see more benchmarks :-)

-- 
Janek Kozicki |


Re: mdadm 2.6.4 : How i can check out current status of reshaping ?

2008-02-06 Thread Janek Kozicki
Andreas-Sokov said: (by the date of Wed, 6 Feb 2008 22:15:05 +0300)

 Hello, Neil.
 
 .
  Possible you have bad memory, or a bad CPU, or you are overclocking
  the CPU, or it is getting hot, or something.
 
 It seems to me that all my problems started after I updated
 MDADM.

what is the update?

- you installed a new version of mdadm?
- you installed new kernel?
- something else?

- what was the version before, and what version is now?

- can you downgrade to previous version?


best regards
-- 
Janek Kozicki |


Re: raid1 or raid10 for /boot

2008-02-06 Thread Keld Jørn Simonsen
On Wed, Feb 06, 2008 at 01:52:11PM -0500, Bill Davidsen wrote:
 Keld Jørn Simonsen wrote:
 I understand that lilo and grub only can boot partitions that look like
 a normal single-drive partition. And then I understand that a plain
 raid10 has a layout which is equivalent to raid1. Can such a raid10
 partition be used with grub or lilo for booting?
 And would there be any advantages in this, for example better disk
 utilization in the raid10 driver compared with raid?
   
 
  I don't know about you, but my /boot sees zero use between boots;
  efficiency and performance improvements strike me as a distinction without
  a difference, while adding complexity without benefit is always a bad idea.
 
 I suggest that you avoid having a learning experience and stick with 
 raid1.

I agree with you, it was only a theoretical question.

Best regards
keld


Re: recommendations for stripe/chunk size

2008-02-06 Thread Keld Jørn Simonsen
On Wed, Feb 06, 2008 at 09:25:36PM +0100, Wolfgang Denk wrote:
 In message [EMAIL PROTECTED] you wrote:
 
   I actually  think the kernel should operate with block sizes
   like this and not with 4 kiB blocks. It is the readahead and the elevator
   algorithms that save us from randomly reading 4 kb a time.
  
 
  Exactly, and nothing save a R-A-RW cycle if the write is a partial chunk.
 
 Indeed kernel page size is an important factor in such optimizations.
 But you have to keep in mind that this is mostly efficient for (very)
 large strictly sequential I/O operations only -  actual  file  system
 traffic may be *very* different.
 
 We implemented the option to select kernel page sizes of  4,  16,  64
 and  256  kB for some PowerPC systems (440SPe, to be precise). A nice
 graphics of the effect can be found here:
 
 https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

Yes, that is also what I would expect, for sequential reads.
Random writes of small data blocks, the kind of thing done in big
databases, should show another picture, as others have also described.

If you look at a single disk, would you get improved performance with
asynchronous IO?

I am a bit puzzled about my SATA-II performance: nominally I could get
300 MB/s on SATA-II, but I only get about 80 MB/s. Why is that?
I thought it was because of latency with synchronous reads.
Ie, when a chunk is read, you need to complete the IO operation, and then
issue a new one. In the meantime, while the CPU is doing these
calculations, the disk has spun a little, and to get the next data chunk
we need to wait for the disk to spin around to have the head positioned
over the right data place on the disk surface. Is that so? Or does the
controller take care of this, reading the rest of the not-yet-requested
track into a buffer, which can then be delivered next time? Modern disks
often have buffers of about 8 or 16 MB. I wonder why they don't have
bigger buffers.

Anyway, why does a SATA-II drive not deliver something like 300 MB/s?

best regards
keld


Re: raid10 on three discs - few questions.

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
 

  4. Would it be possible to later '--grow' the array to use 4 discs in
 raid10 ? Even with far=2 ?
 
  
 
  No.
 
  Well if by later you mean in five years, then maybe.  But the
  code doesn't currently exist.

 
 That's a reason to avoid raid10 for certain applications, then, and go 
 with a more manual 1+0 or similar.

Not really.  You cannot reshape a raid0 either.

 
 Can you create a raid10 with one drive missing and add it later? I 
 know, I should try it when I get a machine free... but I'm being lazy today.

Yes, but then the array would be degraded and a single failure could
destroy your data.

NeilBrown


Re: Deleting mdadm RAID arrays

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
 
 % cat /proc/partitions
 major minor  #blocks  name
 
    8     0  390711384 sda
    8     1  390708801 sda1
    8    16  390711384 sdb
    8    17  390708801 sdb1
    8    32  390711384 sdc
    8    33  390708801 sdc1
    8    48  390710327 sdd
    8    49  390708801 sdd1
    8    64  390711384 sde
    8    65  390708801 sde1
    8    80  390711384 sdf
    8    81  390708801 sdf1
    3    64   78150744 hdb
    3    65    1951866 hdb1
    3    66    7815622 hdb2
    3    67    4883760 hdb3
    3    68          1 hdb4
    3    69     979933 hdb5
    3    70     979933 hdb6
    3    71   61536951 hdb7
    9     1  781417472 md1
    9     0  781417472 md0

So all the expected partitions are known to the kernel - good.

 
 /etc/udev/rules.d % cat /proc/mdstat
 Personalities : [raid1] [raid6] [raid5] [raid4]
 md0 : active(auto-read-only) raid5 sdc1[0] sde1[3](S) sdd1[1]
   781417472 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
 
 md1 : active(auto-read-only) raid5 sdf1[0] sdb1[3](S) sda1[1]
   781417472 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
 
 md0 consists of sdc1, sde1 and sdd1 even though when creating I asked it to 
 use d_1, d_2 and d_3 (this is probably written on the particular 
 disk/partition itself,
 but I have no idea how to clean this up - mdadm --zero-superblock /dev/d_1
 again produces mdadm: Couldn't open /dev/d_1 for write - not zeroing)
 

I suspect it is related to the (auto-read-only).
The array is degraded and has a spare, so it wants to do a recovery to
the spare.  But it won't start the recovery until the array is not
read-only.

But the recovery process has partly started (you'll see an md1_resync
thread) so it won't let go of any failed devices at the moment.
If you 
  mdadm -w /dev/md0

the recovery will start.
Then
  mdadm /dev/md0 -f /dev/d_1

will fail d_1, abort the recovery, and release d_1.

Then
  mdadm --zero-superblock /dev/d_1

should work.

It is currently failing with EBUSY - --zero-superblock opens the
device with O_EXCL to ensure that it isn't currently in use, and as
long as it is part of an md array, O_EXCL will fail.
I should make that more explicit in the error message.

NeilBrown


Re: recommendations for stripe/chunk size

2008-02-06 Thread Iustin Pop
On Thu, Feb 07, 2008 at 01:31:16AM +0100, Keld Jørn Simonsen wrote:
 Anyway, why does a SATA-II drive not deliver something like 300 MB/s?

Wait, are you talking about a *single* drive?

In that case, it seems you are confusing the interface speed (300MB/s)
with the mechanical read speed (80MB/s). If you are asking why is a
single drive limited to 80 MB/s, I guess it's a problem of mechanics.
Even with NCQ or big readahead settings, ~80-~100 MB/s is the highest
I've seen on 7200 RPM drives. And yes, there is no wait until the CPU
processes the current data until the drive reads the next data; drives
have a builtin read-ahead mechanism.

Honestly, I have 10x as many problems with the low random I/O throughput
rather than with the (high, IMHO) sequential I/O speed.
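
(A quick way to see what a single drive actually sustains, as opposed to what
the interface is rated for, assuming hdparm is available:)

  hdparm -t /dev/sda    # timed buffered disk reads: the mechanical, sequential rate
  hdparm -T /dev/sda    # timed cached reads: roughly the bus/cache rate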

regards,
iustin


Re: recommendations for stripe/chunk size

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
 
 We implemented the option to select kernel page sizes of  4,  16,  64
 and  256  kB for some PowerPC systems (440SPe, to be precise). A nice
 graphics of the effect can be found here:
 
 https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

Thanks for the link!

quote
The second improvement is to remove a memory copy that is internal to the MD 
driver. The MD
driver stages strip data ready to be written next to the I/O controller in a 
page size pre-
allocated buffer. It is possible to bypass this memory copy for sequential 
writes thereby saving
SDRAM access cycles.
/quote

I sure hope you've checked that the filesystem never (ever) changes a
buffer while it is being written out.  Otherwise the data written to
disk might be different from the data used in the parity calculation
:-)

And what are the "Second memcpy" and "First memcpy" in the graph?
I assume one is the memcpy mentioned above, but what is the other?

NeilBrown


Re: recommendations for stripe/chunk size

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
 Keld Jørn Simonsen wrote:
  Hi
 
  I am looking at revising our howto. I see a number of places where a
  chunk size of 32 kiB is recommended, and even recommendations on
  maybe using sizes of 4 kiB. 
 

 Depending on the raid level, a write smaller than the chunk size causes 
 the chunk to be read, altered, and rewritten, vs. just written if the 
 write is a multiple of chunk size. Many filesystems by default use a 4k 
 page size and writes. I believe this is the reasoning behind the 
 suggestion of small chunk sizes. Sequential vs. random and raid level 
 are important here, there's no one size to work best in all cases.

Not in md/raid.

RAID4/5/6 will do a read-modify-write if you are writing less than one
*page*, but then they often do a read-modify-write anyway for parity
updates.

No level will ever read a whole chunk just because it is a chunk.

To answer the original question:  The only way to be sure is to test
your hardware with your workload with different chunk sizes.
But I suspect that around 256K is good on current hardware.
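
(A rough sketch of such a test; it destroys any data on the listed partitions,
and the device names, raid level and sizes are only examples:)

  for c in 64 128 256 512 1024; do
      mdadm --create /dev/md9 --level=5 --chunk=$c --raid-devices=4 \
            --assume-clean --run /dev/sd[bcde]1
      dd if=/dev/zero of=/dev/md9 bs=1M count=2048 oflag=direct
      mdadm --stop /dev/md9
      mdadm --zero-superblock /dev/sd[bcde]1
  done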

NeilBrown


Re: recommendations for stripe/chunk size

2008-02-06 Thread Neil Brown
On Thursday February 7, [EMAIL PROTECTED] wrote:
 
 Anyway, why does a SATA-II drive not deliver something like 300 MB/s?


Are you serious?

A high end 15000RPM enterprise grade drive such as the Seagate
Cheetah® 15K.6 only delivers 164MB/sec.

The SATA Bus might be able to deliver 300MB/s, but an individual drive
would be around 80MB/s unless it is really expensive.

(or was that yesterday?  I'm having trouble keeping up with the pace
 of improvement :-)

NeilBrown