I agree with Steve - your best bet is to treat this as a warning and
expect to replace both drives in the near future.
<!> Why is md2 still a valid raidset? That's odd.
The RAID has succeeded in allowing you to plan this change, rather than
losing access to your files.
Since it's a software RAID, I suggest you plan on buying two new, larger
drives.
First try Steve's idea and just re-add the drive to the array.
It shouldn't hurt anything.
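Something like this should do it, assuming /dev/sdb is the suspect drive
and the layout matches the /proc/mdstat output quoted below (check the
exact device names on your box first):
mdadm /dev/md1 --remove /dev/sdb2   # sdb2 is marked (F), so drop it first
mdadm /dev/md1 --add /dev/sdb2
mdadm /dev/md0 --add /dev/sdb1
mdadm /dev/md3 --add /dev/sdb4
cat /proc/mdstat                    # then see whether they start resyncing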
Then there are two ways to progress...
0 Boot in single user mode
1 Add one new drive to the machine and partition it with a similar layout,
making the partitions larger as appropriate (example commands after step 10).
2 Then use
mdadm --add /dev/md3 /dev/sdb4
mdadm --add /dev/md2 /dev/sdb3
mdadm --add /dev/md1 /dev/sdb2
mdadm --add /dev/md0 /dev/sdb1
sysctl -w dev.raid.speed_limit_max=99999999
3 While this is happening run
watch --int 10 cat /proc/mdstat
Wait until all the arrays have finished syncing
4 If you boot off this raidset you'll need to reinstall a boot loader on
each drive
5 Shut the machine down and remove the remaining 320 GB drive.
6 Install the other new drive, then boot.
7 Partition the other new drive the same as the first big drive
8 Repeat steps 2 and 3 but use sda rather than sdb
Once they've finished syncing you can grow the md devices and filesystems
to their full available space (see the examples after step 10)
9 Do the boot loader install onto both drives again
10 Then you can reboot and it should all be good.
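To make step 1, the growing and the boot loader steps concrete, here's
roughly what I'd run - the device names and filesystem type are just
examples, so adjust to suit:
# Step 1: partition the new drive with fdisk (or parted), making the
# partitions as large as you like and setting their type to fd
# (Linux raid autodetect).
fdisk /dev/sdb
# Once both new drives are in and synced, grow each md device out to the
# full size of its new partitions, then grow the filesystem on top
# (assuming ext3 here):
mdadm --grow /dev/md3 --size=max
resize2fs /dev/md3
# Boot loader onto both drives so either one can boot on its own:
grub-install /dev/sda
grub-install /dev/sdb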
The other way of doing it is to:
1 Add both new drives
2 Create a new single md device
3 Create a PV, add it to a VG, then create individual LVs as large as you
want. Leave some spare space so you can grow individual LVs later (rough
sketch after step 5)
4 Use something like rsync to copy the files from the old md to the new
LVs
5 Enjoy.
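A rough sketch of that second approach - the array, VG and LV names below
are made up, so substitute your own:
# Build a new mirror from the two new drives (sdc1/sdd1 just as examples)
mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
# Layer LVM on top of it
pvcreate /dev/md4
vgcreate vg0 /dev/md4
lvcreate -L 20G -n root vg0
lvcreate -L 250G -n home vg0    # leave free space in the VG for later
# Make filesystems and copy the data across
mkfs.ext3 /dev/vg0/root
mkfs.ext3 /dev/vg0/home
mount /dev/vg0/home /mnt
rsync -aHx /home/ /mnt/
# Growing an LV later is just:
lvextend -L +50G /dev/vg0/home
resize2fs /dev/vg0/home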
The old 320 GB drive that still works can be relegated to a Windows box or
something else that's unimportant. It could work for another 10 years or
it could fail tomorrow... who knows.
steve wrote, On 22/10/09 16:26:
On Thu, 2009-10-22 at 16:00 +1300, Roger Searle wrote:
Hi, I noticed by chance that I have a failed drive in a RAID1 array on a
file server, and I'm seeking some guidance or confirmation that I'm on the
right track to resolve this. Since more than one partition has failed it
seems I'll need to buy a new disk rather than attempt any repair, so I may
as well get a new pair: replace the failed disk first, then once that's
resolved replace the other. Yes, I have backups of the valuable data on
other drives, both in the same machine (not in this array) and elsewhere.
I also need to set up better monitoring, because the failure began a few
weeks ago. But for now...
There are so many levels of electronics that you go through to get to
the platter these days that if you see even a single hard error, then
now's a good time to use it only for skeet shooting...
The failed disk is 320 GB, and contains (mirrored) /, home, and swap.
Presumably I could buy much larger disks, and would need to repartition
prior to adding it back into the array?
Best to use the same make/model of disk if possible. Speed differences
between the two can make it unreliable (that's an exaggeration, but you
know what I mean).
The partitions should be at least the same size but could be much larger
without any problem?
If you want to add more space, then I'd buy a new pair of bigger disks,
create a new set, and copy everything across. Reason I'm saying that is
that your 2 existing disks are probably exactly the same make/model,
with similar serial numbers??? Guess what's going to fail next (:
Last time I looked, 1TB was around the $125 mark.
There is some configuration data in mdadm.conf, including UUIDs of the
arrays, and this doesn't match the UUIDs in fstab. Do I need to be
concerned about this sort of thing, or can I just use mdadm or other tools
to rebuild the arrays and have that update any relevant config files?
mdadm.conf is pretty redundant, I think. Arrays tend to be automagically
configured at boot time these days. Building a new RAID array *should*
add the correct data to it, although I have a grand old time with a Hardy
server of mine in this respect.
Is there anything else I should be looking out for or preparing?
Don't forget to add a bootstrap to each new disk if this is going to
contain the boot partition as well.
Thanks for any pointers anyone may care to share.
You could try
mdadm --add /dev/md3 /dev/sdb4
and see whether it resilvers. Looking in dmesg is the best place to check
for hard errors.
hth,
Steve
A couple of examples of the DegradedArray and Fail event emails sent to
root recently follow:
To: [email protected]
Subject: Fail event on /dev/md1:jupiter
Date: Wed, 21 Oct 2009 17:42:49 +1300
This is an automatically generated mail message from mdadm
running on jupiter
A Fail event had been detected on md device /dev/md1.
It could be related to component device /dev/sdb2.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]
md3 : active raid1 sda4[0]
290977216 blocks [2/1] [U_]
md2 : active raid1 sda3[0] sdb3[1]
104320 blocks [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[2](F)
1951808 blocks [2/1] [U_]
md0 : active raid1 sda1[0]
19534912 blocks [2/1] [U_]
unused devices: <none>
Subject: DegradedArray event on /dev/md3:jupiter
Date: Wed, 07 Oct 2009 08:26:49 +1300
This is an automatically generated mail message from mdadm
running on jupiter
A DegradedArray event had been detected on md device /dev/md3.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]
md3 : active raid1 sda4[0]
290977216 blocks [2/1] [U_]
md2 : active raid1 sda3[0] sdb3[1]
104320 blocks [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[1]
1951808 blocks [2/2] [UU]
md0 : active raid1 sda1[0]
19534912 blocks [2/1] [U_]
unused devices: <none>
Cheers,
Roger
--
Craig Falconer
The Total Team - Managed Systems
Office: 0800 888 326 / +643 974 9128
Email: [email protected]
Web: http://www.totalteam.co.nz/