Re: [Lustre-discuss] recovery from multiple disks failure on the same md

Adrian Ulrich Mon, 07 May 2012 12:18:21 -0700

Hi,


> An OST (RAID 6: 8+2, 1 spare) had 2 disk failures at almost the same time.
> While it was recovering, another disk failed, so the recovery procedure seems
> to have halted,

So did the md array stop itself on the 3rd disk failure (or at least turn
read-only)?

If it did, you might be able to get it running again without catastrophic
corruption.


This is what I would try (without any warranty!):


 -> Forget about the 2 syncing spares

 -> Take the 3rd failed disk and attach it to some PC

 -> Copy as much data as possible to a new spare using dd_rescue
    (-r might help)

 -> Put the drive with the fresh copy (= the good, new drive) into the array,
    then assemble and start it.
    Use --force if mdadm complains about outdated metadata.
    (Starting it read-only for now would also be a good idea.)

 -> Add a new spare to the array and sync it as fast as possible so that you
    have at least 1 parity disk again.

 -> Run 'fsck -n /dev/mdX' to see how badly damaged your filesystem is.
    If you think that fsck can fix the errors (and will not cause more
    damage), run it without '-n'.

 -> Add the 2nd parity disk, sync it, mount the filesystem and pray.
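
Roughly, the steps above could look like the sketch below. Treat it as a
checklist, not a script to fire and forget: it only prints the commands unless
DRY_RUN=0 is set, and in a real recovery you would run them by hand, one at a
time, checking the output after each. Every device name is a hypothetical
placeholder, the 'run' helper and the variable names are mine, and the
dd_rescue step really happens on the rescue PC, not on the OSS. The
--readonly, --run, --readwrite and --wait options are standard mdadm options.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed, unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical device names: adjust every one of them to your system.
MD=/dev/md0        # the damaged array
FAILED=/dev/sdx    # 3rd failed disk (attached to the rescue PC)
COPY=/dev/sdy      # fresh disk that receives the dd_rescue copy
SURVIVORS="/dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg"
SPARE1=/dev/sdh    # becomes the 1st parity disk
SPARE2=/dev/sdi    # becomes the 2nd parity disk
FORCE=""           # set to --force only if mdadm complains about metadata

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

recover_array() {
    # Copy whatever is still readable from the failed disk to the new one.
    run dd_rescue "$FAILED" "$COPY"
    # Assemble read-only from the 7 good disks plus the copy (8 of 10).
    run mdadm --assemble $FORCE --readonly --run "$MD" $SURVIVORS "$COPY"
    # Allow writes again, add the first spare and wait for the sync.
    run mdadm --readwrite "$MD"
    run mdadm "$MD" --add "$SPARE1"
    run mdadm --wait "$MD"
    # Look at the filesystem damage without changing anything.
    run fsck -n "$MD"
    # Add the second spare and wait for the sync.
    run mdadm "$MD" --add "$SPARE2"
    run mdadm --wait "$MD"
}

recover_array
```

With the defaults this just prints the eight commands in order, so you can
review them before typing anything for real.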


The amount of data corruption will depend on how successful dd_rescue is: you
are probably in luck if it only fails to read a few sectors.


And I agree with Kevin:

If you have a support contract: ask them to fix it.
(..and if you have enough hardware + time: create a backup of ALL drives in the 
failed raid via 'dd' before touching anything!)
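
A minimal sketch of such a backup loop, again printing the commands unless
DRY_RUN=0 is set. The disk list and the /backup target are hypothetical; the
target needs enough free space for one full image per member. Note that plain
dd stops at the first read error, so a drive that is already failing is
better copied with dd_rescue.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed, unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical values: list ALL members of the failed array here.
DISKS="/dev/sda /dev/sdb /dev/sdc"
BACKUP_DIR=/backup

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

backup_disks() {
    for disk in $DISKS; do
        # One raw image per member, e.g. /backup/sda.img
        run dd if="$disk" of="$BACKUP_DIR/$(basename "$disk").img" bs=1M
    done
}

backup_disks
```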


I'd also recommend starting periodic scrubbing: we do this once per month at
low priority (~5 MB/s), with little impact on the users.
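
One way to set this up on Linux md is through the array's sysfs files; this is
a sketch, not necessarily our exact setup. The function name 'start_scrub' and
the md0 path are assumptions. sync_speed_max is given in KB/s, so 5000 is
about 5 MB/s, and writing 'check' to sync_action starts a read-and-compare
pass over the array.

```shell
#!/bin/sh
# start_scrub MD_SYSFS_DIR [MAX_KBPS]
#   MD_SYSFS_DIR: the array's md directory, e.g. /sys/block/md0/md
#   MAX_KBPS:     upper limit for the check speed in KB/s (default 5000)
start_scrub() {
    md_dir=$1
    max_kbps=${2:-5000}
    # Cap the speed first, then kick off the check.
    echo "$max_kbps" > "$md_dir/sync_speed_max" || return 1
    echo check > "$md_dir/sync_action"
}

# Example (as root), e.g. from a monthly cron job:
#   start_scrub /sys/block/md0/md 5000
# When the check is done, mismatch_cnt in the same directory shows how many
# sectors did not match.
```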


Regards and good luck,
 Adrian