Hi,
> A OST (raid 6: 8+2, spare 1) had 2 disk failures almost at the same time.
> While recovering it, another disk failed. so recovering procedure seems to be
> halt,
So did the md array stop itself on the third disk failure (or at least turn
read-only)?
If it did, you might be able to get it running again without catastrophic
corruption.
This is what I would try (without any warranty!):
-> Forget about the 2 syncing spares
-> Take the third failed disk and attach it to some PC
-> Copy as much data as possible to a new spare using dd_rescue
(-r, which copies in reverse direction, might help)
-> Put the drive with the fresh copy (= the good, new drive) into the array
and assemble + start it.
Use --force if mdadm complains about outdated metadata.
(and starting it as 'readonly' for now would also be a good idea)
-> Add a new spare to the array and sync it as fast as possible to get at
least 1 parity disk.
-> Run 'fsck -n /dev/mdX' to see how badly damaged your filesystem is.
If you think that fsck can fix the errors (and will not cause more
damage), run it without '-n'
-> Add the 2nd parity disk, sync it, mount the filesystem and pray.
The amount of data corruption will depend on how well dd_rescue did: you
are probably lucky if it only failed to read a few sectors.
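
To make the order of the steps above easier to follow, here is a rough
dry-run sketch. It only prints the commands, it does not execute them. All
device names (/dev/mdX, /dev/sdX, /dev/sdY, the member list, the two spares)
and the speed_limit_min value are placeholders I made up, so adjust them to
your setup and check every command against your mdadm and dd_rescue versions
before running anything for real.

```shell
#!/bin/sh
# Dry-run sketch of the recovery order described above.
# Every device name and the speed value are placeholders (assumptions).
MD=/dev/mdX                    # the damaged array
FAILED=/dev/sdX                # third failed disk, attached to a rescue machine
COPY=/dev/sdY                  # fresh disk that receives the copy
MEMBERS="/dev/sdA /dev/sdB"    # placeholder: list ALL remaining good members
SPARE1=/dev/sdP                # first new spare
SPARE2=/dev/sdQ                # second new spare

# Print instead of execute. Change the body to "$@" to really run commands.
run() { echo "+ $*"; }

recovery_plan() {
    # Copy whatever is still readable from the failed disk.
    run dd_rescue "$FAILED" "$COPY"
    # Assemble with the copy in place of the failed disk, read-only at first.
    run mdadm --assemble --force --readonly "$MD" $MEMBERS "$COPY"
    # The array has to be writable again before a spare can be synced.
    run mdadm --readwrite "$MD"
    # Raise the minimum resync speed (KiB/s) so the first spare syncs quickly.
    run sysctl -w dev.raid.speed_limit_min=50000
    run mdadm "$MD" --add "$SPARE1"
    # Read-only filesystem check: reports damage, changes nothing.
    run fsck -n "$MD"
    # Second spare, restoring full RAID 6 redundancy.
    run mdadm "$MD" --add "$SPARE2"
}

recovery_plan
```

Printing first is deliberate: with only 8 of 10 members left there is no
redundancy, so it is worth reading the exact command lines before any of
them touches a disk.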
And I agree with Kevin:
If you have a support contract: ask them to fix it.
(..and if you have enough hardware + time: create a backup of ALL drives in the
failed raid via 'dd' before touching anything!)
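
A minimal sketch of what such a backup could look like. It only prints one
dd command per disk. The disk list and the /backup directory are
placeholders (assumptions), and the target needs enough room for a full
image of every drive.

```shell
#!/bin/sh
# Dry-run sketch: print one imaging command per member disk.
# DISKS and BACKUP_DIR are placeholders, not taken from the original report.
DISKS="/dev/sdA /dev/sdB /dev/sdC"
BACKUP_DIR=/backup

image_cmds() {
    for disk in $DISKS; do
        name=$(basename "$disk")
        # conv=noerror,sync keeps going on read errors and pads the
        # unreadable block with zeros so offsets stay aligned.
        echo "dd if=$disk of=$BACKUP_DIR/$name.img bs=64k conv=noerror,sync"
    done
}

image_cmds
```

For a drive that already throws read errors, dd_rescue is probably the
better tool, since plain dd zero-fills a whole block for every failed read.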
I'd also recommend starting periodic scrubbing: we do this once per month at
low priority (~5 MB/s) with little impact on the users.
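
For illustration, a throttled scrub can be started through the md sysfs
files roughly like this. This is a sketch, not our actual script: the
function name, the md0 device and the cron path in the comments are made up,
and 5000 is my approximation of ~5 MB/s (the kernel takes the limit in
KiB/s).

```shell
#!/bin/sh
# Sketch: start a low-priority consistency check on an md array.
# $1 = the array's md sysfs directory, e.g. /sys/block/md0/md (assumption)
# $2 = upper speed limit in KiB/s
start_scrub() {
    md_dir=$1
    limit=$2
    # Cap the check speed so users barely notice it.
    echo "$limit" > "$md_dir/sync_speed_max"
    # "check" reads all stripes and counts mismatches without rewriting them.
    echo check > "$md_dir/sync_action"
}

# Example call (needs root, left commented out on purpose):
#   start_scrub /sys/block/md0/md 5000
# Hypothetical monthly cron entry, 03:00 on the 1st of each month:
#   0 3 1 * * /usr/local/sbin/md-scrub.sh
```

Progress shows up in /proc/mdstat while the check runs, and the mismatch
count ends up in the mismatch_cnt file in the same sysfs directory.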
Regards and good luck,
Adrian