On 23/02/2012 10:41 AM, Nico Kadel-Garcia wrote:
On Wed, Feb 22, 2012 at 4:38 PM, Bill Maidment <[email protected]> wrote:

     > In (1) above, are they replying that you can't "--fail", "--remove",
     > and then "--add" the same disk or that you can't "--fail" and
     > "--remove" a disk, replace it, and then can't "--add" it because it's
     > got the same "X"/"XY" in "sdX"/"sdaXY" as the previous, failed disk?
     >
     >

    Now that I've had my coffee fix, I've got my sanity back.
    I have used the following sequence of commands to remove and re-add
    a disk to a running RAID1 array:
    mdadm /dev/md3 -f /dev/sdc1
    mdadm /dev/md3 -r /dev/sdc1
    mdadm --zero-superblock /dev/sdc1
    mdadm /dev/md3 -a /dev/sdc1

    It works as expected. I just found the original error message a bit
    confusing when it referred to making the disk a "spare". It would
    seem that earlier versions of the kernel did that automatically.
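A quick way to confirm the re-added disk is actually resyncing is to read /proc/mdstat. The snippet below is a sketch: the sample mdstat text is embedded so the parsing is runnable as-is, and the device names (/dev/md3, /dev/sdc1) follow the sequence above; on a live system you would read /proc/mdstat (or run `mdadm --detail /dev/md3`) directly.

```shell
# Sample /proc/mdstat content (illustrative; on a real system: cat /proc/mdstat)
mdstat='md3 : active raid1 sdc1[2] sdb1[0]
      1953382336 blocks [2/1] [U_]
      [==>..................]  recovery = 12.6% (246653952/1953382336) finish=183.5min speed=154960K/sec'

# Pull out the rebuild progress; [U_] above means one of two mirrors is up
printf '%s\n' "$mdstat" | grep -o 'recovery = [0-9.]*%'
```

Once recovery finishes, the status line should read [UU] and the recovery bar disappears.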

Interesting! I have mixed feelings about RAID, especially for simple
RAID1 setups. I'd rather use the second drive as an rsnapshot-based
backup drive, usually in read-only mode. That allows me to recover
files that I've accidentally screwed up or deleted in the recent past,
which occurs far more often than drive failures. And it puts different
wear and tear on the hard drive: there's nothing like having all the
drives in a RAID set start failing at almost the same time, before drive
replacement can occur. This has happened to me before and is actually
pretty well described in a Google white paper at
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf.
However, in this case, I'd tend to agree with the idea that a RAID1
pair should not be automatically re-activated on reboot. If one drive
starts failing, it should be kept offline until replaced, and trying to
outguess this process at boot time without intervention seems fraught.

I think the more important point here is that with RAID1 under the Linux kernel, if ANY error is reported - or even a mismatched sector turns up in general day-to-day activity - the implemented fix is effectively a guess at which mirror is correct.

This can easily lead to good data being overwritten with bad.

I also agree that this is a general problem with RAID1 - as there is no parity to check against, the correct value cannot be computed in any way, shape or form - leaving a guess as the only sane way to fix the problem.

The theory is that the filesystem will notice the error and fix it - although you may well end up with a block of corrupted data as a result. The only theoretical way to fix this would be to add parity - and then you're into RAID5/6 territory...
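For the curious, the kernel's md driver does expose the mismatch problem through sysfs: you can trigger a read-and-compare scrub and see how many sectors disagree, though it still cannot tell you which copy is right. A sketch follows - the sysfs commands are shown as comments because they need root and a real array, and the sample counter value (128) is purely illustrative:

```shell
# On a live system, as root:
#   echo check > /sys/block/md3/md/sync_action   # start a read-and-compare scrub
#   cat /sys/block/md3/md/mismatch_cnt           # non-zero => the mirrors disagree

# Interpreting the counter (sample value stands in for the sysfs read):
mismatch_cnt=128
if [ "$mismatch_cnt" -gt 0 ]; then
    echo "mirrors disagree on $mismatch_cnt sectors"
else
    echo "mirrors consistent"
fi
```

Note that `check` only reports mismatches; `repair` would overwrite one copy with the other - which is exactly the arbitrary choice described above.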

--
Steven Haigh

Email: [email protected]
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299
