Re: [Linux-HA] Antw: Re: Q: unmanaged MD-RAID & auto-recovery

Dimitri Maziuk Wed, 30 Nov 2011 06:56:16 -0800

On 11/30/2011 2:01 AM, Ulrich Windl wrote:
>>>> Dimitri Maziuk<[email protected]>  wrote on 29.11.2011 at 19:36 in
> message <[email protected]>:
>> On 11/29/2011 07:49 AM, Lars Marowsky-Bree wrote:
>>
>>> (But the mdadm operations the RA does also shouldn't cause data
>>> corruption. That strikes me as an MD bug.)
>>
>> If you repeatedly try to re-sync with a dying disk, with each resync
>> interrupted by i/o error, you will get data corruption sooner or later.
>> It's only MD bug in a sense that MD can't actually stop you from
>> shooting yourself.
>
> I'd like to know more details: Which disk has an I/O error: source or
> destination of the sync. How is data corruption created?


Well that's the point: if you have 2 disks, and neither has failed yet, 
how do you pick the one that isn't failing?

The specific failure mode I'm talking about is a disk that is busy
relocating bad sectors. Until the SMART counter hits the threshold value
the disk is "not failed", but you'll see sata timeouts/resets in
/var/log/messages with spiking i/o wait and those "all sorts of hangs"
Lars mentioned. If mdadm decides to use that disk as the source, you have
a race: either SMART will fail the disk before it starts dropping bits or
develops an unrelocatable bad sector, or said bad sectors will get copied
to the mirror disk.
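
To make the "not failed yet, but visibly struggling" state concrete, here is a rough sketch of scanning syslog lines for the timeouts/resets described above and counting them per ATA port. The message fragments in the regex are assumptions about typical libata wording and vary by kernel version, and the function names and threshold are made up for illustration; this is not something the RA or mdadm does.

```python
import re
from collections import Counter

# Assumed libata message fragments that tend to accompany a struggling disk.
# Exact wording depends on the kernel version; adjust to what your logs show.
ATA_TROUBLE = re.compile(
    r"\b(ata\d+)(?:\.\d+)?: "
    r"(?:hard resetting link|exception Emask|failed command)"
)


def count_ata_trouble(lines):
    """Return a Counter mapping ATA port (e.g. 'ata3') to trouble-line count."""
    counts = Counter()
    for line in lines:
        match = ATA_TROUBLE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def suspect_ports(lines, threshold=3):
    """Ports with at least `threshold` trouble lines (threshold is arbitrary)."""
    counts = count_ata_trouble(lines)
    return sorted(port for port, n in counts.items() if n >= threshold)
```

Feeding it the open /var/log/messages file (any iterable of lines works) gives a list of ports worth a closer look before letting a resync use them as the source. Mapping an ataN port back to a member of the array is left out, and a clean log says nothing about a disk that is silently returning bad data.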

Granted, I've only seen data corruption on sata raid-1 once so far. But 
once is enough.

(Rumour has it, it's worse with raid-5 since that only protects from 
data loss if all chunks are committed to disk at once and not stuck in a 
write cache waiting for the elevator.)

Dima
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
