On 11/30/2011 2:01 AM, Ulrich Windl wrote:
>>>> Dimitri Maziuk<[email protected]> wrote on 29.11.2011 at 19:36 in
> message <[email protected]>:
>> On 11/29/2011 07:49 AM, Lars Marowsky-Bree wrote:
>>
>>> (But the mdadm operations the RA does also shouldn't cause data
>>> corruption. That strikes me as an MD bug.)
>>
>> If you repeatedly try to re-sync with a dying disk, with each resync
>> interrupted by i/o error, you will get data corruption sooner or later.
>> It's only MD bug in a sense that MD can't actually stop you from
>> shooting yourself.
>
> I'd like to know more details: Which disk has an I/O error: source
> or destination of the sync. How is data corruption created?

Well, that's the point: if you have 2 disks and neither has failed yet,
how do you pick the one that isn't failing?

The specific failure mode I'm talking about is a disk that is busy
relocating bad sectors. Until the SMART counter hits the threshold value
the disk is "not failed", but you'll see SATA timeouts/resets in
/var/log/messages, along with spiking I/O wait and those "all sorts of
hangs" Lars mentioned. If mdadm decides to use that disk as the source,
you have a race: either SMART fails the disk before it starts dropping
bits or develops an unrelocatable bad sector, or said bad sectors get
copied to the mirror disk.

Granted, I've only seen data corruption on SATA RAID-1 once so far. But
once is enough. (Rumour has it that it's worse with RAID-5, since that only
protects from data loss if all chunks are committed to disk at once and
not stuck in a write cache waiting for the elevator.)

Dima
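
P.S. To make the "not failed yet, but relocating" state easier to spot, here is a rough sketch that reads the attribute table printed by `smartctl -A` and reports non-zero raw counts for the relocation-related attributes. The sample text is made up for illustration, the function names are hypothetical, and the choice of attributes to watch is my assumption. A non-zero count is only a hint that a disk may be a poor resync source, not a verdict.

```python
# Attributes assumed (not taken from the thread) to indicate sector relocation trouble.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

# Hypothetical sample of `smartctl -A /dev/sdX` output, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       9123
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def parse_smart_attributes(text):
    """Map attribute name -> integer raw value for each table row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            attrs[fields[1]] = int(fields[9])
        except ValueError:
            # Some raw values are not plain integers; skip those rows.
            continue
    return attrs


def suspect_counters(attrs):
    """Return the watched attributes whose raw count is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


if __name__ == "__main__":
    print(suspect_counters(parse_smart_attributes(SAMPLE)))
```

Running this against both mirror halves before forcing a resync at least tells you whether one of them is already relocating sectors, even though SMART still reports it as healthy.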
