Hi,

Les Mikesell wrote on 2011-04-28 23:15:52 -0500 [Re: [BackupPC-users] RAID and offsite]:
> On 4/28/11 9:50 PM, Holger Parplies wrote:
> > I'm sure that's a point where we'll all disagree with each other :-).
> >
> > Personally, I wouldn't use a common set of disks for normal backup
> > operation and offsite backups. [...]
>
> I don't think there is anything predictable about disk failure. Handling
> them is probably bad. Normal (even heavy) use doesn't seem to matter
> unless maybe they overheat.
well, age does matter at *some* point, as does heat. Unless you proactively
replace the disks before that point is reached, they will likely all be "old"
when the first one fails. Sure, if the first disk fails after a few months,
the others will likely be ok (though I've had a set of 15 identical disks of
which about 10 failed within the first 2 years).

> > [...] I think it brought up the *wrong* (i.e. faulty) disk of the mirror
> > and failed on an fsck. [...]
>
> Grub doesn't know about raid and just happens to work with raid1 because it
> treats the disk as a single drive.

What's more, grub doesn't know about fsck. grub found and booted a kernel.
The kernel then decided that its root FS on /dev/md0 consisted of the wrong
mirror (or maybe its LVM PV on /dev/md1; probably both). grub and the BIOS
have no part in that decision. I can see that the remaining drive might fail
to boot (which it didn't), but I *can't* see why an array should be started
in degraded mode on the *defective* mirror when both are present.

> And back in IDE days, a drive failure usually locked the controller, which
> might have had another drive on the same cable.

Totally unrelated, but yes. SATA in my case anyway.

> > I *have* seen RAID members dropped from an array without understandable
> > reasons, but, mostly, re-adding them simply worked [...]
>
> I've seen that too. I think retries are much more aggressive on single
> disks or the last one left in a raid than on the mirror.

Yes, but a retry needs a read error first. Are retries on single disks always
logged, or only on failure? Or perhaps I should ask this: are retries uncommon
enough to warrant failing array members, yet common enough that a disk that
has produced one can still be trustworthy? How do you handle disks where you
see that happen? Replace, or retry?

> > [...] there are no guarantees your specific software/kernel/driver/hardware
> > combination will not trigger some unknown (or unfixed ;-) bug.
> I had a machine with a couple of 4-year uptime runs (a red hat 7.3) where
> several of the scsi drives failed and were hot-swapped and re-synced with no
> surprises. So unless something has broken in the software recently, I mostly
> trust it.

You mean your RH 7.3 machine had every software/kernel/driver/hardware
combination there is? Like I said, I've seen (and heard of) strange
occurrences, yet, like you, I mostly trust the software, simply for lack of
choice. I *can't* verify its correct operation; I could only try to reproduce
incorrect operation, were I to notice it. When something strange happens, I
mostly attribute it to user error, bugs in file system code, or hardware
errors (memory or power supply). RAID software errors are last on my mind.
In any case, the benefits seem to outweigh the doubts.

Yet there remain these few strange occurrences, which may or may not be
RAID-related. On average, every few thousand years, a CPU will randomly
compute an incorrect result for some operation, for whatever reason. That is
unlikely enough that any single one of us is extremely unlikely to ever be
affected. But there are enough computers around that it does happen on a
daily basis. Most of the time, the effect is probably benign (a random mouse
movement, one incorrect sample in an audio stream, another Windoze
bluescreen, whatever). It might as well be RAID weirdness in one case. Or the
RAID weirdness may be the result of an obscure bug. Complex software *does*
contain bugs, you know.

> > It *would* help to understand how RAID event counts and the Linux RAID
> > implementation in general work. Has anyone got any pointers to good
> > documentation?
> I've never seen it get this wrong when auto-assembling at reboot (and I move
> disks around frequently and sometimes clone machines by splitting the
> mirrors into different machines), but it shouldn't matter in the BPC
> scenario because you are always manually telling it which partition to add
> to an already running array.

That doesn't exactly answer my question, but I'll take it as a "no, I don't".
Yes, I *did* mention that, I believe, but if your 2 TB resync doesn't
complete before a reboot or power failure, then you exactly *don't* have a
rebuild initiated by an 'mdadm --add'; after reboot, you have an
auto-assembly (I also mentioned that). And, also agreed, I've also never
***seen*** it get this wrong when auto-assembling at reboot (well, except for
once, but let's even ignore that).

My point is that auto-assembly normally takes two (or more) mirrors that are
either synchronized (normal shutdown) or at least nearly so (crash). What we
are talking about here is adding a member that might be days, months, or even
years out of date, with an arbitrary number of alternate members having been
active in between. I don't know if the RAID implementation was designed with
this usage pattern in mind. Is there a wrap-around for event counters? On
what basis are they incremented? How does the software detect which member is
more up-to-date after a crash? I'm not saying it doesn't work. I'm asking how
it works, so I can draw my own conclusions. That is what "Open Source" means,
right? And since that question is slightly off-topic on this list, I didn't
go into detail before.

Regards,
Holger

_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
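As a concrete sketch of the event-counter discussion above: each md member's
superblock stores an "Events" counter that `mdadm --examine` can print, and a
dropped member can later be reattached with `mdadm --re-add` (or `--add`,
which forces a full resync). The device names (/dev/md0, /dev/sdb1) and the
sample counter values below are assumptions for illustration, not taken from
this thread; the fabricated excerpts only stand in for real `--examine`
output.

```shell
# Commands shown for reference only; /dev/md0 and /dev/sdb1 are assumed names:
#   mdadm --examine /dev/sdb1   # per-member superblock, look for "Events :"
#   mdadm --detail  /dev/md0    # the running array's view
#
# Reattaching a previously dropped member:
#   mdadm /dev/md0 --re-add /dev/sdb1  # partial resync if a bitmap allows it
#   mdadm /dev/md0 --add    /dev/sdb1  # otherwise: full resync
#
# Fabricated --examine excerpts stand in for real output below, to show the
# comparison that decides which member is considered more up to date:
events_a=$(printf '          Events : 15937\n' | awk '/Events/ { print $3 }')
events_b=$(printf '          Events : 9120\n'  | awk '/Events/ { print $3 }')
if [ "$events_a" -gt "$events_b" ]; then
    winner=A
else
    winner=B
fi
echo "member $winner has the higher event count ($events_a vs $events_b)"
```

This only illustrates the comparison itself; it says nothing about counter
wrap-around or the increment policy, which is exactly what the mail is asking
about.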