On Sun, Mar 04, 2012 at 08:29:49PM -0600, Elijah Wright wrote:
> Some folks were discussing MTBF on chassis Friday - does anybody know
> if there are actual MTBF numbers for chassis failures on
> Dell gear (that are not simply made up...)? We mostly see "lots and
> lots of drive failures" on RAID arrays, and those are pretty easily
> actionable - the number of total chassis failures is so low that I
> have trouble tracking it. ;)

Chassis/backplane failures are rare. I know not too long ago I got a
double RAID failure that seemed to go away after a reboot (one of the
disks was really bad, one looked just fine after the reboot), but we
were getting I/O errors on both, and I'm not going to just leave a drive
in production that was giving me I/O errors, so I called it a backplane
error and replaced the backplane. (Actually, I swapped out the whole
goddamn chassis, put the good drive and a fresh drive in the new
chassis, and rebuilt the RAID, as I generally try to minimize the amount
of time spent fucking around when production is down.) I haven't seen
an error on the first drive since.

I ended up replacing the backplane and all the SATA cables on the
chassis that had the issue, and it passed a long and hard burn-in, and
has been in production on something else for around a year now, without
problems.

Double disk failures... frighten me.

But overall? Backplanes seem way more reliable than they have any right
to be. It's very rare that I see anything that even smells like a
backplane failure, and then, it's usually something like this where I
can't prove anything.

> One place that I worked a few years ago, we got within striking
> distance of having our Nagios install able to
> open/close/escalate/self-heal a variety of very common issues. It's
> *great* when you get to that point - it makes your life not suck.
>
> > One thing I want is a phone call that won't stop calling until I
> > press a number to ack it, or something.
>
> That's PagerDuty. I think you'd like it. [When it devolves to me
> doing free PR for pagerduty - color me impressed with their service.]
>
> I think I've seen it 'lag' only one time - the rest of the alerting
> has been incredibly timely and on-the-ball. [I imagine that they'd
> love to know that I saw a funny blip, but I never bothered to tell
> them....]

Hm.
well, the price seems a reasonable thing to pay to avoid maintaining that myself, and I can always go back and write that myself. I'm trying the 'free trial' now.
