On Sun, Mar 04, 2012 at 08:29:49PM -0600, Elijah Wright wrote:
> Some folks were discussing MTBF on chassis Friday - does anybody know
> if there are actual MTBF numbers for chassis failures on
> Dell gear (that are not simply made up...)?  We mostly see "lots and
> lots of drive failures" on RAID arrays, and those are pretty easily
> actionable - the number of total chassis failures is so low that I
> have trouble tracking it.  ;)

Chassis/backplane failures are rare.  Not too long ago I had a
double RAID failure that seemed to go away after a reboot (one of the
disks was really bad; the other looked just fine after the reboot).  But
we were getting I/O errors on both, and I'm not going to leave
a drive in production that was giving me I/O errors, so I called
it a backplane error and replaced the backplane.  (Actually, I swapped
out the whole goddamn chassis, put the good drive and a fresh drive in
the new chassis, and rebuilt the RAID, as I generally try to minimize
the amount of time spent fucking around when production is down.)

I haven't seen an error on the first drive since.

I ended up replacing the backplane and all the SATA cables on the
chassis that had the issue.  It passed a long, hard burn-in and
has been in production on something else for around a year now without
problems.  Double disk failures... frighten me.

But overall?  Backplanes seem way more reliable than they have any right
to be.  It's very rare that I see anything that even smells like a
backplane failure, and when I do, it's usually something like this,
where I can't prove anything.
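
On the MTBF question itself: if you do keep count of chassis failures
across a fleet, a rough number falls out of total unit-hours divided by
failures.  A minimal sketch, assuming a constant failure rate; the
function names and the fleet numbers in the example are hypothetical,
not measurements from Dell gear or anyone else's:

```python
HOURS_PER_YEAR = 24 * 365.25  # 8766 hours


def observed_mtbf_hours(units, years_observed, failures):
    """Point estimate: total unit-hours in service divided by failures seen."""
    if failures == 0:
        return None  # nothing failed yet, so there is no finite estimate
    return units * years_observed * HOURS_PER_YEAR / failures


def expected_failures_per_year(units, mtbf_hours):
    """Expected fleet-wide failures per year at a constant failure rate."""
    return units * HOURS_PER_YEAR / mtbf_hours


# Hypothetical fleet: 200 chassis watched for 2 years, 1 chassis failure.
mtbf = observed_mtbf_hours(200, 2, 1)
per_year = expected_failures_per_year(200, mtbf)
```

With only one or two failures observed, the estimate is very loose, which
is the same "so low I have trouble tracking it" problem in numeric form.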

> One place that I worked a few years ago, we got within striking
> distance of having our Nagios install able to
> open/close/escalate/self-heal a variety of very common issues.  It's
> *great* when you get to that point - it makes your life not suck.


> 
> > One thing I want is a phone call that won't stop calling until I
> > press a number to ack it, or something.
> 
> That's PagerDuty.  I think you'd like it.  [When it devolves to me
> doing free PR for pagerduty - color me impressed with their service.]
> 
> I think I've seen it 'lag' only one time - the rest of the alerting
> has been incredibly timely and on-the-ball.  [I imagine that they'd
> love to know that I saw a funny blip, but I never bothered to tell
> them....]

Hm.  Well, the price seems a reasonable thing to pay to avoid maintaining
that myself, and I can always go back and write my own later.  I'm trying
the 'free trial' now.
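
For reference, the "keep calling until I press a number" behavior is a
small loop if I do end up writing it myself.  A rough sketch only:
place_call and is_acked are hypothetical callables standing in for
whatever telephony gateway gets used, and none of this reflects how
PagerDuty actually works:

```python
import time


def page_until_acked(place_call, is_acked, retry_seconds=60,
                     max_attempts=None, sleep=time.sleep):
    """Call repeatedly until the alert is acknowledged.

    place_call: hypothetical callable that dials out and blocks until
                the call ends.
    is_acked:   hypothetical callable returning True once a keypress
                acknowledgement has been recorded.
    Returns the number of calls placed, or None if max_attempts ran out
    without an acknowledgement.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        place_call()
        attempts += 1
        if is_acked():
            return attempts
        sleep(retry_seconds)  # wait before ringing again
    return None
```

With max_attempts left at None it really will not stop calling, so the
ack path needs to be tested before it gets pointed at a real phone.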
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
