Shit happens. Given how quickly this outage was resolved, I nearly didn't even notice it. Good job, devops!
On Fri, Nov 7, 2014 at 7:45 AM, Pine W <[email protected]> wrote:
> That sounds good. This story provides a good illustration of the usefulness
> of having spares of critical parts.
>
> I've experienced my share of hardware mysteries. Today's mystery for me
> involved a router.
>
> Thank you again to those of you who were the hardware medics today.
>
> Regards,
>
> Pine
>
> On Nov 6, 2014 7:15 PM, "Marc A. Pelletier" <[email protected]> wrote:
>>
>> On 11/06/2014 05:13 PM, Pine W wrote:
>> > It will be interesting to see the post-action report and recommendations
>> > for prevention, if possible.
>>
>> There is, in the end, very little that can be done to prevent freak
>> failures of this sort; they are thankfully rare but basically impossible
>> to predict.
>>
>> The disk shelves have a lot of redundancy, but the two channels can be
>> used either to multipath to a single server or to wire two distinct
>> servers; we chose the latter because servers - as a whole - have a lot
>> more moving parts and a much shorter MTBF. This makes us more
>> vulnerable to the rarer failure of the communication path, and much less
>> vulnerable to the server /itself/ having a failure of some sort.
>>
>> This time, we were just extremely unlucky. Cabling rarely fails if it
>> worked at all, and the chance that a cable would suddenly stop working
>> after a year of use is ridiculously low. This is why it took quite a
>> bit of time even to /locate/ the fault: we tried pretty much everything
>> /else/ first, given how improbable a cable fault is. The actual fix
>> took less than 15 minutes all told; the roughly three hours before that
>> were spent looking for the fault everywhere else.
>>
>> I'm not sure there's anything we could have done differently, or that we
>> should do differently in the future. We were able to diagnose the
>> problem at all because we had pretty much all the hardware doubled at
>> the DC, and had we not isolated the fault we could still have fired up
>> the backup server (once we had eliminated the shelves themselves as
>> being faulty).
>>
>> The only thing we're missing right now is a spare disk enclosure; if we
>> had had a failed shelf, we would have been stuck waiting for a
>> replacement from the vendor rather than being able to simply swap the
>> hardware on the spot. That's an issue I will raise at the next
>> operations meeting.
>>
>> -- Marc
>>
>> _______________________________________________
>> Labs-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/labs-l
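The wiring tradeoff Marc describes (multipathing both channels to one server vs. giving each channel to a distinct server) can be sketched as a toy availability model. All of the probabilities below are invented for illustration; none are actual WMF figures. The point is only the shape of the comparison: when servers fail much more often than cables, two independently wired servers come out ahead despite the extra exposure to single-cable faults.

```python
# Toy availability model of the two shelf-wiring choices.
# All availability numbers are made up for illustration.

def p_unreachable_multipath(a_server: float, a_cable: float) -> float:
    """One server, both channels multipathed to it.

    The shelf is unreachable if the lone server is down,
    or if both cables have failed."""
    both_cables_down = (1 - a_cable) ** 2
    return 1 - a_server * (1 - both_cables_down)

def p_unreachable_two_servers(a_server: float, a_cable: float) -> float:
    """Two distinct servers, one channel each.

    The shelf is unreachable only if both server+cable paths are down
    (assuming failover to the surviving server works)."""
    path_up = a_server * a_cable
    return (1 - path_up) ** 2

if __name__ == "__main__":
    a_server = 0.99    # servers: many moving parts, shorter MTBF (made up)
    a_cable = 0.9999   # cabling "rarely fails if it worked at all" (made up)
    print(f"multipath, one server: {p_unreachable_multipath(a_server, a_cable):.6f}")
    print(f"two distinct servers:  {p_unreachable_two_servers(a_server, a_cable):.6f}")
```

Under these assumed numbers the two-server wiring is roughly two orders of magnitude less likely to lose the shelf, because a single server failure dominates the multipath case; the cost is exactly the failure mode seen in this outage, where one path's cable fault takes the active server's access down and diagnosis is harder because a cable is the last thing anyone suspects.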
