I must also congratulate the ops team on this; tracking down a failed cable is almost worse than finding a needle in a haystack. Doing so in less than 4 hours is phenomenal. I've only seen a small handful of cases where a cable was the cause of a failure.
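
For anyone curious about the tradeoff Marc describes below (multipathing both channels to one server vs. wiring them to two distinct servers), here is a rough back-of-envelope sketch in Python. The failure probabilities are invented purely for illustration, not measured values:

    # Rough comparison of the two ways to use the shelf's two channels.
    # All probabilities are made-up annual figures, NOT measured values.

    P_SERVER = 0.10   # hypothetical chance a given server fails in a year
    P_CABLE = 0.005   # hypothetical chance a given cable/path fails in a year

    # Option A: multipath both channels to one server.
    # A single cable fault is masked, but the server is a single point of failure.
    outage_multipath = P_SERVER + (1 - P_SERVER) * P_CABLE ** 2

    # Option B: wire each channel to a distinct server (the chosen setup).
    # A full outage needs both legs (a server or its cable) to fail; a single
    # cable fault "only" forces a failover, as happened here.
    p_leg = P_SERVER + (1 - P_SERVER) * P_CABLE
    outage_two_servers = p_leg ** 2

    print(f"multipath, one server: ~{outage_multipath:.2%} chance of full outage per year")
    print(f"two distinct servers:  ~{outage_two_servers:.2%} chance of full outage per year")

With server failures assumed to be an order of magnitude more likely than cable failures, the two-server wiring comes out far ahead on full outages, at the cost of a manual failover when a cable does go, which matches Marc's reasoning.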
On Fri, Nov 7, 2014 at 7:27 AM, Petr Bena <[email protected]> wrote:
> Shit happens.
>
> Given that this outage was so quickly resolved I nearly didn't even
> notice it. Good job devops!
>
> On Fri, Nov 7, 2014 at 7:45 AM, Pine W <[email protected]> wrote:
> > That sounds good. This story provides a good illustration of the
> > usefulness of having spares of critical parts.
> >
> > I've experienced my share of hardware mysteries. Today's mystery for me
> > involved a router.
> >
> > Thank you again to those of you who were the hardware medics today.
> >
> > Regards,
> >
> > Pine
> >
> > On Nov 6, 2014 7:15 PM, "Marc A. Pelletier" <[email protected]> wrote:
> >>
> >> On 11/06/2014 05:13 PM, Pine W wrote:
> >> > It will be interesting to see the post-action report and
> >> > recommendations for prevention, if possible.
> >>
> >> There is, in the end, very little that can be done to prevent freak
> >> failures of this sort; they are thankfully rare but basically impossible
> >> to predict.
> >>
> >> The disk shelves have a lot of redundancy, but the two channels can be
> >> used either to multipath to a single server, or to wire two distinct
> >> servers; we chose the latter because servers - as a whole - have a lot
> >> more moving parts and a much shorter MTBF. This makes us more
> >> vulnerable to the rarer failure of the communication path, and much less
> >> vulnerable to the server /itself/ having a failure of some sort.
> >>
> >> This time, we were just extremely unlucky. Cabling rarely fails if it
> >> worked at all, and the chance that one would suddenly stop working
> >> right after a year of use is ridiculously low. This is why it took
> >> quite a bit of time to even /locate/ the fault: we tried pretty much
> >> everything /else/ first, given how improbable a cable fault is. The
> >> actual fix took less than 15 minutes all told; the roughly three hours
> >> prior were spent trying to find the fault everywhere else first.
> >>
> >> I'm not sure there's anything we could have done differently, or that
> >> we should do differently in the future. We were able to diagnose the
> >> problem at all because we had pretty much all the hardware in double at
> >> the DC, and had we not isolated the fault we could still have fired up
> >> the backup server (once we had eliminated the shelves themselves as
> >> being faulty).
> >>
> >> The only thing we're missing right now is a spare disk enclosure; if we
> >> had had a failed shelf we would have been stuck waiting for a
> >> replacement from the vendor rather than being able to simply swap the
> >> hardware on the spot. That's an issue I will raise at the next
> >> operations meeting.
> >>
> >> -- Marc
_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
