That sounds good. This story is a useful illustration of why it pays to keep spares of critical parts.
I've experienced my share of hardware mysteries. Today's mystery for me
involved a router. Thank you again to those of you who were the hardware
medics today.

Regards,
Pine

On Nov 6, 2014 7:15 PM, "Marc A. Pelletier" <[email protected]> wrote:
> On 11/06/2014 05:13 PM, Pine W wrote:
> > It will be interesting to see the post-action report and recommendations
> > for prevention, if possible.
>
> There is, in the end, very little that can be done to prevent freak
> failures of this sort; they are thankfully rare but basically impossible
> to predict.
>
> The disk shelves have a lot of redundancy, but the two channels can be
> used either to multipath to a single server, or to wire two distinct
> servers; we chose the latter because servers - as a whole - have a lot
> more moving parts and a much shorter MTBF. This makes us more
> vulnerable to the rarer failure of the communication path, and much less
> vulnerable to the server /itself/ having a failure of some sort.
>
> This time, we were just extremely unlucky. Cabling rarely fails if it
> worked at all, and the chance that a cable would suddenly stop working
> after a year of use is ridiculously low. This is why it took quite a
> bit of time to even /locate/ the fault: we tried pretty much
> everything /else/ first, given how improbable a cable fault is. The
> actual fix took less than 15 minutes all told; the roughly three hours
> prior were spent trying to find the fault everywhere else first.
>
> I'm not sure there's anything we could have done differently, or that we
> should do differently in the future. We were able to diagnose the
> problem at all because we had pretty much all the hardware in double at
> the DC, and had we not isolated the fault we could still have fired up
> the backup server (once we had eliminated the shelves themselves as
> being faulty).
>
> The only thing we're missing right now is a spare disk enclosure; if we
> had had a failed shelf, we would have been stuck waiting for a
> replacement from the vendor rather than being able to simply swap the
> hardware on the spot. That's an issue that I will raise at the next
> operations meeting.
>
> -- Marc
>
> _______________________________________________
> Labs-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/labs-l
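The trade-off Marc describes, accepting the rarer path failure in exchange for protection against the more common whole-server failure, can be sketched numerically. The following is a minimal illustration with assumed MTBF figures (the numbers are hypothetical, not actual Labs data), using a simple exponential failure model:

```python
import math

def annual_failure_prob(mtbf_years):
    # Exponential failure model: P(fail within 1 year) = 1 - e^(-1/MTBF)
    return 1 - math.exp(-1 / mtbf_years)

# Illustrative, assumed figures -- not measured values.
server_mtbf_years = 3.0    # servers: many moving parts, shorter MTBF
path_mtbf_years = 50.0     # cabling/path: rarely fails once it works at all

# Option A: multipath both shelf channels to one server, so that
# server becomes the single point of failure for the shelf.
p_one_server = annual_failure_prob(server_mtbf_years)

# Option B: wire each channel to a distinct server, so the dominant
# single point of failure shifts to the much rarer path fault.
p_path = annual_failure_prob(path_mtbf_years)

print(f"one-server SPOF:  {p_one_server:.1%} yearly risk")
print(f"path-fault SPOF:  {p_path:.1%} yearly risk")
```

Under these assumptions the dual-server wiring trades a roughly one-in-four yearly server risk for a roughly one-in-fifty path risk, which matches the reasoning in the message, while leaving the rare cable fault as the residual exposure.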
