On 11/06/2014 05:13 PM, Pine W wrote:
> It will be interesting to see the post-action report and recommendations
> for prevention, if possible.
There is, in the end, very little that can be done to prevent freak failures of this sort; they are thankfully rare but basically impossible to predict.

The disk shelves have a lot of redundancy, but the two channels can be used either to multipath to a single server, or to wire two distinct servers; we chose the latter because servers - as a whole - have many more moving parts and a much shorter MTBF. This makes us more vulnerable to the rarer failure of the communication path, and much less vulnerable to the server /itself/ having a failure of some sort (there's a back-of-the-envelope sketch of that trade-off at the end of this mail). This time, we were just extremely unlucky.

Cabling rarely fails if it worked at all, and the chances that a cable would suddenly stop working after a year of use are ridiculously low. This is why it took quite a bit of time to even /locate/ the fault: we tried pretty much everything /else/ first, given how improbable a cable fault is. The actual fix took less than 15 minutes all told; the roughly three hours prior were spent trying to find the fault everywhere else.

I'm not sure there's anything we could have done differently, or that we should do differently in the future. We were able to diagnose the problem at all because we had pretty much all the hardware duplicated at the DC, and had we not isolated the fault we could still have fired up the backup server (once we had eliminated the shelves themselves as faulty).

The only thing we're missing right now is a spare disk enclosure; if a shelf had failed, we would have been stuck waiting for a replacement from the vendor rather than being able to simply swap the hardware on the spot. That's an issue I will raise at the next operations meeting.

-- 
Marc
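P.S. For the curious, a very rough sketch of the trade-off mentioned above. The MTBF figures are invented purely for illustration; it's the orders of magnitude that drive the choice, not the exact values.

import math

HOURS_PER_YEAR = 24 * 365

def p_fail(mtbf_hours, t_hours=HOURS_PER_YEAR):
    # Probability of at least one failure in t_hours, assuming
    # exponentially distributed failures with the given MTBF.
    return 1 - math.exp(-t_hours / mtbf_hours)

MTBF_SERVER = 50_000     # assumed: many moving parts, fails comparatively often
MTBF_CABLE = 2_000_000   # assumed: passive cabling, almost never fails

# Option A: both channels multipathed to one server. The cabling is
# redundant, so an outage requires the (single) server to fail.
p_multipath = p_fail(MTBF_SERVER)

# Option B (our setup): one channel to each of two servers. A dead
# server is survivable (fail over to its twin), but a cable fault on
# the active path is an outage until it's found - as happened here.
p_two_servers = p_fail(MTBF_CABLE)

print(f"multipath to one server:    ~{p_multipath:.1%} chance of an outage/year")
print(f"two servers, one path each: ~{p_two_servers:.1%} chance of an outage/year")

With those made-up numbers the two-server wiring comes out roughly 30x less likely to cause an outage in a given year, which is why the rarer cable failure was the risk we chose to accept.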
