Re: [Labs-l] [Labs-announce] Mild but long-running Tools outage in process, resolved

Andrew Bogott Thu, 29 Jun 2017 20:31:32 -0700

The kernel roll-back was a success, and things are now behaving reasonably.

At some point we'll get a proper incident report together. The short version of the story is: NFS performance was shockingly bad on the new kernel, as illustrated by the attached ridiculous graph.

Once we have a modern, less-broken kernel we'll need to try this all over again, but that won't happen right away and the update window will be pre-announced.

Thanks for bearing with us through all this! Most services seem to have survived this last round of chaos but you might want to check your sites and restart services as needed.


-Andrew


On 6/29/17 8:25 PM, Andrew Bogott wrote:

After various failed measures, we're now trying to revert back to the older kernel and switching back between NFS servers yet again. So Tools NFS (and various associated services) will probably break, at least for a few minutes.
With luck this will get us into a stable place, but I'll update again regardless.
-Andrew


On 6/29/17 3:27 PM, Andrew Bogott wrote:
The tools cluster is suffering from several maladies right now. Existing services seem to be mostly fine, but any Kubernetes services that tried to restart in the last few hours probably failed to start, and new things are still failing to start. Similarly, web services and other tools are failing to restart in several cases.
There are various theories as to what's going on -- most likely it's a kernel-version incompatibility with the newly upgraded NFS server. There was an earlier LDAP outage which is better understood and should be resolved by now.
We apologize for the inconvenience, and are working frantically to restore stability. There will be a follow-up email when things are resolved.
-Andrew

_______________________________________________
Labs-announce mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-announce

_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
