Re: [Labs-l] Yet another partial labs outage (Resolved)

Andrew Bogott Sat, 16 May 2015 05:31:13 -0700

This turns out to not have been a heating issue, or at least not entirely -- it was some kind of kernel lockup. Coren and others rebooted the system and restarted all instances, and things seem to be working fine now. We don't have much explanation for what caused the problem, though, so we'll be on the lookout.

-A



On 5/15/15 11:31 PM, Andrew Bogott wrote:

The hardware curse continues!
One of the labs virt hosts (labvirt1003) is running very hot tonight, presumably due to a broken fan. It is intermittently scaling the CPU speed way back to avoid melting; when that happens there are bound to be lots of side-effects like unresponsive instances, clock drift, and the like (not least of which is that right now I can't ssh into the damn thing, or get performance metrics.)
Naturally this started happening late on a Friday, so it may be a while before I can get someone in the datacenter. I'm leaving the host up in the meantime, based on the notion that half a server is better than none, but poor performance is likely to be the norm in the meantime.
I did shut off one instance: wikidata-wdq-mm. I don't have a personal grudge, but it was gobbling CPU cycles and the system really needs a rest. If loss of that instance is a disaster for anyone, contact me and I'll see if I can revive it and shut off ten or so other instances to make room.
Updates as events warrant!

-Andrew



_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
