A similar failure just happened on a different compute node. We're
investigating whether the two failures are related.
In the meantime, all hosts are restarting and everything should be up
within a couple of minutes, with total downtime of no more than 10
minutes. A full list of affected instances is at the bottom of this page:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150518-LabsOutage
-A
On 5/16/15 7:30 AM, Andrew Bogott wrote:
This turns out to not have been a heating issue, or at least not
entirely -- it was some kind of kernel lockup. Coren and others
rebooted the system and restarted all instances, and things seem to be
working fine now. We don't have much explanation for what caused the
problem, though, so we'll be on the lookout.
-A
On 5/15/15 11:31 PM, Andrew Bogott wrote:
The hardware curse continues!
One of the labs virt hosts (labvirt1003) is running very hot tonight,
presumably due to a broken fan. It is intermittently scaling the CPU
speed way back to avoid melting; when that happens there are bound to
be lots of side-effects like unresponsive instances, clock drift, and
the like (not least of which is that right now I can't ssh into the
damn thing or get performance metrics).
Naturally this started happening late on a Friday, so it may be a
while before I can get someone in the datacenter. I'm leaving the
host up in the meantime, based on the notion that half a server is
better than none, but poor performance is likely to be the norm until
then.
I did shut off one instance: wikidata-wdq-mm. I don't have a
personal grudge, but it was gobbling CPU cycles and the system really
needs a rest. If loss of that instance is a disaster for anyone,
contact me and I'll see if I can revive it and shut off ten or so
other instances to make room.
Updates as events warrant!
-Andrew
_______________________________________________
Labs-l mailing list
https://lists.wikimedia.org/mailman/listinfo/labs-l