Feel free to nuke the whole icinga project if it frees up some resources.

On Mon, May 18, 2015 at 7:16 PM, Andrew Bogott <[email protected]> wrote:
> A similar failure just happened on a different compute node. We're
> researching to see if these two failures were related.
>
> In the meantime -- all hosts are restarting and everything should be up
> within a couple of minutes -- total downtime no more than 10 minutes. A
> full list of affected instances is at the bottom of this page:
>
> https://wikitech.wikimedia.org/wiki/Incident_documentation/20150518-LabsOutage
>
> -A
>
>
>
> On 5/16/15 7:30 AM, Andrew Bogott wrote:
>>
>> This turns out to not have been a heating issue, or at least not entirely
>> -- it was some kind of kernel lockup. Coren and others rebooted the system
>> and restarted all instances, and things seem to be working fine now. We
>> don't have much explanation for what caused the problem, though, so we'll be
>> on the lookout.
>>
>> -A
>>
>>
>> On 5/15/15 11:31 PM, Andrew Bogott wrote:
>>>
>>> The hardware curse continues!
>>>
>>> One of the labs virt hosts (labvirt1003) is running very hot tonight,
>>> presumably due to a broken fan. It is intermittently scaling the CPU speed
>>> way back to avoid melting; when that happens there are bound to be lots of
>>> side-effects like unresponsive instances, clock drift, and the like (not
>>> least of which is that right now I can't ssh into the damn thing, or get
>>> performance metrics.)
>>>
>>> Naturally this started happening late on a Friday, so it may be a while
>>> before I can get someone in the datacenter. I'm leaving the host up in the
>>> meantime, based on the notion that half a server is better than none, but
>>> poor performance is likely to be the norm in the meantime.
>>>
>>> I did shut off one instance: wikidata-wdq-mm. I don't have a personal
>>> grudge, but it was gobbling CPU cycles and the system really needs a rest.
>>> If loss of that instance is a disaster for anyone, contact me and I'll see
>>> if I can revive it and shut off ten or so other instances to make room.
>>>
>>> Updates as events warrant!
>>>
>>> -Andrew
>>
>>
>
>
> _______________________________________________
> Labs-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/labs-l
