This one has no instance list!!

On Wed, May 20, 2015 at 12:29 AM, Yuvi Panda <[email protected]> wrote:
> And again:
> https://wikitech.wikimedia.org/wiki/Incident_documentation/20150519-LabsOutage
>
> On Mon, May 18, 2015 at 1:16 PM, Andrew Bogott <[email protected]> wrote:
>> A similar failure just happened on a different compute node. We're
>> researching to see if these two failures were related.
>>
>> In the meantime -- all hosts are restarting and everything should be up
>> within a couple of minutes -- total downtime no more than 10 minutes. A
>> full list of affected instances are at the bottom of this page:
>>
>> https://wikitech.wikimedia.org/wiki/Incident_documentation/20150518-LabsOutage
>>
>> -A
>>
>> On 5/16/15 7:30 AM, Andrew Bogott wrote:
>>>
>>> This turns out to not have been a heating issue, or at least not entirely
>>> -- it was some kind of kernel lockup. Coren and others rebooted the system
>>> and restarted all instances, and things seem to be working fine now. We
>>> don't have much explanation for what caused the problem, though, so we'll be
>>> on the lookout.
>>>
>>> -A
>>>
>>> On 5/15/15 11:31 PM, Andrew Bogott wrote:
>>>>
>>>> The hardware curse continues!
>>>>
>>>> One of the labs virt hosts (labvirt1003) is running very hot tonight,
>>>> presumably due to a broken fan. It is intermittently scaling the CPU speed
>>>> way back to avoid melting; when that happens there are bound to be lots of
>>>> side-effects like unresponsive instances, clock drift, and the like (not
>>>> least of which is that right now I can't ssh into the damn thing, or get
>>>> performance metrics.)
>>>>
>>>> Naturally this started happening late on a Friday, so it may be a while
>>>> before I can get someone in the datacenter. I'm leaving the host up in the
>>>> meantime, based on the notion that half a server is better than none, but
>>>> poor performance is likely to be the norm in the meantime.
>>>>
>>>> I did shut off one instance: wikidata-wdq-mm. I don't have a personal
>>>> grudge, but it was gobbling CPU cycles and the system really needs a rest.
>>>> If loss of that instance is a disaster for anyone, contact me and I'll see
>>>> if I can revive it and shut off ten or so other instances to make room.
>>>>
>>>> Updates as events warrant!
>>>>
>>>> -Andrew
>>>
>>
>> _______________________________________________
>> Labs-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
> --
> Yuvi Panda T
> http://yuvi.in/blog
>
> _______________________________________________
> Labs-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/labs-l
