Re: [Labs-l] Yet another partial labs outage (Resolved)

Petr Bena Wed, 20 May 2015 00:23:03 -0700

This one has no instance list!!

On Wed, May 20, 2015 at 12:29 AM, Yuvi Panda <[email protected]> wrote:
> And again: 
> https://wikitech.wikimedia.org/wiki/Incident_documentation/20150519-LabsOutage
>
>
>
> On Mon, May 18, 2015 at 1:16 PM, Andrew Bogott <[email protected]> wrote:
>> A similar failure just happened on a different compute node.  We're
>> researching to see if these two failures were related.
>>
>> In the meantime -- all hosts are restarting and everything should be up
>> within a couple of minutes -- total downtime no more than 10 minutes.  A
>> full list of affected instances is at the bottom of this page:
>>
>> https://wikitech.wikimedia.org/wiki/Incident_documentation/20150518-LabsOutage
>>
>> -A
>>
>>
>>
>> On 5/16/15 7:30 AM, Andrew Bogott wrote:
>>>
>>> This turns out to not have been a heating issue, or at least not entirely
>>> -- it was some kind of kernel lockup.  Coren and others rebooted the system
>>> and restarted all instances, and things seem to be working fine now.  We
>>> don't have much explanation for what caused the problem, though, so we'll be
>>> on the lookout.
>>>
>>> -A
>>>
>>>
>>> On 5/15/15 11:31 PM, Andrew Bogott wrote:
>>>>
>>>> The hardware curse continues!
>>>>
>>>> One of the labs virt hosts (labvirt1003) is running very hot tonight,
>>>> presumably due to a broken fan.  It is intermittently scaling the CPU speed
>>>> way back to avoid melting; when that happens there are bound to be lots of
>>>> side-effects like unresponsive instances, clock drift, and the like (not
>>>> least of which is that right now I can't ssh into the damn thing, or get
>>>> performance metrics.)
>>>>
>>>> Naturally this started happening late on a Friday, so it may be a while
>>>> before I can get someone in the datacenter.  I'm leaving the host up in the
>>>> meantime, based on the notion that half a server is better than none, but
>>>> poor performance is likely to be the norm until then.
>>>>
>>>> I did shut off one instance:  wikidata-wdq-mm.  I don't have a personal
>>>> grudge, but it was gobbling CPU cycles and the system really needs a rest.
>>>> If loss of that instance is a disaster for anyone, contact me and I'll see
>>>> if I can revive it and shut off ten or so other instances to make room.
>>>>
>>>> Updates as events warrant!
>>>>
>>>> -Andrew
>>>
>>>
>>
>>
>> _______________________________________________
>> Labs-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
>
>
> --
> Yuvi Panda T
> http://yuvi.in/blog
>


