Re: [Labs-l] SGE issues again

Merlijn van Deen Tue, 12 Jan 2016 12:40:25 -0800

As promised, the post-mortem.

tl;dr: the corruption issue we had in December is still there, and bites us
every now and then. We're not entirely sure what is causing the corruption,
but we suspect NFS, and are working to move the database to a local
filesystem.


Long story:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20160112-20160111-toollabs-SGE

Again, sorry for the disruptions. Unfortunately we cannot guarantee there
will not be more of these outages in the near future.

Merlijn

On 11 January 2016 at 23:15, Merlijn van Deen <[email protected]> wrote:

> Somehow sending an e-mail to labs-l seems to resolve issues magically. The
> issue started around 21:00 UTC, and I'll write up a post-mortem tomorrow.
>
> On 11 January 2016 at 23:10, Merlijn van Deen <[email protected]>
> wrote:
>
>> Jobs are being queued, but are not executing. Every now and then a few
>> jobs /are/ executed, but the backlog is ~20 minutes. We're not quite sure
>> what's happening, unfortunately, but we're working on it.
>>
>
>

_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
