As promised, the post-mortem. tl;dr: the corruption issue we had in December is still there and bites us every now and then. We're not entirely sure what is causing the corruption, but we suspect NFS and are working to move the database to a local filesystem.
Long story: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160112-20160111-toollabs-SGE

Again, sorry for the disruptions. Unfortunately we cannot guarantee there will not be more of these outages in the near future.

Merlijn

On 11 January 2016 at 23:15, Merlijn van Deen <[email protected]> wrote:

> Somehow sending an e-mail to labs-l seems to resolve issues magically. The
> issue started around 21:00 UTC, and I'll write up a post-mortem tomorrow.
>
> On 11 January 2016 at 23:10, Merlijn van Deen <[email protected]>
> wrote:
>
>> Jobs are being queued, but are not executing. Every now and then a few
>> jobs /are/ executed, but the backlog is ~20 minutes. We're not quite sure
>> what's happening, unfortunately, but we're working on it.
_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
