The root cause has been found, and everything has been working again for the last few hours. The outage report is at https://wikitech.wikimedia.org/wiki/Incident_documentation/20150527-GridEngine
Thanks.

On Thu, May 28, 2015 at 1:59 PM, Russell Blau <[email protected]> wrote:
> Yuvi Panda <yuvipanda <at> gmail.com> writes:
>
>> It's been back and working mostly well for a while now. According to
>> alerts the partial outage was from 18:33 UTC to 20:17 UTC. More
>> details to follow later, here and at
>> https://phabricator.wikimedia.org/T100554
>
> This seems not to be entirely fixed. All night, I have been getting
> intermittent errors on cron jobs with the following message:
>
> error: commlib error: access denied (server host resolves rdata host
> "tools-submit.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)")
>
> Curiously, not all grid jobs fail in this way; some of them have been
> running successfully, but without any apparent pattern.

-- 
Yuvi Panda T
http://yuvi.in/blog

_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
