I mean in Phabricator - https://phabricator.wikimedia.org/T271181
Arthur

On Mon, Jan 4, 2021 at 8:14 PM Arthur Smith <[email protected]> wrote:

> Ok, see T271181 in Toolforge.
>
> Arthur
>
> On Mon, Jan 4, 2021 at 6:59 PM Arthur Smith <[email protected]> wrote:
>
>> I've restarted it 3 times already!
>>
>> On Mon, Jan 4, 2021 at 5:41 PM Brooke Storm <[email protected]> wrote:
>>
>>> Hello Arthur,
>>> I suspect this could be related to a serious problem with LDAP TLS that
>>> happened yesterday around the time I'm seeing in the graph. Some
>>> information is on this ticket (https://phabricator.wikimedia.org/T271063).
>>> That broke Gerrit authentication and lots of other Cloud Services and
>>> Toolforge related things until it was resolved. That said, it sounds like
>>> there may also be something else going on that we can take a look into.
>>> If you haven't already, restarting the web service might not be a bad idea.
>>>
>>> If it doesn't clear up with a restart, please make a Phabricator task to
>>> help coordinate.
>>>
>>> Brooke Storm
>>> Staff SRE
>>> Wikimedia Cloud Services
>>> [email protected]
>>>
>>> On Jan 4, 2021, at 3:27 PM, Arthur Smith <[email protected]> wrote:
>>>
>>> My Toolforge service (https://author-disambiguator.toolforge.org/) keeps
>>> becoming unavailable with hangs/502 Bad Gateway or other server errors a
>>> few minutes after I restart it, and I can't see what could be causing this.
>>> These errors don't show up in the error log, and the 502 responses don't
>>> show up in the access log (which has had very little traffic anyway - one
>>> request per minute at most, usually?). I can connect to the Kubernetes pod
>>> with kubectl and everything looks normal; there are only a few processes
>>> listed in /proc, etc. (though it would be nice to have some other
>>> monitoring tools like ps and netstat installed by default?). But I can't
>>> get a response via the web after the first few minutes.
>>>
>>> The problem seems to have started mid-day yesterday - see the monitoring
>>> data here:
>>>
>>> https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&refresh=5m&var-namespace=tool-author-disambiguator
>>>
>>> with the surge in 4xx and 5xx status codes on 1/3. (By the way, I don't
>>> see the surge in 4xx status codes in access.log recently either - there
>>> are 2 from this morning and none yesterday, nothing like the multiple per
>>> second indicated in that Grafana chart!)
>>>
>>> Any ideas what's going on? This looks like some sort of upstream issue
>>> with nginx, maybe?
>>>
>>> I am also seeing a "You have run out of local ports" error in the error
>>> logs from earlier today (but it hasn't repeated recently), which is maybe
>>> a clue? I don't think that could possibly be from anything my service is
>>> doing, though!
>>>
>>> Help would be greatly appreciated, thanks!
>>>
>>> Arthur Smith
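For reference, the restart and in-pod poking around discussed above look roughly like the following, run from the tool account on the Toolforge bastion. This is only a sketch: the webservice invocation depends on how the tool was originally started, and the pod name below is a placeholder for whatever "kubectl get pods" actually reports.

    # Restart the web service and check that it comes back up
    webservice restart
    webservice status

    # Inspect the pod from the bastion (pod name is a placeholder)
    kubectl get pods
    kubectl logs tool-author-disambiguator-pod-placeholder
    kubectl exec -it tool-author-disambiguator-pod-placeholder -- /bin/bash

    # Inside the pod, ps and netstat may be missing, but /proc still works
    ls -d /proc/[0-9]*                           # list running PIDs
    cat /proc/1/cmdline | tr '\0' ' '; echo      # show a PID's command line (PID 1 as an example)
    wc -l /proc/net/tcp /proc/net/tcp6           # rough count of open TCP sockets
    cat /proc/sys/net/ipv4/ip_local_port_range   # ephemeral port range, relevant to the
                                                 # "run out of local ports" message

A socket count from /proc/net/tcp that keeps climbing between restarts would suggest a connection leak; if it stays flat while the 502s continue, the problem is more likely in front of the pod (the nginx/ingress layer) than inside it.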
_______________________________________________
Wikimedia Cloud Services mailing list
[email protected] (formerly [email protected])
https://lists.wikimedia.org/mailman/listinfo/cloud
