Re: [Cloud] Getting a lot of 502, 503 server errors on toolforge ???

Arthur Smith Mon, 04 Jan 2021 15:59:40 -0800

I've restarted it 3 times already!

On Mon, Jan 4, 2021 at 5:41 PM Brooke Storm <[email protected]> wrote:


> Hello Arthur,
> I suspect this could be related to a serious problem with LDAP TLS that
> happened yesterday around the time I’m seeing in the graph. Some
> information is on this ticket (https://phabricator.wikimedia.org/T271063).
> That broke Gerrit authentication and lots of other Cloud Services and
> Toolforge-related things until it was resolved. That said, it sounds
> like something else may also be going on that we can look into. If you
> haven’t already, restarting the web service might not be a bad idea.
>
> If it doesn’t clear up with a restart, please make a Phabricator task to
> help coordinate.
>
> Brooke Storm
> Staff SRE
> Wikimedia Cloud Services
> [email protected]
>
>
>
> On Jan 4, 2021, at 3:27 PM, Arthur Smith <[email protected]> wrote:
>
> My toolforge service (https://author-disambiguator.toolforge.org/) keeps
> becoming unavailable with hangs/502 Bad Gateway or other server errors a
> few minutes after I restart it, and I can't see what could be causing this.
> These errors don't show up in the error log, and the 502 responses don't
> show up in the access log (which has had very little traffic anyway - one
> request per minute at most usually?). I can connect to the Kubernetes pod
> with kubectl and everything looks normal; there are only a few processes
> listed in /proc, etc. (though it would be nice to have some other
> monitoring tools like ps and netstat installed by default?) But I can't get
> a response via the web after the first few minutes.
>
> The problem seems to have started mid-day yesterday - see the monitor data
> here:
>
>
> https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&refresh=5m&var-namespace=tool-author-disambiguator
>
> with the surge in 4xx and 5xx status codes on 1/3 (by the way, I don't
> see the surge in 4xx status codes in access.log recently either - there are
> 2 from this morning and none yesterday, nothing like the multiple per
> second indicated in that grafana chart!)
>
> Any ideas what's going on? This looks like some sort of upstream issue
> with nginx maybe?
>
> I am seeing a "You have run out of local ports" error in the error logs
> from earlier today (but it hasn't repeated recently) which is maybe a clue?
> I don't think that could possibly be from anything my service is doing
> though!
>
> Help would be greatly appreciated, thanks!
>
>    Arthur Smith
> _______________________________________________
> Wikimedia Cloud Services mailing list
> [email protected] (formerly [email protected])
> https://lists.wikimedia.org/mailman/listinfo/cloud
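For anyone else stuck in a pod without ps or netstat: the socket table that netstat would print is also readable directly from /proc/net/tcp, which may help check whether something is piling up connections when a "run out of local ports" error appears. The following is only a minimal sketch, assuming Python 3 happens to be available in the container; the function names are made up for illustration, and the state codes are the ones the Linux kernel uses in that file.

```python
from collections import Counter

# TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp (or tcp6)."""
    counts = Counter()
    # The first line is the column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        # fields: slot, local address, remote address, state, ...
        counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


def read_tcp_states(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    """Sum the per-state counts over the IPv4 and IPv6 socket tables."""
    total = Counter()
    for path in paths:
        try:
            with open(path) as handle:
                total += count_tcp_states(handle.read())
        except OSError:
            pass  # the file may be missing, e.g. IPv6 disabled
    return total


if __name__ == "__main__":
    for state, number in read_tcp_states().most_common():
        print(f"{state:12} {number}")
```

A very large TIME_WAIT or ESTABLISHED count would point at outbound connections being opened faster than they are released, though that is only one possible reading of the port exhaustion message.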

