I mean in Phabricator - https://phabricator.wikimedia.org/T271181

  Arthur

On Mon, Jan 4, 2021 at 8:14 PM Arthur Smith <[email protected]> wrote:

> Ok, see T271181 in Toolforge.
>
>   Arthur
>
> On Mon, Jan 4, 2021 at 6:59 PM Arthur Smith <[email protected]>
> wrote:
>
>> I've restarted it 3 times already!
>>
>> On Mon, Jan 4, 2021 at 5:41 PM Brooke Storm <[email protected]> wrote:
>>
>>> Hello Arthur,
>>> I suspect this could be related to a serious problem with LDAP TLS that
>>> happened yesterday around the time I’m seeing in the graph. Some
>>> information is on this ticket (https://phabricator.wikimedia.org/T271063).
>>> That broke Gerrit authentication and many other Cloud Services and
>>> Toolforge-related things until it was resolved. That said, it sounds like
>>> something else may also be going on that we can look into. If you haven’t
>>> already, restarting the web service might not be a bad idea.
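>>>
>>> (For reference, a minimal sketch of the usual restart sequence, assuming
>>> the standard Toolforge webservice tool on the Kubernetes backend:)
>>>
>>>     $ become author-disambiguator   # switch to the tool account
>>>     $ webservice status             # see what state the service reports
>>>     $ webservice restart            # stop and start the web service
>>>     $ kubectl get pods              # confirm the new pod reaches Running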
>>>
>>> If it doesn’t clear up with a restart, please make a Phabricator task to
>>> help coordinate.
>>>
>>> Brooke Storm
>>> Staff SRE
>>> Wikimedia Cloud Services
>>> [email protected]
>>>
>>>
>>>
>>> On Jan 4, 2021, at 3:27 PM, Arthur Smith <[email protected]> wrote:
>>>
>>> My toolforge service (https://author-disambiguator.toolforge.org/)
>>> keeps becoming unavailable with hangs/502 Bad Gateway or other server
>>> errors a few minutes after I restart it, and I can't see what could be
>>> causing this. These errors don't show up in the error log, and the 502
>>> responses don't show up in the access log (which has had very little
>>> traffic anyway - one request per minute at most, usually?) I can connect to
>>> the Kubernetes pod with kubectl and everything looks normal; there are only a
>>> few processes listed in /proc, etc. (though it would be nice to have some
>>> other monitoring tools like ps and netstat installed by default?) But I
>>> can't get a response via the web after the first few minutes.
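>>>
>>> (Concretely, the poking around looks something like this - a rough sketch,
>>> with the pod name taken from kubectl get pods:)
>>>
>>>     $ kubectl get pods
>>>     $ kubectl exec -it <pod-name> -- /bin/bash
>>>     # inside the pod, with only /proc available (no ps or netstat):
>>>     $ ls -d /proc/[0-9]*                   # one entry per process
>>>     $ for p in /proc/[0-9]*/cmdline; do tr '\0' ' ' < "$p"; echo; done
>>>     $ wc -l < /proc/net/tcp                # open sockets, plus a header line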
>>>
>>> The problem seems to have started midday yesterday - see the monitoring
>>> data here:
>>>
>>>
>>> https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&refresh=5m&var-namespace=tool-author-disambiguator
>>>
>>> with the surge in 4xx and 5xx status codes on 1/3. (By the way, I don't
>>> see the surge in 4xx status codes in access.log recently either - there are
>>> two from this morning and none yesterday, nothing like the multiple per
>>> second indicated in that Grafana chart!)
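>>>
>>> (For what it's worth, that tally comes from something like this, assuming
>>> the usual combined log format with the status code in field 9:)
>>>
>>>     $ awk '{print $9}' access.log | sort | uniq -c | sort -rn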
>>>
>>> Any ideas what's going on? This looks like some sort of upstream issue
>>> with nginx maybe?
>>>
>>> I am seeing a "You have run out of local ports" error in the error logs
>>> from earlier today (but it hasn't repeated recently), which may be a clue?
>>> I don't think that could possibly be from anything my service is doing
>>> though!
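>>>
>>> (If it happens again, something like this from inside the pod should show
>>> whether the ephemeral port range is really exhausted - again using only
>>> /proc:)
>>>
>>>     $ cat /proc/sys/net/ipv4/ip_local_port_range   # the usable local port range
>>>     $ wc -l < /proc/net/tcp                        # TCP sockets currently tracked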
>>>
>>> Help would be greatly appreciated, thanks!
>>>
>>>    Arthur Smith
>>
_______________________________________________
Wikimedia Cloud Services mailing list
[email protected] (formerly [email protected])
https://lists.wikimedia.org/mailman/listinfo/cloud
