[Wikidata-bugs] [Maniphest] T274270: WDQS servers taking up to 30 minutes to reboot

2022-03-31 Thread bking
bking added a comment. Correction: both MDRAID and LVM servers have this problem. Both services' systemd unit files have the same "Conflicts=shutdown.target" directive. Still haven't tried the systemd workaround though, will test that today. TASK DETAIL https://phabricator.wik
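
One way to confirm which unit files carry the directive mentioned above is to scan their text for `Conflicts=` entries. The following is only a minimal sketch: the function names and the sample unit text are hypothetical, the parser ignores systemd details such as line continuations and drop-in overrides, and it does not reflect the actual unit files on the wdqs hosts.

```python
from typing import List


def conflicts_targets(unit_text: str) -> List[str]:
    """Collect every unit named in Conflicts= lines of a unit file's text."""
    targets: List[str] = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip comments and anything that is not a key=value assignment.
        if line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "Conflicts":
            # A single Conflicts= line may list several space-separated units.
            targets.extend(value.split())
    return targets


def conflicts_with_shutdown(unit_text: str) -> bool:
    """Return True if the unit text declares Conflicts=shutdown.target."""
    return "shutdown.target" in conflicts_targets(unit_text)


# Hypothetical unit text, for illustration only.
SAMPLE_UNIT = """\
[Unit]
Description=Example service
Conflicts=shutdown.target umount.target
# Conflicts=ignored.target
"""

if __name__ == "__main__":
    print(conflicts_targets(SAMPLE_UNIT))
    print(conflicts_with_shutdown(SAMPLE_UNIT))
```

In practice the text could come from `systemctl cat <unit>`, with the unit name supplied by the operator.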

[Wikidata-bugs] [Maniphest] T274270: WDQS servers taking up to 30 minutes to reboot

2022-03-30 Thread bking
bking added a comment. Another piece of the puzzle: some wdqs hosts use MDRAID for their /srv partition, while others use LVM <https://phabricator.wikimedia.org/P23901>. The working assumption is that only the LVM hosts will take forever to reboot. TASK DETAIL https://phabricator.wikimed

[Wikidata-bugs] [Maniphest] T274270: WDQS servers taking up to 30 minutes to reboot

2022-03-29 Thread bking
bking added a comment. Actions tried so far: disabling swap via systemd before rebooting. Worked on `wdqs2007`, did not work on `wdqs2002`. Also worth noting is that we had previously rebooted `wdqs2007` within the last 30 minutes, so a minor kernel update (from 4.19.0-16-amd64 to 4.19.0-20

[Wikidata-bugs] [Maniphest] T274270: WDQS servers taking up to 30 minutes to reboot

2022-03-29 Thread bking
bking added a comment. This is still happening. @RKemper found some interesting links that could explain this behavior: https://wiki.freedesktop.org/www/Software/systemd/Debugging/#diagnosingshutdownproblems https://old.reddit.com/r/archlinux/comments/ba3zec

[Wikidata-bugs] [Maniphest] T242453: Detect and alert and/or remediate Blazegraph deadlocks

2022-03-29 Thread bking
bking added a comment. Per conversation with dcausse, we could potentially run jstack on a timer and grep the output for errors as shown above, then alert and/or remediate. TASK DETAIL https://phabricator.wikimedia.org/T242453 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings
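
The jstack-and-grep idea above could be sketched roughly as follows. This is an illustration under stated assumptions, not the approach actually deployed. The patterns match the summary lines HotSpot's `jstack` prints when it detects a Java-level deadlock, which may differ from the errors referenced in the task. The function names are hypothetical, and scheduling (for example via a systemd timer) and the alert or remediation step are left out.

```python
import re
import subprocess
from typing import List

# Assumed markers: lines jstack prints when it detects a Java-level deadlock.
# The errors referenced in the task may need different patterns.
DEADLOCK_PATTERNS = [
    re.compile(r"Found one Java-level deadlock"),
    re.compile(r"Found \d+ deadlocks?"),
]


def find_deadlock_lines(jstack_output: str) -> List[str]:
    """Return the lines of a thread dump that match any deadlock pattern."""
    return [
        line.strip()
        for line in jstack_output.splitlines()
        if any(p.search(line) for p in DEADLOCK_PATTERNS)
    ]


def check_pid(pid: int, timeout_s: int = 60) -> List[str]:
    """Run jstack against a JVM process and return matching lines, if any."""
    result = subprocess.run(
        ["jstack", str(pid)],
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return find_deadlock_lines(result.stdout)
```

A caller on a timer would invoke `check_pid` with the Blazegraph JVM's PID and alert or remediate when the returned list is non-empty.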

[Wikidata-bugs] [Maniphest] T242453: Detect and alert and/or remediate Blazegraph deadlocks

2022-03-29 Thread bking
bking renamed this task from "Deadlock in blazegraph blocking all queries and updates" to "Detect and alert and/or remediate Blazegraph deadlocks". TASK DETAIL https://phabricator.wikimedia.org/T242453 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/pa

[Wikidata-bugs] [Maniphest] T302494: The WDQS Streaming Updater should use S3 to access thanos-swift instead of the native swift protocol

2022-03-14 Thread bking
bking added a comment. Per messages above, we have completely failed over the wdqs and wdqs-internal services from eqiad to codfw. TASK DETAIL https://phabricator.wikimedia.org/T302494 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: RKemper

[Wikidata-bugs] [Maniphest] T303134: Should wdqs LVS checks page

2022-03-14 Thread bking
bking claimed this task. TASK DETAIL https://phabricator.wikimedia.org/T303134 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: bking Cc: jbond, Aklapper, Astuthiodit_1, karapayneWMDE, Invadibot, MPhamWMF, maantietaja, CBogen, Akuckartz, Nandana

[Wikidata-bugs] [Maniphest] T301953: Investigate wdqs1013 stability issues

2022-03-14 Thread bking
bking added a comment. Suggestions: - Data reload - Server reimage - Hardware tests - Close observation over a limited time TASK DETAIL https://phabricator.wikimedia.org/T301953 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: bking Cc

[Wikidata-bugs] [Maniphest] T301953: Investigate wdqs1013 stability issues

2022-03-14 Thread bking
bking claimed this task. TASK DETAIL https://phabricator.wikimedia.org/T301953 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: bking Cc: bking, Aklapper, Zbyszko, Astuthiodit_1, karapayneWMDE, Invadibot, MPhamWMF, maantietaja, CBogen, Akuckartz

[Wikidata-bugs] [Maniphest] T293862: Investigate using jvmquake to limit the time a JVM is unusable due to GC overhead

2022-03-14 Thread bking
bking added a comment. Manually installed on wdqs1010 TASK DETAIL https://phabricator.wikimedia.org/T293862 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: dcausse, bking Cc: bking, Aklapper, dcausse, Astuthiodit_1, karapayneWMDE, Invadibot

[Wikidata-bugs] [Maniphest] T296470: Initialize WCQS production servers

2022-01-11 Thread bking
bking added a comment. Started data load via tmux session on cumin1001 at ~ `Tue Jan 11 16:53:46 2022` . Expected to take at least 24 hours. TASK DETAIL https://phabricator.wikimedia.org/T296470 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences

[Wikidata-bugs] [Maniphest] T298525: Tune "BlazegraphFreeAllocatorsDecreasingRapidly" alerts

2022-01-04 Thread bking
bking added a comment. Related commits here <https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+log/refs/heads/master/team-search-platform/blazegraph.yaml> TASK DETAIL https://phabricator.wikimedia.org/T298525 EMAIL PREFERENCES https://phabricator.wikimedia.org/se

[Wikidata-bugs] [Maniphest] T298525: Tune "BlazegraphFreeAllocatorsDecreasingRapidly" alerts

2022-01-04 Thread bking
bking renamed this task from "Tune "BlazegraphFreeAllocatorsDecreasingRapidly"" to "Tune "BlazegraphFreeAllocatorsDecreasingRapidly" alerts". TASK DETAIL https://phabricator.wikimedia.org/T298525 EMAIL PREFERENCES https://phabricator.wikimedia.org/set

[Wikidata-bugs] [Maniphest] T298525: Tune "BlazegraphFreeAllocatorsDecreasingRapidly"

2022-01-04 Thread bking
bking added a subscriber: dcausse. bking added a comment. More context from @dcausse : The alert is managed by Alertmanager, code stored in Gerrit <https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-search-platform/blazegraph.yaml>
