As you probably noticed, https://wiki.documentfoundation.org had major hiccups on Tuesday between 05:30 UTC and 12:00 UTC, which we were not able to fix live; we therefore took it offline for unplanned maintenance, and finally brought it back at 17:00 UTC. We apologize for the inconvenience, and we thank all of you for your patience during this time.
Those lurking on the #tdf-infra IRC channel could see what was going on; if you're interested in the technical details behind the summary above, see below.

https://wiki.documentfoundation.org (and also https://help.libreoffice.org) were running MediaWiki 1.29.2 in a Debian 8 VM. MediaWiki's 1.29 branch will be End-of-Life by July 1st, so we had to upgrade before that, as mentioned in the last infra call minutes [0].

At the end of last week the wiki started showing strange symptoms, with the PHP FPM workers randomly idling and refusing to take requests, or SIGSEGV'ing. Since we were preparing an upgrade of the whole stack (MW 1.29 → 1.31, PHP 5.6 → 7.0, MySQL 5.5 → MariaDB 10.1, nginx 1.6 → 1.10, Debian 8 → 9) we didn't spend a lot of time investigating the issue with the production stack: restarting php5-fpm daily or so was enough to keep it working for about a day.

On Mon Jun 11 we changed the wiki's authentication method from username/password to Single Sign-On [0], but we don't think it's related to that issue in any way. (For one, the problem started over a week later.) While we now have an idea what the problem was, we don't understand why it suddenly started showing up last week: the last MW upgrade was performed on Nov 14 last year, and no Debian package had been upgraded since Jun 11 (in particular, the last PHP FPM upgrade was performed on Jan 9).

We finished upgrading and testing our test instance this weekend. It runs a database snapshot of https://wiki.documentfoundation.org dating from about 1 year ago, but other than that the OS and MediaWiki configuration are identical to the production instance. We had to do some minor tuning but things mostly looked fine, and I scheduled the upgrade of the production instance for the Mon → Tue European night.
Initially the plan was to upgrade both the OS and MediaWiki (first to 1.30, then to 1.31) the same night, but it was past 04:30 UTC when https://wiki.documentfoundation.org and https://help.libreoffice.org were upgraded to 1.30, and as the OS upgrade causes high I/O load and a short downtime, I deferred it to the next European night to affect as few users as possible.

The MediaWiki instances were still running fine shortly after 05:00 UTC. We get a lot of visits from Europe however, and with the European office hours starting, the load quickly started to rise, as did CPU and memory usage, and finally brought https://wiki.documentfoundation.org to its knees. Curiously, https://help.libreoffice.org was not affected at all, although it was running on the very same VM, in the same PHP FPM pool, has the exact same MediaWiki code (and mostly similar configuration), and gets 10x more visits than the TDF wiki…

Throwing more CPU and RAM at the VM didn't help solve the issue. After spending a couple of hours(!) diagnosing it, doing a lot of speculation, tests and debugging, we decided to bring the instance down to relieve the VM of some load, and perform the due OS upgrade.

The OS upgrade (Debian 8 → 9), as well as the MediaWiki upgrade (1.30 → 1.31), were finished at 14:00 UTC, but unfortunately that didn't help. Worse, every single request was now causing the PHP FPM worker to max out CPU and ramp up memory-wise. And meanwhile the help wiki was still running glitchlessly on the brand new PHP 7 / MariaDB stack, which we couldn't explain.

Of course, we did try to reduce the delta between the two MediaWiki instances; but removing all extensions and configuration options that were not in common didn't help, either. Studying the trace of a PHP FPM child that had gone wild, we saw, a bit by chance, that it was choking on getting the parent category of links; possibly something related to a reported MediaWiki bug [1].
So we tried to disable all options related to Categories, and… bingo!

Of course, during that whole time it was always an option to restore the database from backup and re-deploy MediaWiki 1.29 (in read-only mode to avoid divergence). Had the problem persisted until the evening we would have done that before the night, but during daytime we were all busy debugging and investigating, and deploying another instance would have meant allocating fewer resources to fixing the problem.

As to why this was affecting the production instance but not the test instance, the culprit might be the number of requests, as towards the end of the afternoon the situation looked very bleak (a single request to the PHP FPM pool was maxing out its worker) while in the early morning clients that were lucky enough to get an idling FPM child were able to receive a response. On the other hand, it's still mysterious why https://help.libreoffice.org was not affected. And so is the fact that our production instance suddenly started having hiccups last week, a bit out of the blue.

All in all, it was a tough day for everyone… We now hope you'll enjoy the new wiki and the stack beneath it! On the positive side, it's now eating fewer resources than before, and is more responsive; thus hopefully providing a better experience for front-end users as well as for the infra team :-)

-- 
Guilhem, on behalf of the TDF infra team.

[0] https://listarchives.libreoffice.org/global/website/msg15105.html
[1] https://phabricator.wikimedia.org/T165099
