(resending from the correct email account)

tl;dr:


An issue with our storage backend caused many VMs to freeze intermittently over the last 12 hours or so. This affected Toolforge as well as many Cloud VPS projects.

Everything should be recovered now, but if you find a misbehaving VM, a hard reboot should resolve the issue.


Longer version:

Yesterday I began upgrading many of our Ceph storage nodes from Debian 11 Bullseye to Debian 12 Bookworm. A not-yet-understood interaction between our Ceph version (16.2.15) and Debian Bookworm produced runaway memory usage on the upgraded servers. After a few hours they began to swap, and the Ceph services began to freeze intermittently.

Ceph is resilient to failures like this, but in some cases multiple Ceph services (which would normally have served as backups for each other) froze at the same time. During partial storage failures Ceph prioritizes data integrity over availability, and so it began to make some storage blocks unavailable and/or read-only. That erratic storage behavior in turn caused VMs to crash or (more often) temporarily lock up.

We are now reverting the Bookworm upgrade, and will bypass the broken combination of Bookworm and Ceph 16.x in future upgrades. Not all OSD nodes have been rebuilt with Bullseye yet, but we have rebuilt enough that Ceph should be able to cope with service failures on the servers that are still pending rebuild.

As far as I can tell, most or all VMs have recovered on their own, and we don't see any evidence of data corruption. If you find unresponsive VMs, a 'hard reboot' from the Horizon UI should resolve any remaining issues. Please reply to this email or contact me on IRC if you find data corruption or VMs that cannot be revived.

Full details of this incident can be found in this Phabricator ticket: https://phabricator.wikimedia.org/T399281

Thanks to Francesco, Ben Tullis, and Alexandros Kosiaris for their speedy assistance in resolving this issue!

- Andrew


_______________________________________________
Cloud-announce mailing list -- cloud-announce@lists.wikimedia.org
List information: 
https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.org/