(resending from the correct email account)
tl;dr:
An issue with our storage backend caused many VMs to freeze
intermittently over the last 12 hours or so. This affected Toolforge as
well as many Cloud VPS projects.
Everything should be recovered now, but if you find a misbehaving VM, a
hard reboot should resolve the issue.
Longer version:
Yesterday I began the process of upgrading many of our Ceph storage
nodes from Debian 11 Bullseye to Debian 12 Bookworm. A
not-yet-understood interaction between our Ceph version (16.2.15) and
Debian Bookworm produced runaway memory usage on the upgraded servers.
After a few hours they began to swap, and the Ceph services began to
freeze intermittently.
Ceph is resilient to failures like this, but in some cases multiple Ceph
services (which would normally have served as backups for each other)
froze at the same time. During partial storage failures Ceph prioritizes
data integrity over availability, and so began to make some storage
blocks unavailable and/or read-only. That erratic storage behavior in
turn caused VMs to crash or (more often) temporarily lock up.
We are now reverting the Bookworm upgrade, and will bypass the broken
combination of Bookworm and Ceph 16.x in future upgrades. Not all OSD
nodes have been rebuilt with Bullseye yet, but we have rebuilt enough
that Ceph should be able to cope with service failures on the servers
that are still pending rebuild.
As far as I can tell, most or all VMs have recovered on their own, and
we don't see any evidence of data corruption. If you find unresponsive
VMs, a 'hard reboot' from the Horizon UI should resolve any remaining
issues. Please reply to this email or contact me on IRC if you find
data corruption or VMs that cannot be revived.
Full details of this incident can be found in this Phabricator ticket:
https://phabricator.wikimedia.org/T399281
Thanks to Francesco, Ben Tullis, and Alexandros Kosiaris for their
speedy assistance in resolving this issue!
- Andrew
_______________________________________________
Cloud-announce mailing list -- cloud-announce@lists.wikimedia.org
List information:
https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.org/