Hi, we are currently facing an issue where OSDs suddenly refuse to start after their containers come up with the new version.
This seems to affect only some hosts/OSDs. The first 30 OSDs came up, but took very long (about 5 hours); after that, every single OSD needed a host reboot to bring the disk back up so the update could continue. We stopped after 6 attempts. One disk never came back at all; we removed and zapped that OSD, the orchestrator picked up the now-available disk and recreated it, and the new OSD came up within seconds.

We run around 90 clusters and this happened on only a single one; all the others updated within two hours without any issues. The affected cluster uses 8 TB HDDs with block.db on SSD (5 block.db per SSD), and all OSDs are encrypted.

What we see on the affected hosts:

- /var/log/ceph/UUID/ceph-volume.log gets hammered with output from udevadm, lsblk and nsenter.
- The activation container (ceph-UUID-osd-N-activate) gets killed after a couple of minutes.
- The block and block.db links in /var/lib/ceph/UUID/osd.N/ do not appear to be set correctly (see the quick check sketched at the end of this mail).
- When we restart a daemon that previously needed a host reboot, the OSD does not come up again until the host is rebooted once more.

Does anyone have ideas on how to debug this further?
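For reference, this is roughly how we check whether the links resolve on an affected host. It is only a minimal sketch that assumes the standard cephadm path layout under /var/lib/ceph/<fsid>/osd.<id>/; the FSID and OSD_ID values below are placeholders, not our real ones.

#!/usr/bin/env python3
# Minimal sketch: verify that the block/block.db links of one OSD
# resolve to real block devices. FSID and OSD_ID are placeholders.
import os
import stat

FSID = "00000000-0000-0000-0000-000000000000"  # placeholder: cluster fsid
OSD_ID = 0                                     # placeholder: OSD id

osd_dir = f"/var/lib/ceph/{FSID}/osd.{OSD_ID}"

for name in ("block", "block.db"):
    link = os.path.join(osd_dir, name)
    if not os.path.islink(link):
        print(f"{link}: missing or not a symlink")
        continue
    target = os.path.realpath(link)
    try:
        mode = os.stat(target).st_mode
    except OSError as exc:
        print(f"{link} -> {target}: target not accessible ({exc})")
        continue
    if stat.S_ISBLK(mode):
        print(f"{link} -> {target}: OK (block device)")
    else:
        print(f"{link} -> {target}: NOT a block device")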
