Could be the kind of hardware you are using. Is it different from the other clusters' hardware?

Send us logs, so we can help you out.

Example:

journalctl -eu [email protected]
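
If it is a cephadm deployment, something along these lines should also pull the per-daemon log (the fsid and OSD id below are placeholders, adjust them to your cluster):

cephadm logs --fsid <fsid> --name osd.7

or directly from the systemd unit, e.g.:

journalctl -u ceph-<fsid>@osd.7.service --since "-2h"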

Best,
Malte

On 1/27/26 11:55, Boris via ceph-users wrote:
Hi,
we are currently facing an issue where suddenly none of the OSDs will start
once their containers are restarted with the new version.

This seems to be an issue with some hosts/OSDs. The first 30 OSDs worked,
but took a very long time (around 5 hours), and then every single OSD after
that needed a host reboot to bring the disk back up and continue the update.

We stopped after six tries.

And one disk never came back up. We removed and zapped the OSD. The
orchestrator picked the available disk and recreated it. It came up within
seconds.
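
For reference, the removal and re-creation was roughly along these lines (the OSD id is a placeholder, and the --zap flag needs a reasonably recent release):

ceph orch osd rm 42 --zap

After that the orchestrator picked the clean disk up again on its own.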

We have around 90 clusters and this happened only on a single one. All the
others updated within two hours without any issues.

The cluster uses HDDs (8TB) with the block.db on SSD (5 block.db per SSD).
The file /var/log/ceph/UUID/ceph-volume.log gets hammered with a lot of
output from udevadm, lsblk and nsenter.
The activation container (ceph-UUID-osd-N-activate) gets killed after a
couple of minutes.
It also looks like the block and block.db links
in /var/lib/ceph/UUID/osd.N/ are not set correctly.
When we restart one of the OSD daemons on a host that already needed a
reboot, the OSD again doesn't come up until we reboot the host.
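
This is roughly how we check the links and the corresponding unit (UUID and N stand for the cluster fsid and the OSD id, as above):

ls -l /var/lib/ceph/UUID/osd.N/block /var/lib/ceph/UUID/osd.N/block.db

systemctl status ceph-UUID@osd.N.service

journalctl -eu ceph-UUID@osd.N.service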

All OSDs are encrypted.

Does anyone have ideas on how to debug this further?
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
