Hi Boris,
I had a similar problem in the past (I don't remember which Ceph release
it was) where the block.db links in /var/lib/ceph/<UUID>/osd.N weren't set
correctly by the ceph-volume activate call in the container's unit.run
script.
The problem was that I had the block.db on an LVM raid1 mirror on SSDs (so
a failure of a single SSD doesn't take down multiple OSDs), and the
"ceph-volume activate" call couldn't handle this case properly: it picked
one of the entries whose lv_role is private instead of the public one.
For example:
LV                             Devices                                             Role
ceph-db-osd1                   ceph-db-osd1_rimage_0(0),ceph-db-osd1_rimage_1(0)   public
[ceph-db-osd1_rimage_0]        ceph-db-osd1_rimage_0_iorig(0)                      private,raid,image
[ceph-db-osd1_rimage_0_imeta]  /dev/sdg(8314)                                      private,integrity,metadata
[ceph-db-osd1_rimage_0_iorig]  /dev/sdg(9216)                                      private,integrity,origin,integrityorigin
[ceph-db-osd1_rimage_0_iorig]  /dev/sdg(82518)                                     private,integrity,origin,integrityorigin
[ceph-db-osd1_rimage_0_iorig]  /dev/sdg(55297)                                     private,integrity,origin,integrityorigin
[ceph-db-osd1_rimage_0_iorig]  /dev/sdg(59888)                                     private,integrity,origin,integrityorigin
[ceph-db-osd1_rimage_0_iorig]  /dev/sdg(62448)                                     private,integrity,origin,integrityorigin
[ceph-db-osd1_rimage_1]        ceph-db-osd1_rimage_1_iorig(0)                      private,raid,image
[ceph-db-osd1_rimage_1_imeta]  /dev/sdh(91281)                                     private,integrity,metadata
[ceph-db-osd1_rimage_1_iorig]  /dev/sdh(1)                                         private,integrity,origin,integrityorigin
[ceph-db-osd1_rimage_1_iorig]  /dev/sdh(84486)                                     private,integrity,origin,integrityorigin
[ceph-db-osd1_rimage_1_iorig]  /dev/sdh(89094)                                     private,integrity,origin,integrityorigin
[ceph-db-osd1_rmeta_0]         /dev/sdg(46080)                                     private,raid,metadata
[ceph-db-osd1_rmeta_1]         /dev/sdh(0)                                         private,raid,metadata
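(That listing came from something along the lines of the command below; I
don't remember the exact field list I used back then, so treat it as a
sketch.)

lvs -a -o lv_name,devices,lv_role
# The block.db symlink should end up on the top-level LV with role "public",
# not on one of the private _rimage_*/_rmeta_* sub-LVs of the raid1 mirror.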
I think I fixed it by manually changing the ceph-volume activate call in
the unit.run script to "ceph-volume lvm activate <ID> <FSID>".
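If you end up doing the same, roughly this (ID and FSID are placeholders
for the affected OSD; run it wherever your ceph-volume lives, e.g. inside
the cephadm shell):

ceph-volume lvm list                    # shows the osd id and osd fsid per OSD
ceph-volume lvm activate <ID> <FSID>    # activate via the lvm subcommand, which
                                        # resolved the right (public) LV for me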
Luckily my cluster doesn't have that many OSDs.
With the Ceph Tentacle release I no longer had to manually change the
unit.run call.
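In your case I'd first check where the links actually point, e.g. (assuming
the usual cephadm layout, with <UUID> and N filled in for your cluster):

ls -l /var/lib/ceph/<UUID>/osd.N/block /var/lib/ceph/<UUID>/osd.N/block.db
# block.db should resolve to the LV that actually carries the DB,
# not to a private sub-LV or some other wrong device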
Cheers,
Reto
On Tue, 27 Jan 2026 at 11:55, Boris via ceph-users <
[email protected]> wrote:
> Hi,
> we are currently facing an issue where suddenly none of the OSDs will start
> after the containers are started with the new version.
>
> This seems to be an issue with some hosts/OSDs. The first 30 OSDs worked,
> but took really long (like 5 hours) and then every single OSD after that
> needed a host reboot to bring the disk back up and continue the update.
>
> We've stopped after 6 tries.
>
> And one disk never came back up. We removed and zapped the OSD. The
> orchestrator picked the available disk and recreated it. It came up within
> seconds.
>
> We have around 90 clusters and this happened only on a single one. All
> others updated within two hours without any issues.
>
> The cluster uses HDDs (8TB) with the block.db on SSD (5 block.db per SSD).
> The file /var/log/ceph/UUID/ceph-volume.log gets hammered with a lot of
> output from udevadm, lsblk and nsenter.
> The activation container (ceph-UUID-osd-N-activate) gets killed after a
> couple of minutes.
> It also looks like the block and block.db links
> in /var/lib/ceph/UUID/osd.N/ are not correctly set.
> When we restart one of the daemons that needed a host reboot, the OSD again
> doesn't come up without another host reboot.
>
> All OSDs are encrypted.
>
> Does anyone have some ideas on how to debug this further?
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]