Hi Boris,
I had a similar problem in the past (I don't remember which Ceph release
it was) where the block.db links in /var/lib/ceph/<UUID>/osd.N weren't set
correctly by the ceph-volume activate call in the container's unit.run
script.
The problem was that I had the block.db on an LVM raid1 mirror on SSDs (so
a failure of a single SSD doesn't take down multiple OSDs), and the
"ceph-volume activate" call couldn't handle this case properly: it picked
one of the entries whose lv_role is private instead of the public one.
For example:
LV                             Devices                                             Role
ceph-db-osd1                   ceph-db-osd1_rimage_0(0),ceph-db-osd1_rimage_1(0)   public
[ceph-db-osd1_rimage_0]        ceph-db-osd1_rimage_0_iorig(0)                      private,raid,image
[ceph-db-osd1_rimage_0_imeta]  /dev/sdg(8314)                                      private,integrity,metadata
[ceph-db-osd1_rimage_0_iorig]  /dev/sdg(9216)                                      private,integrity,origin,integrityorigin
[ceph-db-osd1_rimage_0_iorig]  /dev/sdg(82518)                                     private,integrity,origin,integrityorigin
[ceph-db-osd1_rimage_0_iorig]  /dev/sdg(55297)                                     private,integrity,origin,integrityorigin
[ceph-db-osd1_rimage_0_iorig]  /dev/sdg(59888)                                     private,integrity,origin,integrityorigin
[ceph-db-osd1_rimage_0_iorig]  /dev/sdg(62448)                                     private,integrity,origin,integrityorigin
[ceph-db-osd1_rimage_1]        ceph-db-osd1_rimage_1_iorig(0)                      private,raid,image
[ceph-db-osd1_rimage_1_imeta]  /dev/sdh(91281)                                     private,integrity,metadata
[ceph-db-osd1_rimage_1_iorig]  /dev/sdh(1)                                         private,integrity,origin,integrityorigin
[ceph-db-osd1_rimage_1_iorig]  /dev/sdh(84486)                                     private,integrity,origin,integrityorigin
[ceph-db-osd1_rimage_1_iorig]  /dev/sdh(89094)                                     private,integrity,origin,integrityorigin
[ceph-db-osd1_rmeta_0]         /dev/sdg(46080)                                     private,raid,metadata
[ceph-db-osd1_rmeta_1]         /dev/sdh(0)                                         private,raid,metadata
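(That listing came from something along the lines of the command below; I
don't remember the exact field list I used back then, so treat it as a
sketch.)

lvs -a -o lv_name,devices,lv_role
# The block.db symlink should end up on the top-level LV with role "public",
# not on one of the private _rimage_*/_rmeta_* sub-LVs of the raid1 mirror.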
I think I fixed it by manually changing the ceph-volume activate call in
the unit.run script to "ceph-volume lvm activate <ID> <FSID>".
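If you end up doing the same, roughly this (ID and FSID are placeholders
for the affected OSD; run it wherever your ceph-volume lives, e.g. inside
the cephadm shell):

ceph-volume lvm list                    # shows the osd id and osd fsid per OSD
ceph-volume lvm activate <ID> <FSID>    # activate via the lvm subcommand, which
                                        # resolved the right (public) LV for me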
Luckily my cluster doesn't have that many OSDs.
With the Ceph Tentacle release I no longer had to manually change the
unit.run call.
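In your case I'd first check where the links actually point, e.g. (assuming
the usual cephadm layout, with <UUID> and N filled in for your cluster):

ls -l /var/lib/ceph/<UUID>/osd.N/block /var/lib/ceph/<UUID>/osd.N/block.db
# block.db should resolve to the LV that actually carries the DB,
# not to a private sub-LV or some other wrong device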
Cheers,
Reto
On Tue, 27 Jan 2026 at 11:55, Boris via ceph-users <
[email protected]> wrote:
> Hi,
> we are currently facing an issue where suddenly none of the OSDs will start
> after the containers are started with the new version.
>
> This seems to be an issue with some hosts/OSDs. The first 30 OSDs worked,
> but took really long (like 5 hours) and then every single OSD after that
> needed a host reboot to bring the disk back up and continue the update.
>
> We've stopped after 6 tries.
>
> And one disk never came back up. We removed and zapped the OSD. The
> orchestrator picked the available disk and recreated it. It came up within
> seconds.
>
> We have around 90 clusters and this happened only on a single one. All
> others updated within two hours without any issues.
>
> The cluster uses HDDs (8TB) with the block.db on SSD (5 block.db per SSD).
> The file /var/log/ceph/UUID/ceph-volume.log gets hammered with a lot of
> output from udevadm, lsblk and nsenter.
> The activation container (ceph-UUID-osd-N-activate) gets killed after a
> couple of minutes.
> It also looks like the block and block.db links
> in /var/lib/ceph/UUID/osd.N/ are not correctly set.
> When we restart one of the daemons that needed a host reboot, the OSD again
> doesn't come up without another host reboot.
>
> All OSDs are encrypted.
>
> Does anyone have some ideas on how to debug this further?
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]