All disks are 8TB HDDs. We have some 16TB HDDs, but those are all in the
newest host, which updated just fine.
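(In case a per-host cross-check of the drive sizes is useful: two generic
commands for that, with a placeholder OSD id, would be

lsblk -d -o NAME,SIZE,ROTA,MODEL
ceph osd metadata <osd-id>

The first lists the raw disks on a host, the second shows what the OSD
itself reports about its backing devices.)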

On Tue, 27 Jan 2026 at 12:39, Malte Stroem <
[email protected]> wrote:

>  From what I can see, the OSD is running. It's starting up and still
> needs time.
>
> How big is the disk?
>
> On 1/27/26 12:31, Boris via ceph-users wrote:
> > Sure: https://pastebin.com/9RLzyUQs
> >
> > I've trimmed the log a little bit (removed peering, epoch, trim, and so
> > on).
> > This is the last OSD that we tried that did not work.
> >
> > We tried another host, where the upgrade went through without problems,
> > but that host also has the newest hardware.
> > Still, we don't think it is a hardware issue, because the first 30 OSDs
> > were on the two oldest hosts, and the first OSD that failed was on the
> > same host as the last OSD that did not fail.
> >
> >
> >
> > On Tue, 27 Jan 2026 at 12:05, Malte Stroem <
> > [email protected]> wrote:
> >
> >> It could be the kind of hardware you are using. Is it different from
> >> the other clusters' hardware?
> >>
> >> Send us logs, so we can help you out.
> >>
> >> Example:
> >>
> >> journalctl -eu [email protected]
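> >>
> >> If cephadm manages the daemons, this should pull up the same unit logs
> >> (osd.21 is just the example id from above):
> >>
> >> cephadm logs --name osd.21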
> >>
> >> Best,
> >> Malte
> >>
> >> On 1/27/26 11:55, Boris via ceph-users wrote:
> >>> Hi,
> >>> we are currently facing an issue where suddenly none of the OSDs will
> >>> start after their containers are restarted with the new version.
> >>>
> >>> This seems to be an issue with some hosts/OSDs. The first 30 OSDs
> >>> worked, but took really long (around 5 hours), and then every single
> >>> OSD after that needed a host reboot to bring the disk back up and
> >>> continue the update.
> >>>
> >>> We've stopped after 6 tries.
> >>>
> >>> And one disk never came back up. We removed and zapped the OSD. The
> >>> orchestrator picked the available disk and recreated it. It came up
> >>> within seconds.
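> >>>
> >>> (For reference, the generic cephadm-style commands for such a removal;
> >>> the OSD id is a placeholder and the exact flags depend on the release,
> >>> so this is a sketch rather than a transcript of what we ran:
> >>>
> >>> ceph orch osd rm <id> --zap
> >>> ceph orch osd rm status
> >>>
> >>> Once the removal finishes, the orchestrator re-deploys an OSD on the
> >>> freed disk, as described above.)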
> >>>
> >>> We have around 90 clusters and this happened only on a single one.
> >>> All the others updated within two hours without any issues.
> >>>
> >>> The cluster uses HDDs (8TB) with the block.db on SSD (five block.db
> >>> volumes per SSD).
> >>> The file /var/log/ceph/UUID/ceph-volume.log gets hammered with a lot
> >>> of output from udevadm, lsblk and nsenter.
> >>> The activation container (ceph-UUID-osd-N-activate) gets killed after a
> >>> couple of minutes.
> >>> It also looks like the block and block.db links
> >>> in /var/lib/ceph/UUID/osd.N/ are not set correctly.
> >>> When we restart one of the daemons that already needed a host reboot,
> >>> the OSD again doesn't come up until the host is rebooted.
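> >>>
> >>> (One way to inspect that mapping on an affected host, with fsid and
> >>> OSD id as placeholders, would be:
> >>>
> >>> ls -l /var/lib/ceph/UUID/osd.N/block /var/lib/ceph/UUID/osd.N/block.db
> >>> cephadm ceph-volume lvm list
> >>>
> >>> The links should point at the dm-crypt/LV devices that ceph-volume
> >>> lists for that OSD.)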
> >>>
> >>> All OSDs are encrypted.
> >>>
> >>> Does anyone have any ideas on how to debug this further?
> >>
> >>
> >
>
>

-- 
This time, as an exception, the "UTF-8 problems" self-help group meets in
the big hall.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
