I was thinking about the same bug you commented on, Boris:

https://tracker.ceph.com/issues/73107#change-331393

I am also subscribed to that bug, because we upgraded to 19.2.3 a couple
of months ago. But since we don't have that many OSDs per host, we
haven't seen the impact described in the tracker.

Boris via ceph-users <[email protected]> wrote on Wed, 28 Jan 2026 at 22:19:

> Hi Malte,
> we just raised the timeout in the service file to 720 seconds (a cron
> job re-applies it every minute).
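> For reference, an equivalent systemd drop-in could look roughly like
> this (a sketch only; the unit name matches the fsid shown further down):
>
> # /etc/systemd/system/[email protected]/override.conf
> [Service]
> TimeoutStartSec=720
>
> followed by a systemctl daemon-reload. The cron job merely re-applies
> the setting in case cephadm rewrites the unit file.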
> Starting the OSDs takes around 4 minutes. We still think this is an
> issue with ceph-volume activate, because that is what takes ages; as
> soon as cryptsetup has opened the LUKS devices, the rest proceeds
> normally.
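> To see where the time goes, the activation step can be timed in
> isolation; something along these lines should work (OSD id and fsid are
> placeholders):
>
> time cephadm ceph-volume -- lvm activate --no-systemd <osd-id> <osd-fsid>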
>
> ceph-volume log seems to reflect that:
> root@s3db15:~# ls -alh
> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log*
> -rw-r--r-- 1 root root 2.4G Jan 27 15:59
> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log
> -rw-r--r-- 1 root root 680M Jan 27 00:00
> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.1.gz
> -rw-r--r-- 1 root root 5.0M Jan 25 23:46
> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.2.gz
> -rw-r--r-- 1 root root 4.8M Jan 24 23:35
> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.3.gz
> -rw-r--r-- 1 root root 4.8M Jan 23 23:45
> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.4.gz
> -rw-r--r-- 1 root root 4.8M Jan 22 23:38
> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.5.gz
> -rw-r--r-- 1 root root 4.8M Jan 21 23:39
> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.6.gz
> -rw-r--r-- 1 root root 4.8M Jan 20 23:38
> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.7.gz
> ---
> # time systemctl restart [email protected]
>
> real 3m46.251s
> user 0m0.060s
> sys 0m0.063s
> ---
> I will prepare an output of the ceph-collect tool from 42on, but I need
> to remove the sensitive stuff like the rgw config first. I don't think
> it will give the right clues, though, as the output won't show what is
> happening during the start.
>
>
> On Tue, 27 Jan 2026 at 13:51, Malte Stroem <[email protected]> wrote:
>
> > We do not know a lot about your cluster, so it's hard to help.
> >
> > Give us all the information one needs.
> >
> > ceph -s, ceph orch host ls, ceph health detail: all the good things to
> > get an overview. For example:
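> >
> > ceph -s
> > ceph orch host ls
> > ceph health detail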
> >
> > On 1/27/26 12:43, Boris via ceph-users wrote:
> > > All disks are 8 TB HDDs. We have some 16 TB HDDs, but those are all
> > > in the newest host, which updated just fine.
> > >
> > > On Tue, 27 Jan 2026 at 12:39, Malte Stroem <[email protected]> wrote:
> > >
> > >> From what I can see, the OSD is running. It's starting up and still
> > >> needs time.
> > >>
> > >> How big is the disk?
> > >>
> > >> On 1/27/26 12:31, Boris via ceph-users wrote:
> > >>> Sure: https://pastebin.com/9RLzyUQs
> > >>>
> > >>> I've trimmed the log a little bit (removed peering, epoch, trim and
> > >>> so on).
> > >>> This is the last OSD that we tried that did not work.
> > >>>
> > >>> We tried another host, where the upgrade just went through, but that
> > >>> host also has the newest hardware. Still, we don't think it is a
> > >>> hardware issue, because the first 30 OSDs were on the two oldest
> > >>> hosts, and the first OSD that failed was on the same host as the last
> > >>> OSD that did not fail.
> > >>>
> > >>>
> > >>>
> > >>> On Tue, 27 Jan 2026 at 12:05, Malte Stroem <[email protected]> wrote:
> > >>>
> > >>>> Could be the kind of hardware you are using. Is it different from
> > >>>> the other clusters' hardware?
> > >>>>
> > >>>> Send us logs, so we can help you out.
> > >>>>
> > >>>> Example:
> > >>>>
> > >>>> journalctl -eu [email protected]
> > >>>>
> > >>>> Best,
> > >>>> Malte
> > >>>>
> > >>>> On 1/27/26 11:55, Boris via ceph-users wrote:
> > >>>>> Hi,
> > >>>>> we are currently facing an issue where suddenly none of the OSDs
> > >>>>> will start after the containers come up with the new version.
> > >>>>>
> > >>>>> This seems to be an issue with some hosts/OSDs. The first 30 OSDs
> > >>>>> worked, but took really long (around 5 hours), and every single OSD
> > >>>>> after that needed a host reboot to bring the disk back up and
> > >>>>> continue the update.
> > >>>>>
> > >>>>> We stopped after six tries.
> > >>>>>
> > >>>>> One disk never came back up at all. We removed and zapped the OSD,
> > >>>>> the orchestrator picked up the now-available disk and recreated it,
> > >>>>> and it came up within seconds.
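> > >>>>> Roughly, with the OSD id as a placeholder:
> > >>>>>
> > >>>>> ceph orch osd rm <id> --zap
> > >>>>> ceph orch osd rm status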
> > >>>>>
> > >>>>> We have around 90 clusters and this happened on only a single one.
> > >>>>> All the others updated within two hours without any issues.
> > >>>>>
> > >>>>> The cluster uses HDDs (8 TB) with the block.db on SSD (5 block.db
> > >>>>> devices per SSD).
> > >>>>> The file /var/log/ceph/UUID/ceph-volume.log gets hammered with a
> > >>>>> lot of output from udevadm, lsblk and nsenter.
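> > >>>>> A quick way to see how much of the log that is, with the real fsid
> > >>>>> in place of UUID:
> > >>>>>
> > >>>>> grep -cE 'udevadm|lsblk|nsenter' /var/log/ceph/UUID/ceph-volume.log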
> > >>>>> The activation container (ceph-UUID-osd-N-activate) gets killed
> > >>>>> after a couple of minutes.
> > >>>>> It also looks like the block and block.db links
> > >>>>> in /var/lib/ceph/UUID/osd.N/ are not correctly set.
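> > >>>>> For reference, the links can be checked with something like this
> > >>>>> (UUID and N as above; for encrypted OSDs we would expect them to
> > >>>>> point at /dev/mapper devices):
> > >>>>>
> > >>>>> ls -l /var/lib/ceph/UUID/osd.N/block /var/lib/ceph/UUID/osd.N/block.db
> > >>>>>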
> > >>>>> When we restart one of the daemons that previously needed a host
> > >>>>> reboot, the OSD again doesn't come up until the host is rebooted.
> > >>>>>
> > >>>>> All OSDs are encrypted.
> > >>>>>
> > >>>>> Does anyone have ideas on how to debug this further?
> > >>>>
> > >>>>
> > >>>
> > >>
> > >>
> > >
> >
> >
>
> --
> The self-help group "UTF-8 problems" will meet in the large hall this
> time, as an exception.
>
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
