Yes, it is crazy. You can check the /var/log/ceph/FSID/ceph-volume.log file
and see that the number of calls goes nuts. Faster, newer CPUs seem to handle
these calls okayish, but an old Intel(R) Xeon(R) Silver 4116 CPU @
2.10GHz struggles very hard :)
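
If you want to put a number on it, something like this gives a rough per-tool
call count from the log (a sketch; adjust the FSID path to your cluster):

grep -oE 'udevadm|lsblk|nsenter' \
    /var/log/ceph/FSID/ceph-volume.log | sort | uniq -c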

On Thu, 29 Jan 2026 at 16:06, Eugen Block <[email protected]> wrote:

> I was thinking about the same bug you commented on, Boris:
>
> https://tracker.ceph.com/issues/73107#change-331393
>
> I am also subscribed to that bug because we upgraded to 19.2.3 a couple of
> months ago. But since we don’t have that many OSDs per host, we haven’t
> seen the impact described in the tracker.
>
> Boris via ceph-users <[email protected]> wrote on Wed, 28 Jan 2026 at 22:19:
>
>> Hi Malte,
>> we just upped the timeout in the service file to 720 (a cronjob will
>> re-apply it every minute).
>> Starting the OSDs takes around 4 minutes. We still think this is an issue
>> with ceph-volume activate, because that is what takes ages. As soon as
>> cryptsetup has opened the LUKS devices, the rest proceeds normally.
>>
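>> For anyone wanting to do the same, a drop-in along these lines should
>> achieve it (a sketch; adjust the FSID, and note that cephadm may rewrite
>> the generated unit, which is why we re-apply it from cron):
>>
>> mkdir -p /etc/systemd/system/[email protected]/
>> printf '[Service]\nTimeoutStartSec=720\n' \
>>     > /etc/systemd/system/[email protected]/timeout.conf
>> systemctl daemon-reload
>>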
>> The ceph-volume log seems to reflect that:
>> root@s3db15:~# ls -alh
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log*
>> -rw-r--r-- 1 root root 2.4G Jan 27 15:59
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log
>> -rw-r--r-- 1 root root 680M Jan 27 00:00
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.1.gz
>> -rw-r--r-- 1 root root 5.0M Jan 25 23:46
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.2.gz
>> -rw-r--r-- 1 root root 4.8M Jan 24 23:35
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.3.gz
>> -rw-r--r-- 1 root root 4.8M Jan 23 23:45
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.4.gz
>> -rw-r--r-- 1 root root 4.8M Jan 22 23:38
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.5.gz
>> -rw-r--r-- 1 root root 4.8M Jan 21 23:39
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.6.gz
>> -rw-r--r-- 1 root root 4.8M Jan 20 23:38
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.7.gz
>> ---
>> # time systemctl restart [email protected]
>>
>> real 3m46.251s
>> user 0m0.060s
>> sys 0m0.063s
>> ---
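>> To see where the time goes during such a restart, tailing the activation
>> output helps (a sketch; run each in its own terminal):
>>
>> journalctl -fu [email protected]
>> tail -f /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log
>>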
>> I will prepare the output of the ceph-collect tool from 42on, but I need
>> to remove the sensitive stuff like the rgw config and so on first. I don't
>> think it will give the right clues, though, as the output won't show what
>> is happening during the start.
>>
>>
>> On Tue, 27 Jan 2026 at 13:51, Malte Stroem <[email protected]> wrote:
>>
>> > We do not know a lot about your cluster, so it's hard to help.
>> >
>> > Give us all the information one needs.
>> >
>> > ceph -s, ceph orch host ls, ceph health detail, all the good things to
>> > get an overview.
>> >
>> > On 1/27/26 12:43, Boris via ceph-users wrote:
>> > > All disks are 8TB HDDs. We have some 16TB HDDs, but those are all in
>> > > the newest host, which updated just fine.
>> > >
>> > > On Tue, 27 Jan 2026 at 12:39, Malte Stroem <[email protected]> wrote:
>> > >
>> > >> From what I can see, the OSD is running. It's starting up and still
>> > >> needs time.
>> > >>
>> > >> How big is the disk?
>> > >>
>> > >> On 1/27/26 12:31, Boris via ceph-users wrote:
>> > >>> Sure: https://pastebin.com/9RLzyUQs
>> > >>>
>> > >>> I've trimmed the log a little bit (removed peering, epoch, trim and
>> > >>> so on).
>> > >>> This is the last OSD that we tried that did not work.
>> > >>>
>> > >>> We tried another host, where the upgrade just went through, but that
>> > >>> host also has the newest hardware.
>> > >>> Still, we don't think it is a hardware issue, because the first 30
>> > >>> OSDs were on the two oldest hosts, and the first OSD that failed was
>> > >>> on the same host as the last OSD that did not fail.
>> > >>>
>> > >>>
>> > >>>
>> > >>> On Tue, 27 Jan 2026 at 12:05, Malte Stroem <[email protected]> wrote:
>> > >>>
>> > >>>> Could be the kind of hardware you are using. Is it different from
>> > >>>> the other clusters' hardware?
>> > >>>>
>> > >>>> Send us logs, so we can help you out.
>> > >>>>
>> > >>>> Example:
>> > >>>>
>> > >>>> journalctl -eu [email protected]
>> > >>>>
>> > >>>> Best,
>> > >>>> Malte
>> > >>>>
>> > >>>> On 1/27/26 11:55, Boris via ceph-users wrote:
>> > >>>>> Hi,
>> > >>>>> we are currently facing an issue where suddenly none of the OSDs
>> > >>>>> will start after their containers are restarted with the new
>> > >>>>> version.
>> > >>>>>
>> > >>>>> This seems to affect only some hosts/OSDs. The first 30 OSDs
>> > >>>>> worked, but took really long (around 5 hours), and every single
>> > >>>>> OSD after that needed a host reboot to bring the disk back up and
>> > >>>>> continue the update.
>> > >>>>>
>> > >>>>> We've stopped after 6 tries.
>> > >>>>>
>> > >>>>> And one disk never came back up. We removed and zapped that OSD;
>> > >>>>> the orchestrator picked up the now-available disk and recreated
>> > >>>>> it, and it came up within seconds.
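>> > >>>>> (For reference, the remove-and-zap was just the standard orch
>> > >>>>> flow, roughly: ceph orch osd rm OSD_ID --zap, then let the
>> > >>>>> orchestrator redeploy onto the freed disk.)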
>> > >>>>>
>> > >>>>> We have around 90 clusters, and this happened on only a single
>> > >>>>> one. All the others updated within two hours without any issues.
>> > >>>>>
>> > >>>>> The cluster uses 8TB HDDs with the block.db on SSD (5 block.dbs
>> > >>>>> per SSD).
>> > >>>>> The file /var/log/ceph/UUID/ceph-volume.log gets hammered with a
>> > >>>>> lot of output from udevadm, lsblk and nsenter.
>> > >>>>> The activation container (ceph-UUID-osd-N-activate) gets killed
>> > >>>>> after a couple of minutes.
>> > >>>>> It also looks like the block and block.db links
>> > >>>>> in /var/lib/ceph/UUID/osd.N/ are not set correctly.
>> > >>>>> When we restart one of the OSDs that previously needed a host
>> > >>>>> reboot, it again fails to come up until the host is rebooted.
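>> > >>>>> A quick check of those links (a sketch; substitute the real UUID
>> > >>>>> and OSD id):
>> > >>>>>
>> > >>>>> ls -l /var/lib/ceph/UUID/osd.N/block /var/lib/ceph/UUID/osd.N/block.db
>> > >>>>> readlink -f /var/lib/ceph/UUID/osd.N/block  # should resolve to a dm/LV device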
>> > >>>>>
>> > >>>>> All OSDs are encrypted.
>> > >>>>>
>> > >>>>> Does anyone have ideas on how to debug this further?
>> > >>>>
>> > >>>>
>> > >>>
>> > >>
>> > >>
>> > >
>> >
>> >
>>
>> --
>> The "UTF-8 problems" self-help group meets in the big hall this time, as
>> an exception.
>>
>

-- 
The "UTF-8 problems" self-help group meets in the big hall this time, as an
exception.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
