Yes, it is crazy. You can check the /var/log/ceph/FSID/ceph-volume.log file
and see that the number of calls goes nuts. Faster, newer CPUs seem to
handle these calls okayish, but old Intel(R) Xeon(R) Silver 4116 CPUs @
2.10GHz struggle very hard :)
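Something like this should give a rough idea of the call volume (FSID is a
placeholder as above; the pattern covers the udevadm, lsblk and nsenter
calls the log gets hammered with):

# grep -cE 'udevadm|lsblk|nsenter' /var/log/ceph/FSID/ceph-volume.log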
On Thu, 29 Jan 2026 at 16:06, Eugen Block <[email protected]> wrote:

> I was thinking about the same bug you commented on, Boris:
>
> https://tracker.ceph.com/issues/73107
> <https://tracker.ceph.com/issues/73107#change-331393>
>
> I am also subscribed to that bug because we upgraded to 19.2.3 a couple
> of months ago. But since we don't have that many OSDs per host, we
> weren't facing any impact as described in the tracker.
>
> On Wed, 28 Jan 2026 at 22:19, Boris via ceph-users <[email protected]>
> wrote:
>
>> Hi Malte,
>> we just upped the timeout in the service file to 720 (a cronjob is
>> going to set it every minute). Starting the OSDs takes around 4
>> minutes. We still think this is an issue with ceph-volume activate,
>> because that is what takes ages. As soon as cryptsetup has opened the
>> LUKS devices, the rest proceeds as normal.
>>
>> The ceph-volume log seems to reflect that:
>> root@s3db15:~# ls -alh
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log*
>> -rw-r--r-- 1 root root 2.4G Jan 27 15:59
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log
>> -rw-r--r-- 1 root root 680M Jan 27 00:00
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.1.gz
>> -rw-r--r-- 1 root root 5.0M Jan 25 23:46
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.2.gz
>> -rw-r--r-- 1 root root 4.8M Jan 24 23:35
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.3.gz
>> -rw-r--r-- 1 root root 4.8M Jan 23 23:45
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.4.gz
>> -rw-r--r-- 1 root root 4.8M Jan 22 23:38
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.5.gz
>> -rw-r--r-- 1 root root 4.8M Jan 21 23:39
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.6.gz
>> -rw-r--r-- 1 root root 4.8M Jan 20 23:38
>> /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.7.gz
>> ---
>> # time systemctl restart [email protected]
>>
>> real 3m46.251s
>> user 0m0.060s
>> sys 0m0.063s
>> ---
>> I will prepare an output of the ceph-collect tool from 42on. But I need
>> to remove the sensitive stuff like the RGW config and so on. I don't
>> think it will give the right clues, though, as the output won't show
>> what is happening during the start.
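A side note on the timeout: instead of the cronjob, a systemd drop-in
might do the same job, since drop-ins live next to (not inside) the unit
file and should survive it being rewritten. Untested sketch; the template
unit name is an assumption based on how cephadm names its units:

# mkdir -p /etc/systemd/system/[email protected]
# cat > /etc/systemd/system/[email protected]/override.conf <<EOF
[Service]
TimeoutStartSec=720
EOF
# systemctl daemon-reload

A drop-in on the template unit should apply to every [email protected] instance
on the host at once.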
>> On Tue, 27 Jan 2026 at 13:51, Malte Stroem <[email protected]>
>> wrote:
>>
>> > We do not know a lot about your cluster, so it's hard to help.
>> >
>> > Give us all the information one needs.
>> >
>> > ceph -s, ceph orch host ls, ceph health detail, all the good things
>> > to get an overview.
>> >
>> > On 1/27/26 12:43, Boris via ceph-users wrote:
>> > > All disks are 8TB HDDs. We have some 16TB HDDs, but those are all
>> > > in the newest host, which updated just fine.
>> > >
>> > > On Tue, 27 Jan 2026 at 12:39, Malte Stroem <[email protected]>
>> > > wrote:
>> > >
>> > >> From what I can see, the OSD is running. It's starting up and
>> > >> still needs time.
>> > >>
>> > >> How big is the disk?
>> > >>
>> > >> On 1/27/26 12:31, Boris via ceph-users wrote:
>> > >>> Sure: https://pastebin.com/9RLzyUQs
>> > >>>
>> > >>> I've trimmed the log a little (removed peering, epoch, trim and
>> > >>> so on). This is the last OSD we tried that did not work.
>> > >>>
>> > >>> We tried another host, where the upgrade just went through. But
>> > >>> that host also has the newest hardware.
>> > >>> We don't think it is a hardware issue, though, because the first
>> > >>> 30 OSDs were on the two oldest hosts, and the first OSD that
>> > >>> failed was on the same host as the last OSD that did not fail.
>> > >>>
>> > >>> On Tue, 27 Jan 2026 at 12:05, Malte Stroem <[email protected]>
>> > >>> wrote:
>> > >>>
>> > >>>> Could be the kind of hardware you are using. Is it different
>> > >>>> from the other clusters' hardware?
>> > >>>>
>> > >>>> Send us logs, so we can help you out.
>> > >>>>
>> > >>>> Example:
>> > >>>>
>> > >>>> journalctl -eu [email protected]
>> > >>>>
>> > >>>> Best,
>> > >>>> Malte
>> > >>>>
>> > >>>> On 1/27/26 11:55, Boris via ceph-users wrote:
>> > >>>>> Hi,
>> > >>>>> we are currently facing an issue where suddenly none of the
>> > >>>>> OSDs will start after the containers come up with the new
>> > >>>>> version.
>> > >>>>>
>> > >>>>> This seems to affect only some hosts/OSDs. The first 30 OSDs
>> > >>>>> worked, but took really long (around 5 hours), and every
>> > >>>>> single OSD after that needed a host reboot to bring the disk
>> > >>>>> back up and continue the update.
>> > >>>>>
>> > >>>>> We stopped after 6 tries.
>> > >>>>>
>> > >>>>> And one disk never came back up. We removed and zapped the
>> > >>>>> OSD. The orchestrator picked up the available disk and
>> > >>>>> recreated it. It came up within seconds.
>> > >>>>>
>> > >>>>> We have around 90 clusters, and this happened on only a single
>> > >>>>> one. All the others updated within two hours without any
>> > >>>>> issues.
>> > >>>>>
>> > >>>>> The cluster uses 8TB HDDs with the block.db on SSD (5 block.db
>> > >>>>> devices per SSD).
>> > >>>>> The file /var/log/ceph/UUID/ceph-volume.log gets hammered with
>> > >>>>> a lot of output from udevadm, lsblk and nsenter.
>> > >>>>> The activation container (ceph-UUID-osd-N-activate) gets
>> > >>>>> killed after a couple of minutes.
>> > >>>>> It also looks like the block and block.db links in
>> > >>>>> /var/lib/ceph/UUID/osd.N/ are not set correctly.
>> > >>>>> When we restart the daemons that needed a host restart, the
>> > >>>>> OSD doesn't come up and needs another host restart.
>> > >>>>>
>> > >>>>> All OSDs are encrypted.
>> > >>>>>
>> > >>>>> Does anyone have ideas on how to debug this further?

--
The self-help group "UTF-8 problems" will, as an exception, meet in the
large hall this time.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
