I was thinking about the same bug you commented on, Boris:
https://tracker.ceph.com/issues/73107#change-331393

I am also subscribed to that bug because we upgraded to 19.2.3 a couple of
months ago. But since we don't have that many OSDs per host, we were not
seeing the impact described in the tracker.
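
A few things that might help narrow down where the time goes on an affected
host, also regarding the udevadm/lsblk/nsenter noise and the block/block.db
links mentioned in your original post further down. This is only a rough
sketch: UUID and osd.N are the placeholders from your mail, it assumes
ceph-volume still prefixes subprocess calls in its log with "Running
command:", and the cephadm wrapper syntax is from memory, so double-check it:

  # Which external commands dominate ceph-volume.log?
  grep -oE 'Running command: [^ ]+' /var/log/ceph/UUID/ceph-volume.log \
      | sort | uniq -c | sort -rn | head

  # Do the block/block.db links of a broken OSD resolve to the expected devices?
  ls -l /var/lib/ceph/UUID/osd.N/block /var/lib/ceph/UUID/osd.N/block.db
  readlink -f /var/lib/ceph/UUID/osd.N/block /var/lib/ceph/UUID/osd.N/block.db

  # Roughly time the device scan on its own; if this alone takes minutes,
  # activation will as well (both commands are read-only).
  time cephadm ceph-volume -- inventory
  time cephadm ceph-volume -- lvm list

If the inventory/lvm list calls are already slow on the affected hosts, that
would fit the device-scanning slowness from the tracker issue.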

On Wed, Jan 28, 2026 at 22:19, Boris via ceph-users <[email protected]> wrote:

> Hi Malte,
> we just upped the timeout in the service file to 720 (a cron job will set
> it again every minute).
> Starting the OSDs takes around 4 minutes. We still think this is an issue
> with the ceph-volume activate step, because that is what takes ages. As
> soon as cryptsetup opens the LUKS devices, the rest proceeds as normal.
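
Rather than re-applying the timeout from cron every minute, a systemd drop-in
on the unit template might be less fragile. Only a sketch: it assumes the unit
name from your output ([email protected]),
that TimeoutStartSec is the value you raised, and that cephadm leaves drop-in
directories under /etc/systemd/system alone when it redeploys the daemon (I
have not verified that last part):

  # A drop-in on the template (the "@.service.d" directory) applies to
  # every osd.N instance on this host.
  mkdir -p /etc/systemd/system/[email protected]
  printf '[Service]\nTimeoutStartSec=720\n' \
      > /etc/systemd/system/[email protected]/override.conf
  systemctl daemon-reload
  # Verify the override is picked up:
  systemctl cat [email protected]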

> The ceph-volume log seems to reflect that:
> root@s3db15:~# ls -alh /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log*
> -rw-r--r-- 1 root root 2.4G Jan 27 15:59 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log
> -rw-r--r-- 1 root root 680M Jan 27 00:00 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.1.gz
> -rw-r--r-- 1 root root 5.0M Jan 25 23:46 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.2.gz
> -rw-r--r-- 1 root root 4.8M Jan 24 23:35 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.3.gz
> -rw-r--r-- 1 root root 4.8M Jan 23 23:45 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.4.gz
> -rw-r--r-- 1 root root 4.8M Jan 22 23:38 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.5.gz
> -rw-r--r-- 1 root root 4.8M Jan 21 23:39 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.6.gz
> -rw-r--r-- 1 root root 4.8M Jan 20 23:38 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.7.gz
> ---
> # time systemctl restart [email protected]
>
> real 3m46.251s
> user 0m0.060s
> sys 0m0.063s
> ---
> I will prepare an output of the ceph-collect tool from 42on, but I need to
> remove the sensitive stuff like the RGW config first. I don't think it will
> give the right clues, though, as the output won't show what is happening
> during the start.
>
> On Tue, Jan 27, 2026 at 13:51, Malte Stroem <[email protected]> wrote:
>
> > We do not know a lot about your cluster, so it's hard to help.
> >
> > Give us all the information one needs.
> >
> > ceph -s, ceph orch host ls, ceph health detail, all the good things to
> > get an overview.
> >
> > On 1/27/26 12:43, Boris via ceph-users wrote:
> > > All disks are 8 TB HDDs. We have some 16 TB HDDs, but those are all in
> > > the newest host, which updated just fine.
> > >
> > > On Tue, Jan 27, 2026 at 12:39, Malte Stroem <[email protected]> wrote:
> > >
> > >> From what I can see, the OSD is running. It's starting up and still
> > >> needs time.
> > >>
> > >> How big is the disk?
> > >>
> > >> On 1/27/26 12:31, Boris via ceph-users wrote:
> > >>> Sure: https://pastebin.com/9RLzyUQs
> > >>>
> > >>> I've trimmed the log a little bit (removed peering, epoch, trim and
> > >>> so on). This is the last OSD that we tried that did not work.
> > >>>
> > >>> We tried another host, where the upgrade just went through. But that
> > >>> host also has the newest hardware. We don't think it is a hardware
> > >>> issue, though, because the first 30 OSDs were on the two oldest
> > >>> hosts, and the first OSD that failed was on the same host as the
> > >>> last OSD that did not fail.
> > >>>
> > >>> On Tue, Jan 27, 2026 at 12:05, Malte Stroem <[email protected]> wrote:
> > >>>
> > >>>> Could be the kind of hardware you are using. Is it different from
> > >>>> the other clusters' hardware?
> > >>>>
> > >>>> Send us logs, so we can help you out.
> > >>>>
> > >>>> Example:
> > >>>>
> > >>>> journalctl -eu [email protected]
> > >>>>
> > >>>> Best,
> > >>>> Malte
> > >>>>
> > >>>> On 1/27/26 11:55, Boris via ceph-users wrote:
> > >>>>> Hi,
> > >>>>> we are currently facing an issue where suddenly none of the OSDs
> > >>>>> will start after the container comes up with the new version.
> > >>>>>
> > >>>>> This seems to be an issue with some hosts/OSDs. The first 30 OSDs
> > >>>>> worked, but took really long (around 5 hours), and every single
> > >>>>> OSD after that needed a host reboot to bring the disk back up and
> > >>>>> continue the update.
> > >>>>>
> > >>>>> We stopped after 6 tries.
> > >>>>>
> > >>>>> One disk never came back up. We removed and zapped the OSD, the
> > >>>>> orchestrator picked up the available disk and recreated it, and
> > >>>>> it came up within seconds.
> > >>>>>
> > >>>>> We have around 90 clusters and this happened only on a single one.
> > >>>>> All the others updated within two hours without any issues.
> > >>>>>
> > >>>>> The cluster uses 8 TB HDDs with the block.db on SSD (5 block.db
> > >>>>> per SSD).
> > >>>>> The file /var/log/ceph/UUID/ceph-volume.log gets hammered with a
> > >>>>> lot of output from udevadm, lsblk and nsenter.
> > >>>>> The activation container (ceph-UUID-osd-N-activate) gets killed
> > >>>>> after a couple of minutes.
> > >>>>> It also looks like the block and block.db links
> > >>>>> in /var/lib/ceph/UUID/osd.N/ are not set correctly.
> > >>>>> When we restart one of the daemons that previously needed a host
> > >>>>> reboot, the OSD doesn't come up again and needs another host
> > >>>>> reboot.
> > >>>>>
> > >>>>> All OSDs are encrypted.
> > >>>>>
> > >>>>> Does anyone have ideas on how to debug this further?
>
> --
> This time, as an exception, the "UTF-8 problems" self-help group meets in
> the large hall.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
