Hi Malte,

we just upped the timeout in the service file to 720 seconds (a cron job will re-apply it every minute). Starting the OSDs takes around 4 minutes. We still think this is an issue with the ceph-volume activate step, because that is what takes ages. As soon as cryptsetup opens the LUKS devices, the rest proceeds normally.
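For reference, this is roughly the change we put in place. A minimal sketch, assuming the knob in question is systemd's TimeoutStartSec on the cephadm-generated unit; the drop-in path is derived from the unit name in the restart shown below, and the re-apply script in the cron.d entry is a hypothetical stand-in:

# /etc/systemd/system/[email protected]/override.conf
# (drop-in for the unit template, so it covers every OSD instance on this
# host and the cephadm-generated unit file itself stays untouched)
[Service]
TimeoutStartSec=720

# pick up the drop-in
systemctl daemon-reload

# /etc/cron.d/ceph-osd-timeout -- re-apply every minute (script path is hypothetical)
* * * * * root /usr/local/sbin/reapply-osd-timeout.sh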
The ceph-volume log seems to reflect the slow activation:

root@s3db15:~# ls -alh /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log*
-rw-r--r-- 1 root root 2.4G Jan 27 15:59 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log
-rw-r--r-- 1 root root 680M Jan 27 00:00 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.1.gz
-rw-r--r-- 1 root root 5.0M Jan 25 23:46 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.2.gz
-rw-r--r-- 1 root root 4.8M Jan 24 23:35 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.3.gz
-rw-r--r-- 1 root root 4.8M Jan 23 23:45 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.4.gz
-rw-r--r-- 1 root root 4.8M Jan 22 23:38 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.5.gz
-rw-r--r-- 1 root root 4.8M Jan 21 23:39 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.6.gz
-rw-r--r-- 1 root root 4.8M Jan 20 23:38 /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.7.gz

---
# time systemctl restart [email protected]

real    3m46.251s
user    0m0.060s
sys     0m0.063s
---

I will prepare an output of the ceph-collect tool from 42on, but I first need to remove sensitive parts such as the rgw config. I don't think it will give the right clues, though, as the output won't show what is happening during the start.
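Since the collect output won't show the startup itself, we are watching it live instead; nothing fancy, just the unit journal, the ceph-volume log, and a timed restart in parallel (osd.17 as the example again):

# terminal 1: follow the OSD unit's journal during startup
journalctl -fu [email protected]

# terminal 2: follow the ceph-volume log to see the udevadm/lsblk/nsenter churn live
tail -f /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log

# terminal 3: restart the OSD and time it
time systemctl restart [email protected]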
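And because the block/block.db links looked wrong on the failed OSDs (see the original report quoted below), we plan to compare the runtime links against what ceph-volume itself reports. A sketch, assuming cephadm's ceph-volume passthrough works here; with encrypted OSDs we would expect both links to point at /dev/mapper devices once the LUKS volumes are open:

# what ceph-volume knows about the OSD's devices
cephadm ceph-volume lvm list

# what the runtime directory actually points at
ls -l /var/lib/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/osd.17/block \
      /var/lib/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/osd.17/block.db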
On Tue, Jan 27, 2026 at 13:51, Malte Stroem <[email protected]> wrote:

> We do not know a lot about your cluster, so it's hard to help.
>
> Give us all the information one needs.
>
> ceph -s, ceph orch host ls, ceph health detail, all the good things to
> get an overview.
>
> On 1/27/26 12:43, Boris via ceph-users wrote:
> > All disks are 8 TB HDDs. We have some 16 TB HDDs, but those are all in
> > the newest host, which updated just fine.
> >
> > On Tue, Jan 27, 2026 at 12:39, Malte Stroem <[email protected]> wrote:
> >
> >> From what I can see, the OSD is running. It's starting up and still
> >> needs time.
> >>
> >> How big is the disk?
> >>
> >> On 1/27/26 12:31, Boris via ceph-users wrote:
> >>> Sure: https://pastebin.com/9RLzyUQs
> >>>
> >>> I've trimmed the log a little (removed peering, epoch, trim and so on).
> >>> This is the last OSD we tried that did not work.
> >>>
> >>> We tried another host, where the upgrade just went through. But that
> >>> host also has the newest hardware.
> >>> Still, we don't think it is a hardware issue, because the first 30 OSDs
> >>> were on the two oldest hosts, and the first OSD that failed was on the
> >>> same host as the last OSD that did not fail.
> >>>
> >>> On Tue, Jan 27, 2026 at 12:05, Malte Stroem <[email protected]> wrote:
> >>>
> >>>> Could be the kind of hardware you are using. Is it different from the
> >>>> other clusters' hardware?
> >>>>
> >>>> Send us logs, so we can help you out.
> >>>>
> >>>> Example:
> >>>>
> >>>> journalctl -eu [email protected]
> >>>>
> >>>> Best,
> >>>> Malte
> >>>>
> >>>> On 1/27/26 11:55, Boris via ceph-users wrote:
> >>>>> Hi,
> >>>>> we are currently facing an issue where suddenly none of the OSDs will
> >>>>> start after their containers come up with the new version.
> >>>>>
> >>>>> It seems to affect only some hosts/OSDs. The first 30 OSDs worked,
> >>>>> but took really long (around 5 hours), and after that every single
> >>>>> OSD needed a host reboot to bring the disk back up and continue the
> >>>>> update.
> >>>>>
> >>>>> We stopped after 6 tries.
> >>>>>
> >>>>> One disk never came back up. We removed and zapped the OSD. The
> >>>>> orchestrator picked up the available disk and recreated it. It came
> >>>>> up within seconds.
> >>>>>
> >>>>> We have around 90 clusters and this happened on only a single one.
> >>>>> All the others updated within two hours without any issues.
> >>>>>
> >>>>> The cluster uses 8 TB HDDs with the block.db on SSD (5 block.db per
> >>>>> SSD).
> >>>>> The file /var/log/ceph/UUID/ceph-volume.log gets hammered with a lot
> >>>>> of output from udevadm, lsblk and nsenter.
> >>>>> The activation container (ceph-UUID-osd-N-activate) gets killed after
> >>>>> a couple of minutes.
> >>>>> It also looks like the block and block.db links in
> >>>>> /var/lib/ceph/UUID/osd.N/ are not set correctly.
> >>>>> When we restart a daemon that previously needed a host restart, the
> >>>>> OSD doesn't come up and needs another host restart.
> >>>>>
> >>>>> All OSDs are encrypted.
> >>>>>
> >>>>> Does anyone have ideas on how to debug this further?

--
The self-help group "UTF-8 problems" will, as an exception, meet in the big hall this time.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
