Hi Malte,
we just upped the timeout in the service file to 720 (a cronjob will re-set
it every minute).
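
For reference, a systemd drop-in might achieve the same without the cronjob,
assuming cephadm rewrites only the unit file itself and leaves drop-ins
alone (a sketch, not tested here):

---
# systemctl edit ceph-dca79fff-ffd0-58f4-1cff-82a2feea05f4@.service
[Service]
TimeoutStartSec=720
---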
Starting the OSDs takes around 4 minutes. We still think this is an issue
with ceph-volume activate, because that is the part that takes ages. As soon
as cryptsetup has opened the LUKS devices, the rest proceeds normally.
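
To see when the LUKS mappings actually appear during a restart, one can
watch the device-mapper crypt targets from a second shell while the unit
starts (sketch; dmsetup comes with the standard device-mapper tools):

---
# watch -n1 'dmsetup ls --target crypt'
---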

The ceph-volume log seems to reflect that:
root@s3db15:~# ls -alh
/var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log*
-rw-r--r-- 1 root root 2.4G Jan 27 15:59
/var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log
-rw-r--r-- 1 root root 680M Jan 27 00:00
/var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.1.gz
-rw-r--r-- 1 root root 5.0M Jan 25 23:46
/var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.2.gz
-rw-r--r-- 1 root root 4.8M Jan 24 23:35
/var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.3.gz
-rw-r--r-- 1 root root 4.8M Jan 23 23:45
/var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.4.gz
-rw-r--r-- 1 root root 4.8M Jan 22 23:38
/var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.5.gz
-rw-r--r-- 1 root root 4.8M Jan 21 23:39
/var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.6.gz
-rw-r--r-- 1 root root 4.8M Jan 20 23:38
/var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log.7.gz
---
# time systemctl restart [email protected]

real 3m46.251s
user 0m0.060s
sys 0m0.063s
---
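
A rough way to see what ceph-volume spends its time on is to count the
helper calls in that log (a sketch; these are the helpers we see the log
filled with):

---
# grep -oE 'udevadm|lsblk|nsenter' \
    /var/log/ceph/dca79fff-ffd0-58f4-1cff-82a2feea05f4/ceph-volume.log \
    | sort | uniq -c
---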
I will prepare the output of the ceph-collect tool from 42on, but I need to
remove the sensitive stuff like the RGW config first. I don't think it will
give the right clues, though, as the output won't show what is happening
during startup.


On Tue, 27 Jan 2026 at 13:51, Malte Stroem <[email protected]> wrote:

> We do not know a lot about your cluster, so it's hard to help.
>
> Give us all the information one needs.
>
> ceph -s, ceph orch host ls, ceph health detail, all the good things to
> get an overview.
>
> On 1/27/26 12:43, Boris via ceph-users wrote:
> > All disks are 8 TB HDDs. We have some 16 TB HDDs, but those are all in the
> > newest host, which updated just fine.
> >
> > On Tue, 27 Jan 2026 at 12:39, Malte Stroem <[email protected]> wrote:
> >
> >>   From what I can see, the OSD is running. It's starting up and still
> >> needs time.
> >>
> >> How big is the disk?
> >>
> >> On 1/27/26 12:31, Boris via ceph-users wrote:
> >>> Sure: https://pastebin.com/9RLzyUQs
> >>>
> >>> I've trimmed the log a little bit (removed peering, epoch, trim, and so
> >>> on).
> >>> This is the last OSD that we tried that did not work.
> >>>
> >>> We tried another host, where the upgrade just went through, but that
> >>> host also has the newest hardware. Still, we don't think it is a
> >>> hardware issue, because the first 30 OSDs were on the two oldest hosts,
> >>> and the first OSD that failed was on the same host as the last OSD that
> >>> did not fail.
> >>>
> >>>
> >>>
> >>> On Tue, 27 Jan 2026 at 12:05, Malte Stroem <[email protected]> wrote:
> >>>
> >>>> Could be the kind of hardware you are using. Is it different from the
> >>>> other clusters' hardware?
> >>>>
> >>>> Send us logs so we can help you out.
> >>>>
> >>>> Example:
> >>>>
> >>>> journalctl -eu [email protected]
> >>>>
> >>>> Best,
> >>>> Malte
> >>>>
> >>>> On 1/27/26 11:55, Boris via ceph-users wrote:
> >>>>> Hi,
> >>>>> we are currently facing an issue where suddenly none of the OSDs will
> >>>>> start after the container has started with the new version.
> >>>>>
> >>>>> This seems to be an issue with some hosts/OSDs. The first 30 OSDs
> >>>>> worked, but took really long (around 5 hours), and then every single
> >>>>> OSD after that needed a host reboot to bring the disk back up and
> >>>>> continue the update.
> >>>>>
> >>>>> We stopped after 6 tries.
> >>>>>
> >>>>> And one disk never came back up. We removed and zapped the OSD. The
> >>>>> orchestrator picked up the available disk and recreated it. It came up
> >>>>> within seconds.
> >>>>>
> >>>>> We have around 90 clusters and this happened only on a single one.
> >>>>> All others updated within two hours without any issues.
> >>>>>
> >>>>> The cluster uses HDDs (8 TB) with the block.db on SSD (5 block.db per
> >>>>> SSD).
> >>>>> The file /var/log/ceph/UUID/ceph-volume.log gets hammered with a lot
> >>>>> of output from udevadm, lsblk, and nsenter.
> >>>>> The activation container (ceph-UUID-osd-N-activate) gets killed after
> >>>>> a couple of minutes.
> >>>>> It also looks like the block and block.db links in
> >>>>> /var/lib/ceph/UUID/osd.N/ are not set correctly.
> >>>>> When we restart one of the daemons that previously needed a host
> >>>>> reboot, the OSD again doesn't come up until the host is rebooted.
> >>>>>
> >>>>> All OSDs are encrypted.
> >>>>>
> >>>>> Does anyone have ideas on how to debug this further?

-- 
The self-help group "UTF-8 problems" will exceptionally meet in the groüen
hall this time.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
