Yeah, it rang a bell when you mentioned the timeout. At least one other user reported this on this list not long ago, but I had already forgotten about it.

More responses inline...

Quoting Gilles Mocellin <gilles.mocel...@nuagelibre.org>:

On 2025-08-19 18:19, Gilles Mocellin wrote:
Hi,

New attempt, from scratch, with every device clean (no Physical Volume).

I've found this issue and these recommendations:
https://access.redhat.com/solutions/6545511
https://www.ibm.com/docs/en/storage-ceph/8.0.0?topic=80-bug-fixes

So I set a higher timeout for cephadm commands:
ceph config set global mgr/cephadm/default_cephadm_command_timeout 1800
The default was 900.
I see the orchestrator launching ceph-volume commands with a timeout of 1795 (it was 895 before; I don't know why it's 5s less than the configured value...).
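
For what it's worth, the value the mgr actually picked up can be double-checked with a standard config query (nothing specific to this setup):

ceph config get mgr mgr/cephadm/default_cephadm_command_timeout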

But I made that change after creating the OSD spec, so some daemons still failed/timed out:

root@fidcl-lyo1-sto-sds-lab-01:~# ceph health detail
HEALTH_WARN Failed to apply 1 service(s): osd.throughput_optimized; 7 failed cephadm daemon(s); noout flag(s) set
[WRN] CEPHADM_APPLY_SPEC_FAIL: Failed to apply 1 service(s): osd.throughput_optimized
    osd.throughput_optimized: Command timed out on host cephadm deploy (osd daemon) (default 1800 second timeout)
[WRN] CEPHADM_FAILED_DAEMON: 7 failed cephadm daemon(s)
   daemon osd.115 on fidcl-lyo1-sto-sds-lab-01 is in unknown state
   daemon osd.24 on fidcl-lyo1-sto-sds-lab-02 is in unknown state
   daemon osd.116 on fidcl-lyo1-sto-sds-lab-03 is in unknown state
   daemon osd.118 on fidcl-lyo1-sto-sds-lab-04 is in unknown state
   daemon osd.23 on fidcl-lyo1-sto-sds-lab-05 is in unknown state
   daemon osd.14 on fidcl-lyo1-sto-sds-lab-06 is in unknown state
   daemon osd.117 on fidcl-lyo1-sto-sds-lab-07 is in unknown state

As in the Red Hat issue, I ran the following on every host:
systemctl daemon-reload
systemctl reset-failed
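
If it helps, the OSD units still in a failed state on a host can be listed before and after the reset (plain systemd; the ceph-* pattern just matches the cephadm-managed units):

systemctl list-units --failed 'ceph-*'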

[...]

The OSDs are only visible in the ceph osd ls command and in the dashboard.
The daemons are not started, but the PV/VG/LVs are created.


I will forget about dmcrypt; I think with LV tags it's really easier to find the link between an OSD and its LVs...
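
Side note: even with dmcrypt, the OSD-to-LV mapping can usually be read back from the LV tags that ceph-volume writes, e.g. with one of these (standard commands on a cephadm host; the grep key is the tag name ceph-volume uses):

cephadm ceph-volume lvm list
lvs -o lv_name,vg_name,lv_tags | grep ceph.osd_id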
_______________________________________________

After several rebuilds, I can now say that:
- dmcrypt does not play a role
- It's a timeout issue; my OSD creation stops ~15 min after the OSD spec is applied (the default cephadm timeout is 900s, 895 seen in the logs; see the log commands below).
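
The timeouts show up in the cephadm log channel and can be followed live or read back afterwards with the usual mgr logging commands (syntax from memory, adjust as needed):

ceph -W cephadm
ceph log last 200 info cephadm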

If I create an empty cluster, set a bigger timeout with:
ceph config set global mgr/cephadm/default_cephadm_command_timeout 1800

Then, when I apply my OSD spec, it works.
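
For reference, a minimal sketch of what such a throughput-optimized spec can look like (illustrative only, not the exact spec used here; placement and device filters are placeholders):

service_type: osd
service_id: throughput_optimized
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0

ceph orch apply -i osd_spec.yaml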

When I tried that the last time, I changed the timeout during cephadm execution, so not all cephadm commands used it, and the error reporting the timeout probably just showed the new value (1800)...

What I've seen that could certainly be improved is the non-atomic, even non-serialized, sequence of operations, which leads to:
1- all OSDs are created (OSD as in `ceph osd ls`)
2- devices are created (PV/VG/LV/dmcrypt)
3- then, one OSD at a time, the folder /var/lib/ceph/FSID/osd.ID is created with the block and block.db links
4- and the systemd unit is created and started

The timeout happens during step 3.
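
When it stops at step 3, the LVs exist but /var/lib/ceph/FSID/osd.ID is missing for the remaining OSDs. One thing that may be worth trying in that state is the existing activation command (hedged: it is intended for re-activating already-prepared OSDs, e.g. after an OS reinstall, so it is not a guaranteed fix here; HOSTNAME is a placeholder):

ceph cephadm osd activate HOSTNAME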


This has also been brought up before, though I'm not sure whether it was here or on Slack. It seems like one of those suboptimal default settings that work for most use cases, but not all. Maybe a note in the docs would suffice, recommending a higher timeout when the operator intends to deploy many OSDs per node at once. Not sure if adding one more option during bootstrap is worth the hassle.
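
For what it's worth, no new bootstrap flag is strictly needed; the timeout can simply be raised right after bootstrap and before the OSD spec is applied (same commands as above, just in an explicit order; MON_IP and osd_spec.yaml are placeholders):

cephadm bootstrap --mon-ip MON_IP ...
ceph config set global mgr/cephadm/default_cephadm_command_timeout 1800
ceph orch apply -i osd_spec.yaml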


As all the devices are now in use and the Ceph OSD IDs have already been created, I think there is no way to recover and restart...

Not sure I understand what you mean.

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

