Yeah, it rang a bell when you mentioned the timeout. At least one other user reported this on this list not long ago, but I had already forgotten about it.

More responses inline...

Quoting Gilles Mocellin <gilles.mocel...@nuagelibre.org>:

On 2025-08-19 18:19, Gilles Mocellin wrote:
Hi,

New attempt, from scratch, with every device clean (no Physical Volume).

I've found this issue and these recommendations:
https://access.redhat.com/solutions/6545511
https://www.ibm.com/docs/en/storage-ceph/8.0.0?topic=80-bug-fixes

So I set a higher timeout for cephadm commands:
ceph config set global mgr/cephadm/default_cephadm_command_timeout 1800
The default was 900.
I see the orchestrator launching ceph-volume commands with a timeout of 1795 (it was 895 before; I don't know why it's 5s less than the configured value...).
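
For what it's worth, the value the mgr actually picked up can be double-checked with a standard config query (nothing specific to this setup):

ceph config get mgr mgr/cephadm/default_cephadm_command_timeout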

But I made that change after creating the OSD spec, so some daemons still failed/timed out:

root@fidcl-lyo1-sto-sds-lab-01:~# ceph health detail
HEALTH_WARN Failed to apply 1 service(s): osd.throughput_optimized; 7 failed cephadm daemon(s); noout flag(s) set
[WRN] CEPHADM_APPLY_SPEC_FAIL: Failed to apply 1 service(s): osd.throughput_optimized
    osd.throughput_optimized: Command timed out on host cephadm deploy (osd daemon) (default 1800 second timeout)
[WRN] CEPHADM_FAILED_DAEMON: 7 failed cephadm daemon(s)
   daemon osd.115 on fidcl-lyo1-sto-sds-lab-01 is in unknown state
   daemon osd.24 on fidcl-lyo1-sto-sds-lab-02 is in unknown state
   daemon osd.116 on fidcl-lyo1-sto-sds-lab-03 is in unknown state
   daemon osd.118 on fidcl-lyo1-sto-sds-lab-04 is in unknown state
   daemon osd.23 on fidcl-lyo1-sto-sds-lab-05 is in unknown state
   daemon osd.14 on fidcl-lyo1-sto-sds-lab-06 is in unknown state
   daemon osd.117 on fidcl-lyo1-sto-sds-lab-07 is in unknown state

As in the Red Hat issue, I ran the following on every host:
systemctl daemon-reload
systemctl reset-failed
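
If it helps, the OSD units still in a failed state on a host can be listed before and after the reset (plain systemd; the ceph-* pattern just matches the cephadm-managed units):

systemctl list-units --failed 'ceph-*'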

[...]

The OSDs are only visible in the ceph osd ls command and in the dashboard.
The daemons are not started, but the PV/VG/LVs are created.


I will forget about dmcrypt; I think with LV tags it's really easier to find the link between an OSD and its LVs...
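
Side note: even with dmcrypt, the OSD-to-LV mapping can usually be read back from the LV tags that ceph-volume writes, e.g. with one of these (standard commands on a cephadm host; the grep key is the tag name ceph-volume uses):

cephadm ceph-volume lvm list
lvs -o lv_name,vg_name,lv_tags | grep ceph.osd_id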
_______________________________________________

After several rebuilds, I can now say that:
- dmcrypt does not play a role
- It's a timeout issue; my OSD creation stops ~15 min after the OSD spec is applied (the default cephadm timeout is 900s, 895 seen in the logs; see the log commands below).
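
The timeouts show up in the cephadm log channel and can be followed live or read back afterwards with the usual mgr logging commands (syntax from memory, adjust as needed):

ceph -W cephadm
ceph log last 200 info cephadm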

If I create an empty cluster, set a bigger timeout with:
ceph config set global mgr/cephadm/default_cephadm_command_timeout 1800

Then, when I apply my OSD spec, it works.
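
For reference, a minimal sketch of what such a throughput-optimized spec can look like (illustrative only, not the exact spec used here; placement and device filters are placeholders):

service_type: osd
service_id: throughput_optimized
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0

ceph orch apply -i osd_spec.yaml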

When I tried that the last time, I changed the timeout during cephadm execution, so not all cephadm commands used it, and the error reporting the timeout probably just showed the new value (1800)...

What I've seen that could certainly be improved is the non-atomic, even non-serialized, sequence of operations, which leads to:
1- all OSDs are created (OSD as in `ceph osd ls`)
2- devices are created (PV/VG/LV/dmcrypt)
3- then, one OSD at a time, the folder /var/lib/ceph/FSID/osd.ID is created with the block and block.db links
4- and the systemd unit is created and started

The timeout happens during step 3.
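
When it stops at step 3, the LVs exist but /var/lib/ceph/FSID/osd.ID is missing for the remaining OSDs. One thing that may be worth trying in that state is the existing activation command (hedged: it is intended for re-activating already-prepared OSDs, e.g. after an OS reinstall, so it is not a guaranteed fix here; HOSTNAME is a placeholder):

ceph cephadm osd activate HOSTNAME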


This has also been brought up before, though I'm not sure whether it was here or on Slack. It seems like one of those suboptimal default settings that work for most use cases, but not all. Maybe a note in the docs would suffice, recommending a higher timeout when the operator intends to deploy many OSDs per node at once. Not sure if adding one more option during bootstrap is worth the hassle.
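
For what it's worth, no new bootstrap flag is strictly needed; the timeout can simply be raised right after bootstrap and before the OSD spec is applied (same commands as above, just in an explicit order; MON_IP and osd_spec.yaml are placeholders):

cephadm bootstrap --mon-ip MON_IP ...
ceph config set global mgr/cephadm/default_cephadm_command_timeout 1800
ceph orch apply -i osd_spec.yaml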


As all the devices are now in use and the Ceph OSD IDs have already been created, I think there is no way to recover and restart...

Not sure I understand what you mean.

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

