Yeah, it rang a bell when you mentioned the timeout. At least one
other user reported this on this list not long ago, but I had already
forgotten about it.
More responses inline...
Quoting Gilles Mocellin <gilles.mocel...@nuagelibre.org>:
On 2025-08-19 18:19, Gilles Mocellin wrote:
Hi,
New attempt, from scratch, with every device clean (no Physical Volume).
I've found this issue and these recommendations:
https://access.redhat.com/solutions/6545511
https://www.ibm.com/docs/en/storage-ceph/8.0.0?topic=80-bug-fixes
So I set a higher timeout for cephadm commands:
ceph config set global mgr/cephadm/default_cephadm_command_timeout 1800
Default was 900.
I see the orchestrator launching ceph-volume commands with a timeout
of 1795 (it was 895 before; I don't know why it's 5 s less than the
configured value...).
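To double-check which value the orchestrator actually sees, something
like this should do (querying the value effective for the mgr, which
is the consumer of that option):
ceph config get mgr mgr/cephadm/default_cephadm_command_timeout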
But I made that change after creating the OSD spec, so some
daemons have still failed/timed out:
root@fidcl-lyo1-sto-sds-lab-01:~# ceph health detail
HEALTH_WARN Failed to apply 1 service(s): osd.throughput_optimized;
7 failed cephadm daemon(s); noout flag(s) set
[WRN] CEPHADM_APPLY_SPEC_FAIL: Failed to apply 1 service(s):
osd.throughput_optimized
osd.throughput_optimized: Command timed out on host cephadm
deploy (osd daemon) (default 1800 second timeout)
[WRN] CEPHADM_FAILED_DAEMON: 7 failed cephadm daemon(s)
daemon osd.115 on fidcl-lyo1-sto-sds-lab-01 is in unknown state
daemon osd.24 on fidcl-lyo1-sto-sds-lab-02 is in unknown state
daemon osd.116 on fidcl-lyo1-sto-sds-lab-03 is in unknown state
daemon osd.118 on fidcl-lyo1-sto-sds-lab-04 is in unknown state
daemon osd.23 on fidcl-lyo1-sto-sds-lab-05 is in unknown state
daemon osd.14 on fidcl-lyo1-sto-sds-lab-06 is in unknown state
daemon osd.117 on fidcl-lyo1-sto-sds-lab-07 is in unknown state
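To inspect those unknown-state daemons, something like the following
helps (FSID and ID below are just placeholders):
ceph orch ps --daemon-type osd --refresh
cephadm ls                                  # on an affected host: what cephadm thinks is deployed
journalctl -u ceph-<FSID>@osd.<ID>.service  # unit logs for one stuck OSD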
As in the Red Hat issue, I've run on every host:
systemctl daemon-reload
systemctl reset-failed
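Something like this does it in one go from the admin node (assuming
root SSH access to all the hosts):
for h in fidcl-lyo1-sto-sds-lab-0{1..7}; do
    ssh root@"$h" 'systemctl daemon-reload && systemctl reset-failed'
done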
[...]
The OSDs are only visible in the ceph osd ls command and in the dashboard.
The daemons are not started, but the PV/VG/LVs are created.
I will drop dmcrypt; I think that with plain LV tags it's really easier
to find the link between an OSD and its LVs...
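For example, the mapping can be read either through ceph-volume or
straight from the LV tags, something like:
cephadm ceph-volume lvm list                        # per-OSD view: osd id, osd fsid, block/block.db devices
lvs -o lv_name,vg_name,lv_tags | grep ceph.osd_id   # same information from the raw LVM tags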
After several rebuilds, I can now say that:
- dmcrypt does not play a role
- It's a timeout issue; my OSD creation stops ~15 min after the OSD spec
is applied (the default cephadm timeout is 900 s; 895 is seen in the logs).
If I create an empty cluster, set a bigger timeout with:
ceph config set global mgr/cephadm/default_cephadm_command_timeout 1800
then apply my OSD spec, it works.
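In command form, on the freshly bootstrapped cluster (osd_spec.yaml
here stands for my throughput_optimized spec file):
ceph config set global mgr/cephadm/default_cephadm_command_timeout 1800
ceph orch apply -i osd_spec.yaml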
When I tried that last time, I changed the timeout during cephadm
execution, so not all cephadm commands used it, and the error
reporting the timeout most likely showed the new value (1800)...
What I've seen that could certainly be improved is the non-atomic,
even non-serialized, sequence of operations, which leads to:
1- all OSDs are created (OSD as in `ceph osd ls`)
2- devices are created (PV/VG/LV/dmcrypt)
3- then, one OSD at a time, the folder /var/lib/ceph/FSID/osd.ID is
created with the block and block.db links
4- and the systemd unit is created and started
The timeout happens during step 3.
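That sequence is easy to watch on a host while the spec is being
applied, with something like:
ls -d /var/lib/ceph/*/osd.*              # step 3: per-OSD directories appear one at a time
ls -l /var/lib/ceph/*/osd.*/block*       # block and block.db symlinks
systemctl list-units 'ceph-*@osd.*'      # step 4: units only show up at the end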
This has also been brought up before, though I'm not sure if it was
here or on Slack. It seems like one of those suboptimal default
settings that work for most use cases, but not all. Maybe a note in
the docs advising to increase the timeout when the operator intends to
deploy many OSDs per node at once would suffice. Not sure if adding
one more option during bootstrap is worth the hassle.
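Something like this in the docs might already be enough, assuming the
option is picked up from a ceph.conf-style file passed at bootstrap
(I haven't verified that part, and <MON_IP> is just a placeholder;
setting it with ceph config set right after bootstrap definitely works):
# initial.conf
[mgr]
mgr/cephadm/default_cephadm_command_timeout = 1800

cephadm bootstrap --mon-ip <MON_IP> --config initial.conf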
As all the devices are now in use and the Ceph OSD IDs are already
created, I think there is no way to recover and restart...
Not sure I understand what you mean.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io