Hi,
Thanks for all the feedback and suggestions. Summary of the summary:
after stopping the removal of the OSD that was waiting to be zapped
(because its disk is no longer available), the upgrade started
immediately and ran well. The cluster is now running 18.2.6! And as
Eugen said previously, I confirm that in 18.2.6 removed OSDs are no
longer considered stray daemons. I still have the feeling that Ceph
could give more useful information if:
- a cephadm message at INFO level (and visible with 'ceph orch upgrade
status') reported that the upgrade cannot proceed, with the reason
described above. This information could be given once, for example a
few minutes after entering the upgrade command if no daemon has been
upgraded yet (see the commands sketched after this list).
- a message at INFO level informed that the zap operation failed
(suggesting the use of DEBUG level for more information)
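For reference, a minimal set of standard cephadm/orchestrator commands
to inspect the current state from the CLI (the last one assumes
mgr/cephadm/log_to_cluster_level has been set to debug):
# ceph orch upgrade status
# ceph orch osd rm status
# ceph log last 100 debug cephadm
The first shows whether an upgrade is in progress and its target image,
the second lists OSDs still queued for removal/zapping, and the third
shows the recent cephadm log entries, where the zap failure currently
only appears at DEBUG level.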
About Anthony's last question: yes, the two OSDs were destroyed, as shown by:
# ceph osd tree|grep destroyed
253 hdd 16.37108 osd.253 destroyed 0 1.00000
381 hdd 16.37108 osd.381 destroyed 0 1.00000
@Eugen, regarding what I said about osd.381's device being picked up by
Ceph to replace the failed osd.381: I think it is the conjunction of
the osd.all-available-devices service placement not being set to
unmanaged (something we normally do, but as we added a few servers
recently we changed it and forgot to set it back to unmanaged) and the
fact that I zapped the device during the initial removal. Because of
this, the device appeared to be free for use... Maybe it should be
better documented that you should not zap a device intended for
definitive removal unless the osd.all-available-devices service
placement is set to unmanaged (see the sketch below)...
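As a minimal sketch of that precaution, assuming the default
osd.all-available-devices spec is the one in use:
# ceph orch apply osd --all-available-devices --unmanaged=true
# ceph orch ls osd
The first command stops cephadm from automatically creating OSDs on
free devices, and the second lets you verify that the service is now
reported as unmanaged. With that in place, a zapped device is no longer
picked up automatically and can be removed for good.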
Thanks again. Best regards,
Michel
On 30/04/2025 at 15:41, Eugen Block wrote:
Hm, I thought there was an excerpt from the osd tree, but apparently
not? Could you then please confirm that the OSDs are in fact marked as
destroyed in the osd tree?
Quoting Anthony D'Atri <anthony.da...@gmail.com>:
I'm not entirely sure what the orchestrator will do except for
clearing the pending state, and since the OSDs are already marked as
destroyed in the crush tree,
Do we know that they are? The thread shows some log messages but,
unless I'm missing it, no evidence that they were marked. When I ran
into a similar issue recently, they were not marked destroyed in the
CRUSH tree.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io