Hi,

Thanks for all the feedback and suggestions. Summary of the summary: after stopping the removal for the OSD waiting to be zapped (because its disk was no longer available), the upgrade started immediately and ran well. The cluster is now running 18.2.6! And as Eugen said previously, I confirm that in 18.2.6 removed OSDs are no longer considered stray daemons. I still have the feeling that Ceph could give more useful information if:
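
For reference, the pending removal can be inspected and cancelled through the orchestrator, roughly like this (the OSD id below is only an illustration, not necessarily the one that was stuck here):

# ceph orch osd rm status
# ceph orch osd rm stop 253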

- a cephadm message at INFO level (and visible with 'ceph orch upgrade status') reported that the upgrade cannot proceed because of the described reason. This information could be given once, for example a few minutes after entering the upgrade command if no daemon has been upgraded yet.

- a message at INFO level informed that the zap operation failed (suggesting the use of DEBUG level for more information; a possible way to get this information today is sketched right below this list)
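
In the meantime, if I read the cephadm troubleshooting documentation correctly, the failed zap should show up when temporarily raising the cephadm log level and watching the cluster log:

# ceph config set mgr mgr/cephadm/log_to_cluster_level debug
# ceph -W cephadm --watch-debug

(and 'ceph config set mgr mgr/cephadm/log_to_cluster_level info' to go back to the default afterwards)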

About Anthony's last question: yes, the 2 OSDs were destroyed, as shown by:

# ceph osd tree|grep destroyed
253    hdd    16.37108                  osd.253 destroyed         0  1.00000
381    hdd    16.37108                  osd.381 destroyed         0  1.00000

@Eugen, regarding what I said about osd.381 being picked up by Ceph to replace the failed osd.381: I think it was the conjunction of two things, the osd.all-available-devices service placement not being set to unmanaged (something we normally do, but as we added a few servers recently we changed it and forgot to set it back to unmanaged) and the fact that in the initial removal I zapped the device. Because of this, the device appeared to be free for use... Maybe it should be better documented that you should not zap a device intended for definitive removal unless the osd.all-available-devices service is set to unmanaged.
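
For the record, what I understand should have been done before zapping in our case is something like (the OSD id is the one from our cluster, given only as an example):

# ceph orch apply osd --all-available-devices --unmanaged=true
# ceph orch osd rm 381 --zap

With the service unmanaged, the orchestrator does not immediately recreate an OSD on the freshly zapped device.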

Thanks again. Best regards,

Michel

On 30/04/2025 at 15:41, Eugen Block wrote:
Hm, I thought there was an excerpt from the osd tree, but apparently not? Could you then please confirm that the OSDs are in fact marked as destroyed in the osd tree?

Quoting Anthony D'Atri <anthony.da...@gmail.com>:


I'm not entirely sure what the orchestrator will do except for clearing the pending state, and since the OSDs are already marked as destroyed in the crush tree,

Do we know that they are? The thread shows some log messages but, unless I'm missing it, no evidence that they were marked. When I ran into a similar issue recently, they were not marked destroyed in the CRUSH tree.

