Hi,
what's the current `ceph status`? Wasn't there a bug in early Reef
versions that prevented upgrades if there were removed OSDs in the
queue? IIRC, the cephadm module would crash in that case. Can you check
ceph config-key get mgr/cephadm/osd_remove_queue
And then I would check the mgr log, maybe set it to a higher debug
level, to see what's blocking it.
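Raising the debug level usually looks something like this (standard
cephadm commands; reset the level when you're done):

-------
# Log cephadm messages at debug level to the cluster log
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
# Follow cephadm messages, including debug-level ones
ceph -W cephadm --watch-debug
# Or show the most recent cephadm log entries
ceph log last cephadm
# Reset the level when done
ceph config rm mgr mgr/cephadm/log_to_cluster_level
-------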
Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
Hi,
I tried to restart all the mgrs (we have 3: 1 active, 2 standby) by
running `ceph mgr fail` three times, with no impact. I don't really
understand why I get these stray daemons after doing a `ceph orch
osd rm --replace`, but I think I have always seen this. I tried to
mute rather than disable the stray daemon check, but it doesn't help
either. And I find it strange that this message appears every 10s
about one of the destroyed OSDs, and only one, reporting that it is
down and already destroyed and saying it will zap it (I think I
didn't add --zap when I removed it, as the underlying disk is dead).
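For reference, the mute and a check of the cephadm removal queue
would be something like this (the health code for stray daemons
should be CEPHADM_STRAY_DAEMON):

-------
# Show OSDs still pending in the cephadm removal queue
ceph orch osd rm status
# Mute, rather than disable, the stray daemon warning
ceph health mute CEPHADM_STRAY_DAEMON
-------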
I'm completely stuck with this upgrade, and I don't remember having
this kind of problem in previous upgrades with cephadm... Any idea
where to look for the cause and/or how to fix it?
Best regards,
Michel
On 24/04/2025 at 23:34, Michel Jouvin wrote:
Hi,
I'm trying to upgrade a (cephadm) cluster from 18.2.2 to 18.2.6
using `ceph orch upgrade`. When I enter the command `ceph orch
upgrade start --ceph-version 18.2.6`, I receive a message saying
that the upgrade has been initiated, with a similar message in the
logs, but nothing happens after that. `ceph orch upgrade status` says:
-------
[root@ijc-mon1 ~]# ceph orch upgrade status
{
    "target_image": "quay.io/ceph/ceph:v18.2.6",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [],
    "progress": "",
    "message": "",
    "is_paused": false
}
-------
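For completeness, the upgrade itself can be paused, resumed or
stopped with the standard commands:

-------
ceph orch upgrade pause    # sets is_paused to true
ceph orch upgrade resume   # continue a paused upgrade
ceph orch upgrade stop     # abort the upgrade entirely
-------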
The first time I entered the command, the cluster status was
HEALTH_WARN because of 2 stray daemons (caused by the destroyed
OSDs, rm --replace). I set mgr/cephadm/warn_on_stray_daemons to
false (exact command below, after the log excerpt) to ignore these
2 daemons; the cluster is now HEALTH_OK, but it doesn't help.
Following a Red Hat KB entry, I tried to fail over the mgr, then
stopped and restarted the upgrade, but without any improvement. I
have not seen anything in the logs, except that there is an INF
entry every 10s about the destroyed OSD saying:
------
2025-04-24T21:30:54.161988+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14079 : cephadm [INF] osd.253 now down
2025-04-24T21:30:54.162601+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14080 : cephadm [INF] Daemon osd.253 on dig-osd4 was already removed
2025-04-24T21:30:54.164440+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14081 : cephadm [INF] Successfully destroyed old osd.253 on dig-osd4; ready for replacement
2025-04-24T21:30:54.164536+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14082 : cephadm [INF] Zapping devices for osd.253 on dig-osd4
------
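For reference, the warn_on_stray_daemons setting mentioned above is
just the standard mgr config option:

-------
ceph config set mgr mgr/cephadm/warn_on_stray_daemons false
-------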
The message seems to appear for only one of the 2 destroyed OSDs
since I restarted the mgr. Could this be the cause of the stuck
upgrade? What can I do to fix it?
Thanks in advance for any hint. Best regards,
Michel
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io