Eugen,
Thanks for the hint. Here is the osd_remove_queue:
[root@ijc-mon1 ~]# ceph config-key get mgr/cephadm/osd_remove_queue|jq
[
  {
    "osd_id": 253,
    "started": true,
    "draining": false,
    "stopped": false,
    "replace": true,
    "force": false,
    "zap": true,
    "hostname": "dig-osd4",
    "drain_started_at": null,
    "drain_stopped_at": null,
    "drain_done_at": "2025-04-15T14:09:30.521534Z",
    "process_started_at": "2025-04-15T14:09:14.091592Z"
  },
  {
    "osd_id": 381,
    "started": true,
    "draining": false,
    "stopped": false,
    "replace": true,
    "force": false,
    "zap": false,
    "hostname": "dig-osd6",
    "drain_started_at": "2025-04-23T11:56:09.864724Z",
    "drain_stopped_at": null,
    "drain_done_at": "2025-04-25T06:53:03.678729Z",
    "process_started_at": "2025-04-23T11:56:05.924923Z"
  }
]
It is not empty: the two stray daemons are listed. I'm not sure whether these
entries are expected, since I specified --replace... A similar issue was
reported in https://tracker.ceph.com/issues/67018, so before Reef, but the
cause may be different. It is still not clear to me how to get out of this,
except maybe by replacing the OSDs, but that will take some time...
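The only other thing I can think of (not tested, and assuming it is actually
safe to edit this config-key by hand once the old disks are really gone) would
be to clear the queue manually, something like:

# check what cephadm still thinks is pending
ceph orch osd rm status
# overwrite the queue with an empty list (assumption: safe once the OSDs are gone)
ceph config-key set mgr/cephadm/osd_remove_queue '[]'
# restart the active mgr so the cephadm module reloads the key
ceph mgr fail

But I'd rather have a confirmation before touching cephadm internals.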
Best regards,
Michel
On 27/04/2025 at 10:21, Eugen Block wrote:
Hi,
what's the current ceph status? Wasn't there a bug in early Reef
versions preventing upgrades if there were removed OSDs in the
queue? But IIRC, the cephadm module would crash. Can you check
ceph config-key get mgr/cephadm/osd_remove_queue
And then I would check the mgr log, maybe set it to a higher debug
level to see what's blocking it.
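For example, something along these lines is usually enough to make the cephadm
module verbose (the usual cephadm debug knobs, to be reverted afterwards):

# send cephadm debug messages to the cluster log and follow them
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm --watch-debug
# optionally raise the mgr daemon's own log level as well
ceph config set mgr debug_mgr 20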
Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
Hi,
I tried to restart all the mgrs (we have 3: 1 active, 2 standby) by
executing `ceph mgr fail` 3 times, with no impact. I don't really
understand why I get these stray daemons after doing a
`ceph orch osd rm --replace`, but I think I have always seen this.
I tried to mute rather than disable the stray daemon check, but it
doesn't help either. I also find it strange that there is a message
every 10s about one of the destroyed OSDs, and only one, reporting
that it is down and already destroyed and saying it will zap it
(I think I didn't add --zap when I removed it, as the underlying disk is dead).
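For reference, what I ran was roughly the following (the mute targets the
CEPHADM_STRAY_DAEMON health check):

# fail over the active mgr, repeated 3 times so each mgr was cycled once
ceph mgr fail
# mute the stray daemon warning instead of disabling the check
ceph health mute CEPHADM_STRAY_DAEMON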
I'm completely stuck with this upgrade, and I don't remember having
this kind of problem in previous upgrades with cephadm... Any
idea where to look for the cause and/or how to fix it?
Best regards,
Michel
On 24/04/2025 at 23:34, Michel Jouvin wrote:
Hi,
I'm trying to upgrade a (cephadm) cluster from 18.2.2 to 18.2.6
using 'ceph orch upgrade'. When I enter the command 'ceph orch
upgrade start --ceph-version 18.2.6', I receive a message saying
that the upgrade has been initiated, with a similar message in
the logs, but nothing happens after that. 'ceph orch upgrade
status' says:
-------
[root@ijc-mon1 ~]# ceph orch upgrade status
{
  "target_image": "quay.io/ceph/ceph:v18.2.6",
  "in_progress": true,
  "which": "Upgrading all daemon types on all hosts",
  "services_complete": [],
  "progress": "",
  "message": "",
  "is_paused": false
}
-------
The first time I entered the command, the cluster status was
HEALTH_WARN because of 2 stray daemons (caused by destroyed OSDs,
rm --replace). I set mgr/cephadm/warn_on_stray_daemons to false
to ignore these 2 daemons; the cluster is now HEALTH_OK, but it
doesn't help. Following a Red Hat KB entry, I tried to fail over
the mgr and stopped and restarted the upgrade (roughly the commands
shown below, after the log excerpt), but without any improvement.
I have not seen anything in the logs, except that there is an INF
entry every 10s about the destroyed OSD saying:
------
2025-04-24T21:30:54.161988+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14079 : cephadm [INF] osd.253 now down
2025-04-24T21:30:54.162601+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14080 : cephadm [INF] Daemon osd.253 on dig-osd4 was already removed
2025-04-24T21:30:54.164440+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14081 : cephadm [INF] Successfully destroyed old osd.253 on dig-osd4; ready for replacement
2025-04-24T21:30:54.164536+0000 mgr.ijc-mon1.yyfnhz (mgr.55376028) 14082 : cephadm [INF] Zapping devices for osd.253 on dig-osd4
-----
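For completeness, the commands I used above were roughly the following (the
exact steps from the Red Hat KB entry may differ slightly):

ceph config set mgr mgr/cephadm/warn_on_stray_daemons false
ceph mgr fail
ceph orch upgrade stop
ceph orch upgrade start --ceph-version 18.2.6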
The messages seem to concern only one of the 2 destroyed OSDs
since I restarted the mgr. Could this be the cause of the stuck
upgrade? What can I do to fix it?
Thanks in advance for any hint. Best regards,
Michel
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io