Eugen,

Thanks for doing the test. I scanned all the logs and cannot find anything except the message I mentioned, displayed every 10s about the removed OSDs, which makes me think something is not quite as expected... No clue what...

Michel
Sent from my mobile
On 28 April 2025 at 12:43:23, Eugen Block <ebl...@nde.ag> wrote:

I just tried this on a single-node virtual test cluster, deployed it
with 18.2.2. Then I removed one OSD with the --replace flag (no --zap,
otherwise it would redeploy the OSD on that VM). Then I also see the
stray daemon warning, but the upgrade from 18.2.2 to 18.2.6 finished
successfully. That's why I don't think the stray daemon is the root
cause here. I would suggest scanning the monitor and cephadm logs as well.
After the upgrade to 18.2.6 the stray warning cleared, btw.
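For reference, the sequence on the test VM was roughly the following
(osd.0 is just an example id here, adjust as needed):

ceph orch osd rm 0 --replace     # no --zap, to avoid redeployment on the same VM
ceph orch osd rm status          # wait until the removal has finished
ceph orch upgrade start --ceph-version 18.2.6
ceph orch upgrade status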


Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:

Eugen,

As said in a previous message, I found a tracker issue with a
similar problem: https://tracker.ceph.com/issues/67018, even if the
cause may be different as it concerns older versions than mine. For
some reason the sequence of messages every 10s is now back for the 2 OSDs:

2025-04-28T10:00:28.226741+0200 mgr.dig-mon1.fownxo [INF] osd.253 now down
2025-04-28T10:00:28.227249+0200 mgr.dig-mon1.fownxo [INF] Daemon
osd.253 on dig-osd4 was already removed
2025-04-28T10:00:28.228929+0200 mgr.dig-mon1.fownxo [INF]
Successfully destroyed old osd.253 on dig-osd4; ready for replacement
2025-04-28T10:00:28.228994+0200 mgr.dig-mon1.fownxo [INF] Zapping
devices for osd.253 on dig-osd4
2025-04-28T10:00:39.132028+0200 mgr.dig-mon1.fownxo [INF] osd.381 now down
2025-04-28T10:00:39.132599+0200 mgr.dig-mon1.fownxo [INF] Daemon
osd.381 on dig-osd6 was already removed
2025-04-28T10:00:39.133424+0200 mgr.dig-mon1.fownxo [INF]
Successfully destroyed old osd.381 on dig-osd6; ready for replacement

except that the "Zapping.." message is not present for the second OSD...

I tried to increase the mgr log verbosity with 'ceph tell
mgr.dig-mon1.fownxo config set debug_mgr 20/20' and then stopped and
restarted the upgrade, without any additional message being displayed.
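In case it is useful, the sequence was roughly like this (the mgr
daemon name is of course specific to my cluster, and the last command
is just one way to look at the cluster log):

ceph tell mgr.dig-mon1.fownxo config set debug_mgr 20/20
ceph orch upgrade stop
ceph orch upgrade start --ceph-version 18.2.6
ceph log last 200 debug cephadm    # nothing beyond the messages quoted above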

Michel

On 28/04/2025 at 09:20, Eugen Block wrote:
Have you increased the debug level for the mgr? It would surprise
me if stray daemons really blocked an upgrade. But debug logs
might reveal something. And if it can be confirmed that the strays
really block the upgrade, you could either remove the OSDs entirely
(they are already drained) to continue upgrading, or create a
tracker issue to report this and wait for instructions.
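If you go the removal route, a rough sketch (untested against your
cluster, OSD ids taken from your earlier output) could be:

ceph osd purge 253 --yes-i-really-mean-it
ceph osd purge 381 --yes-i-really-mean-it

Note that purge removes the OSDs from the CRUSH map and osdmap
entirely instead of keeping them marked as destroyed for replacement.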

Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:

Hi Eugen,

Yes, I stopped and restarted the upgrade several times already, in
particular after failing over the mgr. And the only related messages
are the "upgrade started" and "upgrade canceled" ones. Nothing
related to an error or a crash...

For me the question is why I have stray daemons after removing the
OSDs. IMO it is unexpected, as these daemons are not there anymore.
I can understand that stray daemons prevent the upgrade from starting
if they are really stray... And it would be nice if cephadm gave a
message about why the upgrade does not really start despite its
status being "in progress"...

Best regards,

Michel
Sent from my mobile
On 28 April 2025 at 07:27:44, Eugen Block <ebl...@nde.ag> wrote:

Do you see anything in the mgr log? To get fresh logs I would cancel
the upgrade (ceph orch upgrade stop) and then try again.
A workaround could be to manually upgrade the mgr daemons by changing
their unit.run file, but that would be my last resort. Btw, did you
stop and start the upgrade after failing the mgr as well?
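Just to illustrate the unit.run idea, a rough sketch only (<fsid> and
the mgr daemon name are placeholders, and the image reference in
unit.run may be pinned by digest rather than by tag, so check the
file before editing):

# on the host running the mgr daemon
grep ceph/ceph /var/lib/ceph/<fsid>/mgr.<name>/unit.run    # inspect the current image reference
sed -i 's#quay.io/ceph/ceph:v18.2.2#quay.io/ceph/ceph:v18.2.6#' /var/lib/ceph/<fsid>/mgr.<name>/unit.run
systemctl restart ceph-<fsid>@mgr.<name>.service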

Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:

Eugen,

Thanks for the hint. Here is the osd_remove_queue:

[root@ijc-mon1 ~]# ceph config-key get mgr/cephadm/osd_remove_queue|jq
[
 {
   "osd_id": 253,
   "started": true,
   "draining": false,
   "stopped": false,
   "replace": true,
   "force": false,
   "zap": true,
   "hostname": "dig-osd4",
   "drain_started_at": null,
   "drain_stopped_at": null,
   "drain_done_at": "2025-04-15T14:09:30.521534Z",
   "process_started_at": "2025-04-15T14:09:14.091592Z"
 },
 {
   "osd_id": 381,
   "started": true,
   "draining": false,
   "stopped": false,
   "replace": true,
   "force": false,
   "zap": false,
   "hostname": "dig-osd6",
   "drain_started_at": "2025-04-23T11:56:09.864724Z",
   "drain_stopped_at": null,
   "drain_done_at": "2025-04-25T06:53:03.678729Z",
   "process_started_at": "2025-04-23T11:56:05.924923Z"
 }
]

It is not empty; the two OSDs reported as stray daemons are listed. I
am not sure whether these entries are expected, as I specified
--replace... A similar issue was reported in
https://tracker.ceph.com/issues/67018, so before Reef, but the cause
may be different. It is still not clear to me how to get out of this,
except maybe by replacing the OSDs, but this will take some time...
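One idea I have not tried yet, so just a sketch, and I would back up
the key first: clearing the queue manually and then failing the mgr,
e.g.

ceph config-key get mgr/cephadm/osd_remove_queue > osd_remove_queue.backup.json
ceph config-key set mgr/cephadm/osd_remove_queue '[]'
ceph mgr fail

But I would rather understand first whether these entries are
legitimate before touching them.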

Best regards,

Michel

On 27/04/2025 at 10:21, Eugen Block wrote:
Hi,

what's the current ceph status? Wasn't there a bug in early Reef
versions preventing upgrades if there were removed OSDs in the
queue? But IIRC, the cephadm module would crash. Can you check

ceph config-key get mgr/cephadm/osd_remove_queue

And then I would check the mgr log, maybe set it to a higher debug
level to see what's blocking it.
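Something along these lines should do (a sketch; <fsid> and the mgr
daemon name are placeholders for your cluster):

ceph config set mgr debug_mgr 20
# stop/start the upgrade to reproduce, then look at the mgr log on its host:
journalctl -u ceph-<fsid>@mgr.<name>.service --since "15 minutes ago"
ceph config rm mgr debug_mgr    # revert to the default afterwards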

Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:

Hi,

I tried to restart all the mgrs (we have 3: 1 active, 2 standby)
by executing `ceph mgr fail` 3 times, with no impact. I don't
really understand why I get these stray daemons after doing a
`ceph orch osd rm --replace`, but I think I have always seen this.
I tried to mute rather than disable the stray daemon check but it
doesn't help either. And I find this message every 10s strange: it
concerns one of the destroyed OSDs, and only one, reporting that it
is down and already destroyed and saying it'll zap it (I think I
didn't add --zap when I removed it as the underlying disk is dead).
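For the record, the mute I tried looks roughly like this:

ceph health mute CEPHADM_STRAY_DAEMON    # an optional TTL can be appended, e.g. 1w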

I'm completely stuck with this upgrade and I don't remember having
this kind of problem in previous upgrades with cephadm... Any
idea where to look for the cause and/or how to fix it?

Best regards,

Michel

On 24/04/2025 at 23:34, Michel Jouvin wrote:
Hi,

I'm trying to upgrade a (cephadm) cluster from 18.2.2 to 18.2.6,
using 'ceph orch upgrade'. When I enter the command 'ceph orch
upgrade start --ceph-version 18.2.6', I receive a message saying
that the upgrade has been initiated, with a similar message in
the logs but nothing happens after this. 'ceph orch upgrade
status' says:

-------

[root@ijc-mon1 ~]# ceph orch upgrade status
{
   "target_image": "quay.io/ceph/ceph:v18.2.6",
   "in_progress": true,
   "which": "Upgrading all daemon types on all hosts",
   "services_complete": [],
   "progress": "",
   "message": "",
   "is_paused": false
}
-------

The first time I entered the command, the cluster status was
HEALTH_WARN because of 2 stray daemons (caused by destroyed OSDs,
rm --replace). I set mgr/cephadm/warn_on_stray_daemons to false
to ignore these 2 daemons; the cluster is now HEALTH_OK but it
doesn't help. Following a Red Hat KB entry, I tried to fail over
the mgr and stopped and restarted the upgrade, but without any
improvement. I have not seen anything in the logs, except that
there is an INF entry every 10s about the destroyed OSD saying:

------

2025-04-24T21:30:54.161988+0000 mgr.ijc-mon1.yyfnhz
(mgr.55376028) 14079 : cephadm [INF] osd.253 now down
2025-04-24T21:30:54.162601+0000 mgr.ijc-mon1.yyfnhz
(mgr.55376028) 14080 : cephadm [INF] Daemon osd.253 on dig-osd4
was already removed
2025-04-24T21:30:54.164440+0000 mgr.ijc-mon1.yyfnhz
(mgr.55376028) 14081 : cephadm [INF] Successfully destroyed old
osd.253 on dig-osd4; ready for replacement
2025-04-24T21:30:54.164536+0000 mgr.ijc-mon1.yyfnhz
(mgr.55376028) 14082 : cephadm [INF] Zapping devices for osd.253
on dig-osd4
-----

Since I restarted the mgr, the message seems to appear for only one
of the 2 destroyed OSDs. Could this be the cause of the stuck
upgrade? What can I do to fix this?

Thanks in advance for any hint. Best regards,

Michel

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
