Eugen,

Thanks, I forgot that operations started with the orchestrator can be stopped. You were right: stopping the 'osd rm' was enough to unblock the upgrade. I am not completely sure what the consequence is for the replace flag: I have the feeling it has been lost somehow, as the OSD is no longer listed by 'ceph orch osd rm status' and 'ceph -s' now reports one OSD down and 1 stray daemon instead of 2 stray daemons.
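To double-check what is left of the replace state, I guess something like the following should show whether the OSD is still marked as destroyed in the CRUSH tree and whether it is still in the removal queue (the config-key command is the one from earlier in this thread):

ceph osd tree | grep destroyed
ceph config-key get mgr/cephadm/osd_remove_queue | jq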

Michel

On 30/04/2025 at 09:24, Eugen Block wrote:
You can stop the osd removal:

ceph orch osd rm stop <OSD_ID>

I'm not entirely sure what the orchestrator will do except for clearing the pending state, and since the OSDs are already marked as destroyed in the crush tree, I wouldn't expect anything weird. But it's worth a try, I guess.
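For example, for the OSD that is stuck zapping, and then check that it's gone from the queue:

ceph orch osd rm stop 253
ceph orch osd rm status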

Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:

Hi,

I had no time to really investigate our problem further yesterday. But I realized one thing that may explain the problem with osd.253: the underlying disk is so dead that it is no longer visible to the OS. I probably added --zap when I did the 'ceph orch osd rm', and thus it is trying to do the zapping, failing because it cannot find the disk, and retrying indefinitely... I remain a little surprised that this zapping error is not reported (without the traceback) at the INFO level and requires DEBUG to be seen, but that is a detail. I'm also surprised that Ceph does not give up on zapping if it cannot access the device; or did I miss something, and is there a way to stop this process?
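For reference, the dead disk no longer shows up on the host (e.g. in lsblk), and I would expect the same from the orchestrator's inventory:

lsblk
ceph orch device ls dig-osd4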

Maybe it is a corner case that has been fixed/improved since 18.2.2... Anyway, the question remains: is there a way out of this problem (which seems to be the only reason the upgrade is not really starting) apart from getting the replacement device?

Best regards,

Michel

On 28/04/2025 at 18:19, Michel Jouvin wrote:
Hi Frédéric,

Thanks for the command. I'm always looking at the wrong page of the doc! I looked at https://docs.ceph.com/en/latest/rados/troubleshooting/log-and-debug/ which lists the Ceph subsystems and their default log levels, but there is no mention of cephadm there... After enabling the cephadm debug log level and restarting the upgrade, I got the messages below. The only strange thing points to the problem with osd.253, where it tries to zap the device that was probably already zapped and thus cannot find the LV associated with osd.253. There are not really any other messages stating the impact on the upgrade, but I guess this is the reason. What do you think? And is there any way to fix it, other than replacing the OSD?

Best regards,

Michel

--------------------- cephadm debug level log -------------------------

2025-04-28T17:32:12.713746+0200 mgr.dig-mon1.fownxo [INF] Upgrade: Started with target quay.io/ceph/ceph:v18.2.6
2025-04-28T17:32:14.822030+0200 mgr.dig-mon1.fownxo [DBG] Refreshed host dig-osd4 devices (23)
2025-04-28T17:32:14.822550+0200 mgr.dig-mon1.fownxo [DBG] Finding OSDSpecs for host: <dig-osd4>
2025-04-28T17:32:14.822614+0200 mgr.dig-mon1.fownxo [DBG] Generating OSDSpec previews for []
2025-04-28T17:32:14.822695+0200 mgr.dig-mon1.fownxo [DBG] Loading OSDSpec previews to HostCache for host <dig-osd4>
2025-04-28T17:32:14.985257+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'config generate-minimal-conf' -> 0 in 0.005s
2025-04-28T17:32:15.262102+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'auth get' -> 0 in 0.277s
2025-04-28T17:32:15.262751+0200 mgr.dig-mon1.fownxo [DBG] Combine hosts with existing daemons [] + new hosts.... (very long line)

2025-04-28T17:32:15.416491+0200 mgr.dig-mon1.fownxo [DBG] _update_paused_health
2025-04-28T17:32:17.314607+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd df' -> 0 in 0.064s
2025-04-28T17:32:17.637526+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd df' -> 0 in 0.320s
2025-04-28T17:32:17.645703+0200 mgr.dig-mon1.fownxo [DBG] 2 OSDs are scheduled for removal: [osd.381, osd.253]
2025-04-28T17:32:17.661910+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd df' -> 0 in 0.011s
2025-04-28T17:32:17.667068+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd safe-to-destroy' -> 0 in 0.002s
2025-04-28T17:32:17.667117+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd safe-to-destroy returns:
2025-04-28T17:32:17.667164+0200 mgr.dig-mon1.fownxo [DBG] running cmd: osd down on ids [osd.381]
2025-04-28T17:32:17.667854+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd down' -> 0 in 0.001s
2025-04-28T17:32:17.667908+0200 mgr.dig-mon1.fownxo [INF] osd.381 now down
2025-04-28T17:32:17.668446+0200 mgr.dig-mon1.fownxo [INF] Daemon osd.381 on dig-osd6 was already removed
2025-04-28T17:32:17.669534+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd destroy-actual' -> 0 in 0.001s
2025-04-28T17:32:17.669675+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd destroy-actual returns:
2025-04-28T17:32:17.669789+0200 mgr.dig-mon1.fownxo [INF] Successfully destroyed old osd.381 on dig-osd6; ready for replacement
2025-04-28T17:32:17.669874+0200 mgr.dig-mon1.fownxo [DBG] Removing osd.381 from the queue.
2025-04-28T17:32:17.680411+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd df' -> 0 in 0.010s
2025-04-28T17:32:17.685141+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd safe-to-destroy' -> 0 in 0.002s
2025-04-28T17:32:17.685190+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd safe-to-destroy returns:
2025-04-28T17:32:17.685234+0200 mgr.dig-mon1.fownxo [DBG] running cmd: osd down on ids [osd.253]
2025-04-28T17:32:17.685710+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd down' -> 0 in 0.000s
2025-04-28T17:32:17.685759+0200 mgr.dig-mon1.fownxo [INF] osd.253 now down
2025-04-28T17:32:17.686186+0200 mgr.dig-mon1.fownxo [INF] Daemon osd.253 on dig-osd4 was already removed
2025-04-28T17:32:17.687068+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd destroy-actual' -> 0 in 0.001s
2025-04-28T17:32:17.687102+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd destroy-actual returns:
2025-04-28T17:32:17.687141+0200 mgr.dig-mon1.fownxo [INF] Successfully destroyed old osd.253 on dig-osd4; ready for replacement
2025-04-28T17:32:17.687176+0200 mgr.dig-mon1.fownxo [INF] Zapping devices for osd.253 on dig-osd4
2025-04-28T17:32:17.687508+0200 mgr.dig-mon1.fownxo [DBG] _run_cephadm : command = ceph-volume
2025-04-28T17:32:17.687554+0200 mgr.dig-mon1.fownxo [DBG] _run_cephadm : args = ['--', 'lvm', 'zap', '--osd-id', '253', '--destroy']
2025-04-28T17:32:17.687637+0200 mgr.dig-mon1.fownxo [DBG] osd container image quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
2025-04-28T17:32:17.687677+0200 mgr.dig-mon1.fownxo [DBG] args: --image quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f --timeout 895 ceph-volume --fsid f5195e24-158c-11ee-b338-5ced8c61b074 -- lvm zap --osd-id 253 --destroy
2025-04-28T17:32:17.687733+0200 mgr.dig-mon1.fownxo [DBG] Running command: which python3
2025-04-28T17:32:17.731474+0200 mgr.dig-mon1.fownxo [DBG] Running command: /usr/bin/python3 /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d --image quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f --timeout 895 ceph-volume --fsid f5195e24-158c-11ee-b338-5ced8c61b074 -- lvm zap --osd-id 253 --destroy
2025-04-28T17:32:20.406723+0200 mgr.dig-mon1.fownxo [DBG] code: 1
2025-04-28T17:32:20.406764+0200 mgr.dig-mon1.fownxo [DBG] err: Inferring config /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/config/ceph.conf Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f -e NODE_NAME=dig-osd4 -e CEPH_USE_RANDOM_NONCE=1 -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/run/ceph:z -v /var/log/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/log/ceph:z -v /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/crash:/var/lib/ceph/crash:z -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/selinux:/sys/fs/selinux:ro -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v /tmp/ceph-tmpgtvcw4gk:/etc/ceph/ceph.conf:z quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f lvm zap --osd-id 253 --destroy
/usr/bin/podman: stderr Traceback (most recent call last):
/usr/bin/podman: stderr   File "/usr/sbin/ceph-volume", line 11, in <module>
/usr/bin/podman: stderr     load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
/usr/bin/podman: stderr     self.main(self.argv)
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
/usr/bin/podman: stderr     return f(*a, **kw)
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
/usr/bin/podman: stderr     terminal.dispatch(self.mapper, subcommand_args)
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
/usr/bin/podman: stderr     instance.main()
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", line 46, in main
/usr/bin/podman: stderr     terminal.dispatch(self.mapper, self.argv)
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
/usr/bin/podman: stderr     instance.main()
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", line 403, in main
/usr/bin/podman: stderr     self.zap_osd()
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
/usr/bin/podman: stderr     return func(*a, **kw)
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", line 301, in zap_osd
/usr/bin/podman: stderr     devices = find_associated_devices(self.args.osd_id, self.args.osd_fsid)
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", line 88, in find_associated_devices
/usr/bin/podman: stderr     '%s' % osd_id or osd_fsid)
/usr/bin/podman: stderr RuntimeError: Unable to find any LV for zapping OSD: 253
Traceback (most recent call last):
  File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 10700, in <module>
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 10688, in main
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 2445, in _infer_config
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 2361, in _infer_fsid
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 2473, in _infer_image
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 2348, in _validate_fsid
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 6970, in command_ceph_volume
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 2136, in call_throws
RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f -e NODE_NAME=dig-osd4 -e CEPH_USE_RANDOM_NONCE=1 -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/run/ceph:z -v /var/log/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/log/ceph:z -v /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/crash:/var/lib/ceph/crash:z -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/selinux:/sys/fs/selinux:ro -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v /tmp/ceph-tmpgtvcw4gk:/etc/ceph/ceph.conf:z quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f lvm zap --osd-id 253 --destroy
2025-04-28T17:32:20.409316+0200 mgr.dig-mon1.fownxo [DBG] serve loop sleep

-----------------------


On 28/04/2025 at 14:00, Frédéric Nass wrote:
Hi Michel,

You need to turn on cephadm debugging as described here [1] in the documentation

$ ceph config set mgr mgr/cephadm/log_to_cluster_level debug

and then look for any hints with

$ ceph -W cephadm --watch-debug

or

$ tail -f /var/log/ceph/$(ceph fsid)/ceph.cephadm.log (on the active MGR)

when you start/stop the upgrade.
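Once you're done, you can set the log level back (the default should be 'info'):

$ ceph config set mgr mgr/cephadm/log_to_cluster_level info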

Regards,
Frédéric.

[1] https://docs.ceph.com/en/reef/cephadm/operations/

----- On 28 Apr 25, at 12:52, Michel Jouvin michel.jou...@ijclab.in2p3.fr wrote:

Eugen,

Thanks for doing the test. I scanned all the logs and cannot find anything except the messages mentioned, displayed every 10s about the removed OSDs, which led me to think something is not exactly as expected... No clue what...

Michel
Sent from my mobile
On 28 April 2025 12:43:23, Eugen Block <ebl...@nde.ag> wrote:

I just tried this on a single-node virtual test cluster, deployed it
with 18.2.2. Then I removed one OSD with the --replace flag (no --zap,
otherwise it would redeploy the OSD on that VM). Then I also saw the
stray daemon warning, but the upgrade from 18.2.2 to 18.2.6 finished
successfully. That's why I don't think the stray daemon is the root
cause here. I would suggest scanning the monitor and cephadm logs as well.
After the upgrade to 18.2.6 the stray warning cleared, btw.
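Roughly what I did, from memory (device path and OSD id are just placeholders):

cephadm bootstrap --image quay.io/ceph/ceph:v18.2.2 --mon-ip <MON_IP>
ceph orch daemon add osd <host>:/dev/vdb
ceph orch osd rm <OSD_ID> --replace
ceph orch upgrade start --ceph-version 18.2.6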


Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:

Eugen,

As said in a previous message, I found a tracker issue with a
similar problem: https://tracker.ceph.com/issues/67018, even if the
cause may be different, as it concerns older versions than mine. For
some reason, the sequence of messages every 10s is now back for the 2 OSDs:

2025-04-28T10:00:28.226741+0200 mgr.dig-mon1.fownxo [INF] osd.253 now down
2025-04-28T10:00:28.227249+0200 mgr.dig-mon1.fownxo [INF] Daemon
osd.253 on dig-osd4 was already removed
2025-04-28T10:00:28.228929+0200 mgr.dig-mon1.fownxo [INF]
Successfully destroyed old osd.253 on dig-osd4; ready for replacement
2025-04-28T10:00:28.228994+0200 mgr.dig-mon1.fownxo [INF] Zapping
devices for osd.253 on dig-osd4
2025-04-28T10:00:39.132028+0200 mgr.dig-mon1.fownxo [INF] osd.381 now down
2025-04-28T10:00:39.132599+0200 mgr.dig-mon1.fownxo [INF] Daemon
osd.381 on dig-osd6 was already removed
2025-04-28T10:00:39.133424+0200 mgr.dig-mon1.fownxo [INF]
Successfully destroyed old osd.381 on dig-osd6; ready for replacement

except that the "Zapping.." message is not present for the second OSD...

I tried to increase the mgr log verbosity with 'ceph tell
mgr.dig-mon1.fownxo config set debug_mgr 20/20' and then
stopped/started the upgrade, without any additional message being displayed.

Michel

On 28/04/2025 at 09:20, Eugen Block wrote:
Have you increased the debug level for the mgr? It would surprise
me if stray daemons really blocked an upgrade. But debug logs
might reveal something. And if it can be confirmed that the strays
really block the upgrade, you could either remove the OSDs entirely
(they are already drained) to continue upgrading, or create a
tracker issue to report this and wait for instructions.
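Removing them entirely would mean losing the destroyed/replace state, so only do that if you're fine with redeploying them as brand-new OSDs later. Untested, but I would expect it to boil down to something like this for each of them:

ceph orch osd rm stop 253
ceph osd purge 253 --yes-i-really-mean-it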

Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:

Hi Eugen,

Yes, I have already stopped and restarted the upgrade several times,
in particular after failing over the mgr. And the only related
messages are the 'upgrade started' and 'upgrade canceled' ones.
Nothing related to an error or a crash...

For me the question is why I have stray daemons after removing
OSDs. IMO it is unexpected, as these daemons are not there anymore.
I could understand stray daemons preventing the upgrade from
starting if they were really stray... And it would be nice if
cephadm gave a message about why the upgrade does not really start
despite its status being "in progress"...

Best regards,

Michel
Sent from my mobile
On 28 April 2025 07:27:44, Eugen Block <ebl...@nde.ag> wrote:

Do you see anything in the mgr log? To get fresh logs I would cancel
the upgrade (ceph orch upgrade stop) and then try again.
A workaround could be to manually upgrade the mgr daemons by changing their unit.run file, but that would be my last resort. Btw, did you
stop and start the upgrade after failing the mgr as well?

Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:

Eugen,

Thanks for the hint. Here is the osd_remove_queue:

[root@ijc-mon1 ~]# ceph config-key get mgr/cephadm/osd_remove_queue|jq
[
  {
    "osd_id": 253,
    "started": true,
    "draining": false,
    "stopped": false,
    "replace": true,
    "force": false,
    "zap": true,
    "hostname": "dig-osd4",
    "drain_started_at": null,
    "drain_stopped_at": null,
    "drain_done_at": "2025-04-15T14:09:30.521534Z",
    "process_started_at": "2025-04-15T14:09:14.091592Z"
  },
  {
    "osd_id": 381,
    "started": true,
    "draining": false,
    "stopped": false,
    "replace": true,
    "force": false,
    "zap": false,
    "hostname": "dig-osd6",
    "drain_started_at": "2025-04-23T11:56:09.864724Z",
    "drain_stopped_at": null,
    "drain_done_at": "2025-04-25T06:53:03.678729Z",
    "process_started_at": "2025-04-23T11:56:05.924923Z"
  }
]

It is not empty: the two stray daemons are listed. I'm not sure whether these entries are expected, as I specified --replace... A similar issue was reported in https://tracker.ceph.com/issues/67018, so before Reef, but the cause may be different. It is still not clear to me how to get out of this, except maybe by replacing the OSDs, but that will take some time...
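(Looking at the two entries, this matches what I did: osd.253 was removed with something like 'ceph orch osd rm 253 --replace --zap', hence "zap": true, while osd.381 was removed without --zap.)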

Best regards,

Michel

On 27/04/2025 at 10:21, Eugen Block wrote:
Hi,

What's the current ceph status? Wasn't there a bug in early Reef
versions preventing upgrades if there were removed OSDs in the
queue? But IIRC, the cephadm module would crash. Can you check

ceph config-key get mgr/cephadm/osd_remove_queue

And then I would check the mgr log, maybe set it to a higher debug
level to see what's blocking it.
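For example (and reset it afterwards with 'ceph config rm mgr debug_mgr'):

ceph config set mgr debug_mgr 10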

Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:

Hi,

I tried to restart all the mgrs (we have 3: 1 active, 2 standby)
by executing 'ceph mgr fail' 3 times, with no impact. I don't
really understand why I get these stray daemons after doing a
'ceph orch osd rm --replace', but I think I have always seen this. I tried to mute rather than disable the stray daemon check, but it doesn't help either. And I find it strange to see this message every 10s about one of the destroyed OSDs, and only one, reporting that it is down and already destroyed and saying it'll zap it (I think I didn't
add --zap when I removed it, as the underlying disk is dead).
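(The mute was something like 'ceph health mute CEPHADM_STRAY_DAEMON 1w', the duration being arbitrary.)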

I'm completely stuck with this upgrade and I don't remember having this kind of problem in previous upgrades with cephadm... Any
idea where to look for the cause and/or how to fix it?

Best regards,

Michel

On 24/04/2025 at 23:34, Michel Jouvin wrote:
Hi,

I'm trying to upgrade a (cephadm) cluster from 18.2.2 to 18.2.6, using 'ceph orch upgrade'. When I enter the command 'ceph orch upgrade start --ceph-version 18.2.6', I receive a message saying that the upgrade has been initiated, with a similar message in
the logs but nothing happens after this. 'ceph orch upgrade
status' says:

-------

[root@ijc-mon1 ~]# ceph orch upgrade status
{
    "target_image": "quay.io/ceph/ceph:v18.2.6",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [],
    "progress": "",
    "message": "",
    "is_paused": false
}
-------

The first time I entered the command, the cluster status was
HEALTH_WARN because of 2 stray daemons (caused by destroyed OSDs, rm --replace). I set mgr/cephadm/warn_on_stray_daemons to false to ignore these 2 daemons; the cluster is now HEALTH_OK, but it doesn't help. Following a Red Hat KB entry, I tried to fail over
the mgr, then stopped and restarted the upgrade, but without any
improvement. I have not seen anything in the logs, except that there is an INF entry every 10s about the destroyed OSD saying:

------

2025-04-24T21:30:54.161988+0000 mgr.ijc-mon1.yyfnhz
(mgr.55376028) 14079 : cephadm [INF] osd.253 now down
2025-04-24T21:30:54.162601+0000 mgr.ijc-mon1.yyfnhz
(mgr.55376028) 14080 : cephadm [INF] Daemon osd.253 on dig-osd4
was already removed
2025-04-24T21:30:54.164440+0000 mgr.ijc-mon1.yyfnhz
(mgr.55376028) 14081 : cephadm [INF] Successfully destroyed old
osd.253 on dig-osd4; ready for replacement
2025-04-24T21:30:54.164536+0000 mgr.ijc-mon1.yyfnhz
(mgr.55376028) 14082 : cephadm [INF] Zapping devices for osd.253
on dig-osd4
-----

Since I restarted the mgr, the message seems to appear for only one
of the 2 destroyed OSDs. Could this be the cause of the stuck
upgrade? What can I do to fix this?

Thanks in advance for any hint. Best regards,

Michel

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io