Eugen,

Thanks, I forgot that operations started with the orchestrator can be stopped. You were right: stopping the 'osd rm' was enough to unblock the upgrade. I am not completely sure what the consequence is for the replace flag: I have the feeling it has been lost somehow, as the OSD is no longer listed by 'ceph orch osd rm status' and 'ceph -s' now reports one OSD down and 1 stray daemon instead of 2 stray daemons.
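To double-check what is left of the replace state, I guess something like the following should show whether the OSD is still marked as destroyed in the CRUSH tree and whether it is still in the removal queue (the config-key command is the one from earlier in this thread):

ceph osd tree | grep destroyed
ceph config-key get mgr/cephadm/osd_remove_queue | jq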

Michel

On 30/04/2025 at 09:24, Eugen Block wrote:
You can stop the osd removal:

ceph orch osd rm stop <OSD_ID>

I'm not entirely sure what the orchestrator will do except for clearing the pending state, and since the OSDs are already marked as destroyed in the crush tree, I wouldn't expect anything weird. But it's worth a try, I guess.
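For example, for the OSD that is stuck zapping, and then check that it's gone from the queue:

ceph orch osd rm stop 253
ceph orch osd rm status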

Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:

Hi,

I had no time to really investigate our problem further yesterday. But I realized one thing that may explain the problem with osd.253: the underlying disk is so dead that it is no longer visible to the OS. I probably added --zap when I did the 'ceph orch osd rm', and thus it is trying to do the zapping, failing because it cannot find the disk, and retrying indefinitely... I remain a little surprised that this zapping error is not reported (without the traceback) at the INFO level and requires DEBUG to be seen, but that is a detail. I'm also surprised that Ceph does not give up on zapping if it cannot access the device; or did I miss something, and is there a way to stop this process?
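For reference, the dead disk no longer shows up on the host (e.g. in lsblk), and I would expect the same from the orchestrator's inventory:

lsblk
ceph orch device ls dig-osd4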

Maybe it is a corner case that has been fixed/improved since 18.2.2... Anyway, the question remains: is there a way out of this problem (which seems to be the only reason the upgrade is not really starting) apart from getting the replacement device?

Best regards,

Michel

On 28/04/2025 at 18:19, Michel Jouvin wrote:
Hi Frédéric,

Thanks for the command. I'm always looking at the wrong page of the doc! I looked at https://docs.ceph.com/en/latest/rados/troubleshooting/log-and-debug/ which lists the Ceph subsystems and their default log levels, but there is no mention of cephadm there... After enabling the cephadm debug log level and restarting the upgrade, I got the messages below. The only strange thing points to the problem with osd.253, where it tries to zap the device that was probably already zapped and thus cannot find the LV associated with osd.253. There are not really any other messages stating the impact on the upgrade, but I guess this is the reason. What do you think? And is there any way to fix it, other than replacing the OSD?

Best regards,

Michel

--------------------- cephadm debug level log -------------------------

2025-04-28T17:32:12.713746+0200 mgr.dig-mon1.fownxo [INF] Upgrade: Started with target quay.io/ceph/ceph:v18.2.6
2025-04-28T17:32:14.822030+0200 mgr.dig-mon1.fownxo [DBG] Refreshed host dig-osd4 devices (23)
2025-04-28T17:32:14.822550+0200 mgr.dig-mon1.fownxo [DBG] Finding OSDSpecs for host: <dig-osd4>
2025-04-28T17:32:14.822614+0200 mgr.dig-mon1.fownxo [DBG] Generating OSDSpec previews for []
2025-04-28T17:32:14.822695+0200 mgr.dig-mon1.fownxo [DBG] Loading OSDSpec previews to HostCache for host <dig-osd4>
2025-04-28T17:32:14.985257+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'config generate-minimal-conf' -> 0 in 0.005s
2025-04-28T17:32:15.262102+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'auth get' -> 0 in 0.277s
2025-04-28T17:32:15.262751+0200 mgr.dig-mon1.fownxo [DBG] Combine hosts with existing daemons [] + new hosts.... (very long line)

2025-04-28T17:32:15.416491+0200 mgr.dig-mon1.fownxo [DBG] _update_paused_health
2025-04-28T17:32:17.314607+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd df' -> 0 in 0.064s
2025-04-28T17:32:17.637526+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd df' -> 0 in 0.320s
2025-04-28T17:32:17.645703+0200 mgr.dig-mon1.fownxo [DBG] 2 OSDs are scheduled for removal: [osd.381, osd.253]
2025-04-28T17:32:17.661910+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd df' -> 0 in 0.011s
2025-04-28T17:32:17.667068+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd safe-to-destroy' -> 0 in 0.002s
2025-04-28T17:32:17.667117+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd safe-to-destroy returns:
2025-04-28T17:32:17.667164+0200 mgr.dig-mon1.fownxo [DBG] running cmd: osd down on ids [osd.381]
2025-04-28T17:32:17.667854+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd down' -> 0 in 0.001s
2025-04-28T17:32:17.667908+0200 mgr.dig-mon1.fownxo [INF] osd.381 now down
2025-04-28T17:32:17.668446+0200 mgr.dig-mon1.fownxo [INF] Daemon osd.381 on dig-osd6 was already removed
2025-04-28T17:32:17.669534+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd destroy-actual' -> 0 in 0.001s
2025-04-28T17:32:17.669675+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd destroy-actual returns:
2025-04-28T17:32:17.669789+0200 mgr.dig-mon1.fownxo [INF] Successfully destroyed old osd.381 on dig-osd6; ready for replacement
2025-04-28T17:32:17.669874+0200 mgr.dig-mon1.fownxo [DBG] Removing osd.381 from the queue.
2025-04-28T17:32:17.680411+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd df' -> 0 in 0.010s
2025-04-28T17:32:17.685141+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd safe-to-destroy' -> 0 in 0.002s
2025-04-28T17:32:17.685190+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd safe-to-destroy returns:
2025-04-28T17:32:17.685234+0200 mgr.dig-mon1.fownxo [DBG] running cmd: osd down on ids [osd.253]
2025-04-28T17:32:17.685710+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd down' -> 0 in 0.000s
2025-04-28T17:32:17.685759+0200 mgr.dig-mon1.fownxo [INF] osd.253 now down
2025-04-28T17:32:17.686186+0200 mgr.dig-mon1.fownxo [INF] Daemon osd.253 on dig-osd4 was already removed
2025-04-28T17:32:17.687068+0200 mgr.dig-mon1.fownxo [DBG] mon_command: 'osd destroy-actual' -> 0 in 0.001s
2025-04-28T17:32:17.687102+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd destroy-actual returns:
2025-04-28T17:32:17.687141+0200 mgr.dig-mon1.fownxo [INF] Successfully destroyed old osd.253 on dig-osd4; ready for replacement
2025-04-28T17:32:17.687176+0200 mgr.dig-mon1.fownxo [INF] Zapping devices for osd.253 on dig-osd4
2025-04-28T17:32:17.687508+0200 mgr.dig-mon1.fownxo [DBG] _run_cephadm : command = ceph-volume
2025-04-28T17:32:17.687554+0200 mgr.dig-mon1.fownxo [DBG] _run_cephadm : args = ['--', 'lvm', 'zap', '--osd-id', '253', '--destroy']
2025-04-28T17:32:17.687637+0200 mgr.dig-mon1.fownxo [DBG] osd container image quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
2025-04-28T17:32:17.687677+0200 mgr.dig-mon1.fownxo [DBG] args: --image quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f --timeout 895 ceph-volume --fsid f5195e24-158c-11ee-b338-5ced8c61b074 -- lvm zap --osd-id 253 --destroy
2025-04-28T17:32:17.687733+0200 mgr.dig-mon1.fownxo [DBG] Running command: which python3
2025-04-28T17:32:17.731474+0200 mgr.dig-mon1.fownxo [DBG] Running command: /usr/bin/python3 /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d --image quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f --timeout 895 ceph-volume --fsid f5195e24-158c-11ee-b338-5ced8c61b074 -- lvm zap --osd-id 253 --destroy
2025-04-28T17:32:20.406723+0200 mgr.dig-mon1.fownxo [DBG] code: 1
2025-04-28T17:32:20.406764+0200 mgr.dig-mon1.fownxo [DBG] err: Inferring config /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/config/ceph.conf Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f -e NODE_NAME=dig-osd4 -e CEPH_USE_RANDOM_NONCE=1 -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/run/ceph:z -v /var/log/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/log/ceph:z -v /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/crash:/var/lib/ceph/crash:z -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/selinux:/sys/fs/selinux:ro -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v /tmp/ceph-tmpgtvcw4gk:/etc/ceph/ceph.conf:z quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f lvm zap --osd-id 253 --destroy
/usr/bin/podman: stderr Traceback (most recent call last):
/usr/bin/podman: stderr   File "/usr/sbin/ceph-volume", line 11, in <module>
/usr/bin/podman: stderr     load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
/usr/bin/podman: stderr     self.main(self.argv)
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
/usr/bin/podman: stderr     return f(*a, **kw)
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
/usr/bin/podman: stderr     terminal.dispatch(self.mapper, subcommand_args)
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
/usr/bin/podman: stderr     instance.main()
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", line 46, in main
/usr/bin/podman: stderr     terminal.dispatch(self.mapper, self.argv)
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
/usr/bin/podman: stderr     instance.main()
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", line 403, in main
/usr/bin/podman: stderr     self.zap_osd()
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
/usr/bin/podman: stderr     return func(*a, **kw)
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", line 301, in zap_osd
/usr/bin/podman: stderr     devices = find_associated_devices(self.args.osd_id, self.args.osd_fsid)
/usr/bin/podman: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", line 88, in find_associated_devices
/usr/bin/podman: stderr     '%s' % osd_id or osd_fsid)
/usr/bin/podman: stderr RuntimeError: Unable to find any LV for zapping OSD: 253
Traceback (most recent call last):
  File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 10700, in <module>
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 10688, in main
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 2445, in _infer_config
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 2361, in _infer_fsid
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 2473, in _infer_image
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 2348, in _validate_fsid
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 6970, in command_ceph_volume
  File "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", line 2136, in call_throws
RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f -e NODE_NAME=dig-osd4 -e CEPH_USE_RANDOM_NONCE=1 -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/run/ceph:z -v /var/log/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/log/ceph:z -v /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/crash:/var/lib/ceph/crash:z -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/selinux:/sys/fs/selinux:ro -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v /tmp/ceph-tmpgtvcw4gk:/etc/ceph/ceph.conf:z quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f lvm zap --osd-id 253 --destroy
2025-04-28T17:32:20.409316+0200 mgr.dig-mon1.fownxo [DBG] serve loop sleep

-----------------------


On 28/04/2025 at 14:00, Frédéric Nass wrote:
Hi Michel,

You need to turn on cephadm debugging as described here [1] in the documentation

$ ceph config set mgr mgr/cephadm/log_to_cluster_level debug

and then look for any hints with

$ ceph -W cephadm --watch-debug

or

$ tail -f /var/log/ceph/$(ceph fsid)/ceph.cephadm.log (on the active MGR)

when you start/stop the upgrade.
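Once you're done, you can set the log level back (the default should be 'info'):

$ ceph config set mgr mgr/cephadm/log_to_cluster_level info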

Regards,
Frédéric.

[1] https://docs.ceph.com/en/reef/cephadm/operations/

----- On 28 Apr 25, at 12:52, Michel Jouvin michel.jou...@ijclab.in2p3.fr wrote:

Eugen,

Thanks for doing the test. I scanned all the logs and cannot find anything except the messages mentioned, displayed every 10s about the removed OSDs, which led me to think something is not exactly as expected... No clue what...

Michel
Sent from my mobile
On 28 April 2025 12:43:23, Eugen Block <ebl...@nde.ag> wrote:

I just tried this on a single-node virtual test cluster, deployed it
with 18.2.2. Then I removed one OSD with the --replace flag (no --zap,
otherwise it would redeploy the OSD on that VM). Then I also saw the
stray daemon warning, but the upgrade from 18.2.2 to 18.2.6 finished
successfully. That's why I don't think the stray daemon is the root
cause here. I would suggest scanning the monitor and cephadm logs as well.
After the upgrade to 18.2.6 the stray warning cleared, btw.
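Roughly what I did, from memory (device path and OSD id are just placeholders):

cephadm bootstrap --image quay.io/ceph/ceph:v18.2.2 --mon-ip <MON_IP>
ceph orch daemon add osd <host>:/dev/vdb
ceph orch osd rm <OSD_ID> --replace
ceph orch upgrade start --ceph-version 18.2.6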


Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:

Eugen,

As said in a previous message, I found a tracker issue with a
similar problem: https://tracker.ceph.com/issues/67018, even if the
cause may be different, as it concerns older versions than mine. For
some reason, the sequence of messages every 10s is now back for the 2 OSDs:

2025-04-28T10:00:28.226741+0200 mgr.dig-mon1.fownxo [INF] osd.253 now down
2025-04-28T10:00:28.227249+0200 mgr.dig-mon1.fownxo [INF] Daemon
osd.253 on dig-osd4 was already removed
2025-04-28T10:00:28.228929+0200 mgr.dig-mon1.fownxo [INF]
Successfully destroyed old osd.253 on dig-osd4; ready for replacement
2025-04-28T10:00:28.228994+0200 mgr.dig-mon1.fownxo [INF] Zapping
devices for osd.253 on dig-osd4
2025-04-28T10:00:39.132028+0200 mgr.dig-mon1.fownxo [INF] osd.381 now down
2025-04-28T10:00:39.132599+0200 mgr.dig-mon1.fownxo [INF] Daemon
osd.381 on dig-osd6 was already removed
2025-04-28T10:00:39.133424+0200 mgr.dig-mon1.fownxo [INF]
Successfully destroyed old osd.381 on dig-osd6; ready for replacement

except that the "Zapping.." message is not present for the second OSD...

I tried to increase the mgr log verbosity with 'ceph tell
mgr.dig-mon1.fownxo config set debug_mgr 20/20' and then
stopped/started the upgrade, without any additional message being displayed.

Michel

On 28/04/2025 at 09:20, Eugen Block wrote:
Have you increased the debug level for the mgr? It would surprise
me if stray daemons really blocked an upgrade. But debug logs
might reveal something. And if it can be confirmed that the strays
really block the upgrade, you could either remove the OSDs entirely
(they are already drained) to continue upgrading, or create a
tracker issue to report this and wait for instructions.
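Removing them entirely would mean losing the destroyed/replace state, so only do that if you're fine with redeploying them as brand-new OSDs later. Untested, but I would expect it to boil down to something like this for each of them:

ceph orch osd rm stop 253
ceph osd purge 253 --yes-i-really-mean-it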

Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:

Hi Eugen,

Yes, I have already stopped and restarted the upgrade several times,
in particular after failing over the mgr. And the only related
messages are the 'upgrade started' and 'upgrade canceled' ones.
Nothing related to an error or a crash...

For me the question is why I have stray daemons after removing
OSDs. IMO it is unexpected, as these daemons are not there anymore.
I could understand stray daemons preventing the upgrade from
starting if they were really stray... And it would be nice if
cephadm gave a message about why the upgrade does not really start
despite its status being "in progress"...

Best regards,

Michel
Sent from my mobile
On 28 April 2025 07:27:44, Eugen Block <ebl...@nde.ag> wrote:

Do you see anything in the mgr log? To get fresh logs I would cancel
the upgrade (ceph orch upgrade stop) and then try again.
A workaround could be to manually upgrade the mgr daemons by changing their unit.run file, but that would be my last resort. Btw, did you
stop and start the upgrade after failing the mgr as well?

Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:

Eugen,

Thanks for the hint. Here is the osd_remove_queue:

[root@ijc-mon1 ~]# ceph config-key get mgr/cephadm/osd_remove_queue|jq
[
  {
    "osd_id": 253,
    "started": true,
    "draining": false,
    "stopped": false,
    "replace": true,
    "force": false,
    "zap": true,
    "hostname": "dig-osd4",
    "drain_started_at": null,
    "drain_stopped_at": null,
    "drain_done_at": "2025-04-15T14:09:30.521534Z",
    "process_started_at": "2025-04-15T14:09:14.091592Z"
  },
  {
    "osd_id": 381,
    "started": true,
    "draining": false,
    "stopped": false,
    "replace": true,
    "force": false,
    "zap": false,
    "hostname": "dig-osd6",
    "drain_started_at": "2025-04-23T11:56:09.864724Z",
    "drain_stopped_at": null,
    "drain_done_at": "2025-04-25T06:53:03.678729Z",
    "process_started_at": "2025-04-23T11:56:05.924923Z"
  }
]

It is not empty: the two stray daemons are listed. I'm not sure whether these entries are expected, as I specified --replace... A similar issue was reported in https://tracker.ceph.com/issues/67018, so before Reef, but the cause may be different. It is still not clear to me how to get out of this, except maybe by replacing the OSDs, but that will take some time...
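(Looking at the two entries, this matches what I did: osd.253 was removed with something like 'ceph orch osd rm 253 --replace --zap', hence "zap": true, while osd.381 was removed without --zap.)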

Best regards,

Michel

On 27/04/2025 at 10:21, Eugen Block wrote:
Hi,

What's the current ceph status? Wasn't there a bug in early Reef
versions preventing upgrades if there were removed OSDs in the
queue? But IIRC, the cephadm module would crash. Can you check

ceph config-key get mgr/cephadm/osd_remove_queue

And then I would check the mgr log, maybe set it to a higher debug
level to see what's blocking it.
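For example (and reset it afterwards with 'ceph config rm mgr debug_mgr'):

ceph config set mgr debug_mgr 10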

Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:

Hi,

I tried to restart all the mgrs (we have 3: 1 active, 2 standby)
by executing 'ceph mgr fail' 3 times, with no impact. I don't
really understand why I get these stray daemons after doing a
'ceph orch osd rm --replace', but I think I have always seen this. I tried to mute rather than disable the stray daemon check, but it doesn't help either. And I find it strange to see this message every 10s about one of the destroyed OSDs, and only one, reporting that it is down and already destroyed and saying it'll zap it (I think I didn't
add --zap when I removed it, as the underlying disk is dead).
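(The mute was something like 'ceph health mute CEPHADM_STRAY_DAEMON 1w', the duration being arbitrary.)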

I'm completely stuck with this upgrade and I don't remember having this kind of problem in previous upgrades with cephadm... Any
idea where to look for the cause and/or how to fix it?

Best regards,

Michel

On 24/04/2025 at 23:34, Michel Jouvin wrote:
Hi,

I'm trying to upgrade a (cephadm) cluster from 18.2.2 to 18.2.6, using 'ceph orch upgrade'. When I enter the command 'ceph orch upgrade start --ceph-version 18.2.6', I receive a message saying that the upgrade has been initiated, with a similar message in
the logs but nothing happens after this. 'ceph orch upgrade
status' says:

-------

[root@ijc-mon1 ~]# ceph orch upgrade status
{
    "target_image": "quay.io/ceph/ceph:v18.2.6",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [],
    "progress": "",
    "message": "",
    "is_paused": false
}
-------

The first time I entered the command, the cluster status was
HEALTH_WARN because of 2 stray daemons (caused by destroyed OSDs, rm --replace). I set mgr/cephadm/warn_on_stray_daemons to false to ignore these 2 daemons; the cluster is now HEALTH_OK, but it doesn't help. Following a Red Hat KB entry, I tried to fail over
the mgr, then stopped and restarted the upgrade, but without any
improvement. I have not seen anything in the logs, except that there is an INF entry every 10s about the destroyed OSD saying:

------

2025-04-24T21:30:54.161988+0000 mgr.ijc-mon1.yyfnhz
(mgr.55376028) 14079 : cephadm [INF] osd.253 now down
2025-04-24T21:30:54.162601+0000 mgr.ijc-mon1.yyfnhz
(mgr.55376028) 14080 : cephadm [INF] Daemon osd.253 on dig-osd4
was already removed
2025-04-24T21:30:54.164440+0000 mgr.ijc-mon1.yyfnhz
(mgr.55376028) 14081 : cephadm [INF] Successfully destroyed old
osd.253 on dig-osd4; ready for replacement
2025-04-24T21:30:54.164536+0000 mgr.ijc-mon1.yyfnhz
(mgr.55376028) 14082 : cephadm [INF] Zapping devices for osd.253
on dig-osd4
-----

Since I restarted the mgr, the message seems to appear for only one
of the 2 destroyed OSDs. Could this be the cause of the stuck
upgrade? What can I do to fix this?

Thanks in advance for any hint. Best regards,

Michel

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io