Hi Michel, I've seen this recently on Reef (OSD stuck in the rm queue with the orchestrator trying to zap a device that had already been zapped).
I could reproduce this a few times by deleting a batch of OSDs running on the same node. The whole 'ceph orch osd rm' process would stop progressing when trying to remove the ~8th OSD. I suspect that at some point ceph-volume or the orchestrator loses track of the fact that the device has already been zapped, and loops over and over trying to zap a device that doesn't exist anymore. I think you should now run 'ceph osd destroy <OSD_ID> --yes-i-really-mean-it'. Regards, Frédéric. ----- Le 30 Avr 25, à 10:28, Michel Jouvin michel.jou...@ijclab.in2p3.fr a écrit : > Eugen, > > Thanks, I forgot that operations started with the orchestrator can be > stopped. You were right: stopping the 'osd rm' was enough to unblock the > upgrade. I am not completely sure what the consequence is for the replace > flag: I have the feeling it has been lost somehow, as the OSD is no > longer listed by 'ceph orch osd rm status' and 'ceph -s' now reports one > OSD down and 1 stray daemon instead of 2 stray daemons. > > Michel > > Le 30/04/2025 à 09:24, Eugen Block a écrit : >> You can stop the osd removal: >> >> ceph orch osd rm stop <OSD_ID> >> >> I'm not entirely sure what the orchestrator will do except for >> clearing the pending state, and since the OSDs are already marked as >> destroyed in the crush tree, I wouldn't expect anything weird. But >> it's worth a try, I guess. >> >> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>: >> >>> Hi, >>> >>> I had no time to investigate our problem further yesterday. But I >>> realized one issue that may explain the problem with osd.253: the >>> underlying disk is so dead that it is no longer visible to the OS. >>> Probably I added --zap when I did the 'ceph orch osd rm', and thus it >>> is trying to do the zapping, fails as it doesn't find the disk, and >>> retries indefinitely... I remain a little bit surprised that this >>> zapping error is not reported (without the traceback) at the INFO >>> level and requires DEBUG to be seen, but that is a detail. I'm surprised >>> that Ceph does not give up on zapping if it cannot access the device; >>> or did I miss something and there is a way to stop this process? >>> >>> Maybe it is a corner case that has been fixed/improved since >>> 18.2.2... Anyway, the question remains: is there a way out of this >>> problem (which seems to be the only reason the upgrade does not really >>> start) apart from getting the replacement device? >>> >>> Best regards, >>> >>> Michel >>> >>> Le 28/04/2025 à 18:19, Michel Jouvin a écrit : >>>> Hi Frédéric, >>>> >>>> Thanks for the command. I'm always looking at the wrong page of the >>>> doc! I looked at >>>> https://docs.ceph.com/en/latest/rados/troubleshooting/log-and-debug/ >>>> which lists the Ceph subsystems and their default log levels, but there >>>> is no mention of cephadm there... After enabling the cephadm debug log >>>> level and restarting the upgrade, I got the messages below. The only >>>> strange thing points to the problem with osd.253, where it tries to >>>> zap the device that was probably already zapped and thus cannot find >>>> the LV volume associated with osd.253. There are not really any other >>>> messages indicating the impact on the upgrade, but I guess it is the >>>> reason. What do you think? And is there any way to fix it, other >>>> than replacing the OSD?
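
For reference, a minimal sketch of the unblock sequence that emerges from this thread, using osd.253 on dig-osd4 from the messages below as the example (adapt the OSD id to your own cluster):

$ ceph orch osd rm status                       # check which removals are still queued
$ ceph orch osd rm stop 253                     # clear the stuck removal from the orchestrator queue
$ ceph osd destroy 253 --yes-i-really-mean-it   # only if the OSD is not already marked destroyed
$ ceph osd tree | grep destroyed                # confirm the destroyed flag, so the id can be reused for the replacement

Whether the explicit 'ceph osd destroy' is still needed depends on how far the orchestrator got; in Michel's case the OSDs were already marked as destroyed in the CRUSH tree.
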
>>>> >>>> Best regards, >>>> >>>> Michel >>>> >>>> --------------------- cephadm debug level log ------------------------- >>>> >>>> 2025-04-28T17:32:12.713746+0200 mgr.dig-mon1.fownxo [INF] Upgrade: >>>> Started with target quay.io/ceph/ceph:v18.2.6 >>>> 2025-04-28T17:32:14.822030+0200 mgr.dig-mon1.fownxo [DBG] Refreshed >>>> host dig-osd4 devices (23) >>>> 2025-04-28T17:32:14.822550+0200 mgr.dig-mon1.fownxo [DBG] Finding >>>> OSDSpecs for host: <dig-osd4> >>>> 2025-04-28T17:32:14.822614+0200 mgr.dig-mon1.fownxo [DBG] Generating >>>> OSDSpec previews for [] >>>> 2025-04-28T17:32:14.822695+0200 mgr.dig-mon1.fownxo [DBG] Loading >>>> OSDSpec previews to HostCache for host <dig-osd4> >>>> 2025-04-28T17:32:14.985257+0200 mgr.dig-mon1.fownxo [DBG] >>>> mon_command: 'config generate-minimal-conf' -> 0 in 0.005s >>>> 2025-04-28T17:32:15.262102+0200 mgr.dig-mon1.fownxo [DBG] >>>> mon_command: 'auth get' -> 0 in 0.277s >>>> 2025-04-28T17:32:15.262751+0200 mgr.dig-mon1.fownxo [DBG] Combine >>>> hosts with existing daemons [] + new hosts.... (very long line) >>>> >>>> 2025-04-28T17:32:15.416491+0200 mgr.dig-mon1.fownxo [DBG] >>>> _update_paused_health >>>> 2025-04-28T17:32:17.314607+0200 mgr.dig-mon1.fownxo [DBG] >>>> mon_command: 'osd df' -> 0 in 0.064s >>>> 2025-04-28T17:32:17.637526+0200 mgr.dig-mon1.fownxo [DBG] >>>> mon_command: 'osd df' -> 0 in 0.320s >>>> 2025-04-28T17:32:17.645703+0200 mgr.dig-mon1.fownxo [DBG] 2 OSDs are >>>> scheduled for removal: [osd.381, osd.253] >>>> 2025-04-28T17:32:17.661910+0200 mgr.dig-mon1.fownxo [DBG] >>>> mon_command: 'osd df' -> 0 in 0.011s >>>> 2025-04-28T17:32:17.667068+0200 mgr.dig-mon1.fownxo [DBG] >>>> mon_command: 'osd safe-to-destroy' -> 0 in 0.002s >>>> 2025-04-28T17:32:17.667117+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd >>>> safe-to-destroy returns: >>>> 2025-04-28T17:32:17.667164+0200 mgr.dig-mon1.fownxo [DBG] running >>>> cmd: osd down on ids [osd.381] >>>> 2025-04-28T17:32:17.667854+0200 mgr.dig-mon1.fownxo [DBG] >>>> mon_command: 'osd down' -> 0 in 0.001s >>>> 2025-04-28T17:32:17.667908+0200 mgr.dig-mon1.fownxo [INF] osd.381 >>>> now down >>>> 2025-04-28T17:32:17.668446+0200 mgr.dig-mon1.fownxo [INF] Daemon >>>> osd.381 on dig-osd6 was already removed >>>> 2025-04-28T17:32:17.669534+0200 mgr.dig-mon1.fownxo [DBG] >>>> mon_command: 'osd destroy-actual' -> 0 in 0.001s >>>> 2025-04-28T17:32:17.669675+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd >>>> destroy-actual returns: >>>> 2025-04-28T17:32:17.669789+0200 mgr.dig-mon1.fownxo [INF] >>>> Successfully destroyed old osd.381 on dig-osd6; ready for replacement >>>> 2025-04-28T17:32:17.669874+0200 mgr.dig-mon1.fownxo [DBG] Removing >>>> osd.381 from the queue. 
>>>> 2025-04-28T17:32:17.680411+0200 mgr.dig-mon1.fownxo [DBG] >>>> mon_command: 'osd df' -> 0 in 0.010s >>>> 2025-04-28T17:32:17.685141+0200 mgr.dig-mon1.fownxo [DBG] >>>> mon_command: 'osd safe-to-destroy' -> 0 in 0.002s >>>> 2025-04-28T17:32:17.685190+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd >>>> safe-to-destroy returns: >>>> 2025-04-28T17:32:17.685234+0200 mgr.dig-mon1.fownxo [DBG] running >>>> cmd: osd down on ids [osd.253] >>>> 2025-04-28T17:32:17.685710+0200 mgr.dig-mon1.fownxo [DBG] >>>> mon_command: 'osd down' -> 0 in 0.000s >>>> 2025-04-28T17:32:17.685759+0200 mgr.dig-mon1.fownxo [INF] osd.253 >>>> now down >>>> 2025-04-28T17:32:17.686186+0200 mgr.dig-mon1.fownxo [INF] Daemon >>>> osd.253 on dig-osd4 was already removed >>>> 2025-04-28T17:32:17.687068+0200 mgr.dig-mon1.fownxo [DBG] >>>> mon_command: 'osd destroy-actual' -> 0 in 0.001s >>>> 2025-04-28T17:32:17.687102+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd >>>> destroy-actual returns: >>>> 2025-04-28T17:32:17.687141+0200 mgr.dig-mon1.fownxo [INF] >>>> Successfully destroyed old osd.253 on dig-osd4; ready for replacement >>>> 2025-04-28T17:32:17.687176+0200 mgr.dig-mon1.fownxo [INF] Zapping >>>> devices for osd.253 on dig-osd4 >>>> 2025-04-28T17:32:17.687508+0200 mgr.dig-mon1.fownxo [DBG] >>>> _run_cephadm : command = ceph-volume >>>> 2025-04-28T17:32:17.687554+0200 mgr.dig-mon1.fownxo [DBG] >>>> _run_cephadm : args = ['--', 'lvm', 'zap', '--osd-id', '253', >>>> '--destroy'] >>>> 2025-04-28T17:32:17.687637+0200 mgr.dig-mon1.fownxo [DBG] osd >>>> container image >>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>> 2025-04-28T17:32:17.687677+0200 mgr.dig-mon1.fownxo [DBG] args: >>>> --image >>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>> --timeout 895 ceph-volume --fsid >>>> f5195e24-158c-11ee-b338-5ced8c61b074 -- lvm zap --osd-id 253 --destroy >>>> 2025-04-28T17:32:17.687733+0200 mgr.dig-mon1.fownxo [DBG] Running >>>> command: which python3 >>>> 2025-04-28T17:32:17.731474+0200 mgr.dig-mon1.fownxo [DBG] Running >>>> command: /usr/bin/python3 >>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d >>>> --image >>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>> --timeout 895 ceph-volume --fsid >>>> f5195e24-158c-11ee-b338-5ced8c61b074 -- lvm zap --osd-id 253 --destroy >>>> 2025-04-28T17:32:20.406723+0200 mgr.dig-mon1.fownxo [DBG] code: 1 >>>> 2025-04-28T17:32:20.406764+0200 mgr.dig-mon1.fownxo [DBG] err: >>>> Inferring config >>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/config/ceph.conf >>>> Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host >>>> --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume >>>> --privileged --group-add=disk --init -e >>>> CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>> -e NODE_NAME=dig-osd4 -e CEPH_USE_RANDOM_NONCE=1 -e >>>> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v >>>> /var/run/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/run/ceph:z >>>> -v >>>> /var/log/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/log/ceph:z >>>> -v >>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/crash:/var/lib/ceph/crash:z >>>> -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v >>>> /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v >>>> /run/lock/lvm:/run/lock/lvm -v >>>> 
/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/selinux:/sys/fs/selinux:ro >>>> -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v >>>> /tmp/ceph-tmpgtvcw4gk:/etc/ceph/ceph.conf:z >>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>> lvm zap --osd-id 253 --destroy >>>> /usr/bin/podman: stderr Traceback (most recent call last): >>>> /usr/bin/podman: stderr File "/usr/sbin/ceph-volume", line 11, in >>>> <module> >>>> /usr/bin/podman: stderr load_entry_point('ceph-volume==1.0.0', >>>> 'console_scripts', 'ceph-volume')() >>>> /usr/bin/podman: stderr File >>>> "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in >>>> __init__ >>>> /usr/bin/podman: stderr self.main(self.argv) >>>> /usr/bin/podman: stderr File >>>> "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line >>>> 59, in newfunc >>>> /usr/bin/podman: stderr return f(*a, **kw) >>>> /usr/bin/podman: stderr File >>>> "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in >>>> main >>>> /usr/bin/podman: stderr terminal.dispatch(self.mapper, >>>> subcommand_args) >>>> /usr/bin/podman: stderr File >>>> "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line >>>> 194, in dispatch >>>> /usr/bin/podman: stderr instance.main() >>>> /usr/bin/podman: stderr File >>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", >>>> line 46, in main >>>> /usr/bin/podman: stderr terminal.dispatch(self.mapper, self.argv) >>>> /usr/bin/podman: stderr File >>>> "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line >>>> 194, in dispatch >>>> /usr/bin/podman: stderr instance.main() >>>> /usr/bin/podman: stderr File >>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", >>>> line 403, in main >>>> /usr/bin/podman: stderr self.zap_osd() >>>> /usr/bin/podman: stderr File >>>> "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line >>>> 16, in is_root >>>> /usr/bin/podman: stderr return func(*a, **kw) >>>> /usr/bin/podman: stderr File >>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", >>>> line 301, in zap_osd >>>> /usr/bin/podman: stderr devices = >>>> find_associated_devices(self.args.osd_id, self.args.osd_fsid) >>>> /usr/bin/podman: stderr File >>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", >>>> line 88, in find_associated_devices >>>> /usr/bin/podman: stderr '%s' % osd_id or osd_fsid) >>>> /usr/bin/podman: stderr RuntimeError: Unable to find any LV for >>>> zapping OSD: 253 >>>> Traceback (most recent call last): >>>> File "/usr/lib64/python3.9/runpy.py", line 197, in >>>> _run_module_as_main >>>> return _run_code(code, main_globals, None, >>>> File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code >>>> exec(code, run_globals) >>>> File >>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>> line 10700, in <module> >>>> File >>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>> line 10688, in main >>>> File >>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>> line 2445, in _infer_config >>>> File >>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>> line 2361, in _infer_fsid >>>> File >>>> 
"/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>> line 2473, in _infer_image >>>> File >>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>> line 2348, in _validate_fsid >>>> File >>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>> line 6970, in command_ceph_volume >>>> File >>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py", >>>> line 2136, in call_throws >>>> RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host >>>> --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume >>>> --privileged --group-add=disk --init -e >>>> CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>> -e NODE_NAME=dig-osd4 -e CEPH_USE_RANDOM_NONCE=1 -e >>>> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v >>>> /var/run/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/run/ceph:z >>>> -v >>>> /var/log/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/log/ceph:z >>>> -v >>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/crash:/var/lib/ceph/crash:z >>>> -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v >>>> /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v >>>> /run/lock/lvm:/run/lock/lvm -v >>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/selinux:/sys/fs/selinux:ro >>>> -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v >>>> /tmp/ceph-tmpgtvcw4gk:/etc/ceph/ceph.conf:z >>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f >>>> lvm zap --osd-id 253 --destroy >>>> 2025-04-28T17:32:20.409316+0200 mgr.dig-mon1.fownxo [DBG] serve loop >>>> sleep >>>> >>>> ----------------------- >>>> >>>> >>>> Le 28/04/2025 à 14:00, Frédéric Nass a écrit : >>>>> Hi Michel, >>>>> >>>>> You need to turn on cephadm debugging as described here [1] in the >>>>> documentation >>>>> >>>>> $ ceph config set mgr mgr/cephadm/log_to_cluster_level debug >>>>> >>>>> and then look for any hints with >>>>> >>>>> $ ceph -W cephadm --watch-debug >>>>> >>>>> or >>>>> >>>>> $ tail -f /var/log/ceph/$(ceph fsid)/ceph.cephadm.log (on the >>>>> active MGR) >>>>> >>>>> when you start/stop the upgrade. >>>>> >>>>> Regards, >>>>> Frédéric. >>>>> >>>>> [1] https://docs.ceph.com/en/reef/cephadm/operations/ >>>>> >>>>> ----- Le 28 Avr 25, à 12:52, Michel Jouvin >>>>> michel.jou...@ijclab.in2p3.fr a écrit : >>>>> >>>>>> Eugen, >>>>>> >>>>>> Thanks for doing the test. I scanned all logs and cannot find >>>>>> anything >>>>>> except the message mentioned displayed every 10s about the removed >>>>>> OSDs >>>>>> that led me to think there is something not exactly as expected... >>>>>> No clue >>>>>> what... >>>>>> >>>>>> Michel >>>>>> Sent from my mobile >>>>>> Le 28 avril 2025 12:43:23 Eugen Block <ebl...@nde.ag> a écrit : >>>>>> >>>>>>> I just tried this on a single-node virtual test cluster, deployed it >>>>>>> with 18.2.2. Then I removed one OSD with --replace flag (no --zap, >>>>>>> otherwise it would redeploy the OSD on that VM). Then I also see the >>>>>>> stray daemon warning, but the upgrade from 18.2.2 to 18.2.6 finished >>>>>>> successfully. That's why I don't think the stray daemon is the root >>>>>>> cause here. 
I would suggest scanning the monitor and cephadm logs as >>>>>>> well. >>>>>>> After the upgrade to 18.2.6 the stray warning cleared, btw. >>>>>>> >>>>>>> >>>>>>> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>: >>>>>>> >>>>>>>> Eugen, >>>>>>>> >>>>>>>> As said in a previous message, I found a tracker issue with a >>>>>>>> similar problem: https://tracker.ceph.com/issues/67018, even if the >>>>>>>> cause may be different as it occurred in older versions than mine. For some >>>>>>>> reason the sequence of messages every 10s is now back on the 2 >>>>>>>> OSDs: >>>>>>>> >>>>>>>> 2025-04-28T10:00:28.226741+0200 mgr.dig-mon1.fownxo [INF] >>>>>>>> osd.253 now down >>>>>>>> 2025-04-28T10:00:28.227249+0200 mgr.dig-mon1.fownxo [INF] Daemon >>>>>>>> osd.253 on dig-osd4 was already removed >>>>>>>> 2025-04-28T10:00:28.228929+0200 mgr.dig-mon1.fownxo [INF] >>>>>>>> Successfully destroyed old osd.253 on dig-osd4; ready for >>>>>>>> replacement >>>>>>>> 2025-04-28T10:00:28.228994+0200 mgr.dig-mon1.fownxo [INF] Zapping >>>>>>>> devices for osd.253 on dig-osd4 >>>>>>>> 2025-04-28T10:00:39.132028+0200 mgr.dig-mon1.fownxo [INF] >>>>>>>> osd.381 now down >>>>>>>> 2025-04-28T10:00:39.132599+0200 mgr.dig-mon1.fownxo [INF] Daemon >>>>>>>> osd.381 on dig-osd6 was already removed >>>>>>>> 2025-04-28T10:00:39.133424+0200 mgr.dig-mon1.fownxo [INF] >>>>>>>> Successfully destroyed old osd.381 on dig-osd6; ready for >>>>>>>> replacement >>>>>>>> >>>>>>>> except that the "Zapping.." message is not present for the >>>>>>>> second OSD... >>>>>>>> >>>>>>>> I tried to increase the mgr log verbosity with 'ceph tell >>>>>>>> mgr.dig-mon1.fownxo config set debug_mgr 20/20' and then >>>>>>>> stopped/started >>>>>>>> the upgrade without any additional message being displayed. >>>>>>>> >>>>>>>> Michel >>>>>>>> >>>>>>>> Le 28/04/2025 à 09:20, Eugen Block a écrit : >>>>>>>>> Have you increased the debug level for the mgr? It would surprise >>>>>>>>> me if stray daemons really blocked an upgrade. But debug logs >>>>>>>>> might reveal something. And if it can be confirmed that the strays >>>>>>>>> really block the upgrade, you could either remove the OSDs >>>>>>>>> entirely >>>>>>>>> (they are already drained) to continue upgrading, or create a >>>>>>>>> tracker issue to report this and wait for instructions. >>>>>>>>> >>>>>>>>> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>: >>>>>>>>> >>>>>>>>>> Hi Eugen, >>>>>>>>>> >>>>>>>>>> Yes, I stopped and restarted the upgrade several times already, in >>>>>>>>>> particular after failing over the mgr. And the only related messages >>>>>>>>>> are the upgrade started and upgrade canceled ones. >>>>>>>>>> Nothing >>>>>>>>>> related to an error or a crash... >>>>>>>>>> >>>>>>>>>> For me the question is why I have stray daemons after removing >>>>>>>>>> OSDs. IMO it is unexpected as these daemons are not there anymore. >>>>>>>>>> I can understand that stray daemons prevent the upgrade from starting >>>>>>>>>> if they are really stray... And it would be nice if cephadm >>>>>>>>>> gave a message about why the upgrade does not really start >>>>>>>>>> even though its status is "in progress"... >>>>>>>>>> >>>>>>>>>> Best regards, >>>>>>>>>> >>>>>>>>>> Michel >>>>>>>>>> Sent from my mobile >>>>>>>>>> Le 28 avril 2025 07:27:44 Eugen Block <ebl...@nde.ag> a écrit : >>>>>>>>>> >>>>>>>>>>> Do you see anything in the mgr log? To get fresh logs I would >>>>>>>>>>> cancel >>>>>>>>>>> the upgrade (ceph orch upgrade stop) and then try again.
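
Combining Frédéric's debug-logging instructions quoted above with the stop/start cycle suggested here, a possible capture sequence (assuming the 18.2.6 target used in this thread) would be:

$ ceph config set mgr mgr/cephadm/log_to_cluster_level debug   # turn on cephadm debug logging
$ ceph orch upgrade stop                                       # cancel the stalled upgrade
$ ceph -W cephadm --watch-debug                                # keep this running in a second terminal
$ ceph orch upgrade start --ceph-version 18.2.6                # restart the upgrade and watch for the first error
$ ceph config rm mgr mgr/cephadm/log_to_cluster_level          # drop back to the default level afterwards
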
>>>>>>>>>>> A workaround could be to manually upgrade the mgr daemons by >>>>>>>>>>> changing >>>>>>>>>>> their unit.run file, but that would be my last resort. Btw, >>>>>>>>>>> did you >>>>>>>>>>> stop and start the upgrade after failing the mgr as well? >>>>>>>>>>> >>>>>>>>>>> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>: >>>>>>>>>>> >>>>>>>>>>>> Eugen, >>>>>>>>>>>> >>>>>>>>>>>> Thanks for the hint. Here is the osd_remove_queue: >>>>>>>>>>>> >>>>>>>>>>>> [root@ijc-mon1 ~]# ceph config-key get >>>>>>>>>>>> mgr/cephadm/osd_remove_queue|jq >>>>>>>>>>>> [ >>>>>>>>>>>> { >>>>>>>>>>>> "osd_id": 253, >>>>>>>>>>>> "started": true, >>>>>>>>>>>> "draining": false, >>>>>>>>>>>> "stopped": false, >>>>>>>>>>>> "replace": true, >>>>>>>>>>>> "force": false, >>>>>>>>>>>> "zap": true, >>>>>>>>>>>> "hostname": "dig-osd4", >>>>>>>>>>>> "drain_started_at": null, >>>>>>>>>>>> "drain_stopped_at": null, >>>>>>>>>>>> "drain_done_at": "2025-04-15T14:09:30.521534Z", >>>>>>>>>>>> "process_started_at": "2025-04-15T14:09:14.091592Z" >>>>>>>>>>>> }, >>>>>>>>>>>> { >>>>>>>>>>>> "osd_id": 381, >>>>>>>>>>>> "started": true, >>>>>>>>>>>> "draining": false, >>>>>>>>>>>> "stopped": false, >>>>>>>>>>>> "replace": true, >>>>>>>>>>>> "force": false, >>>>>>>>>>>> "zap": false, >>>>>>>>>>>> "hostname": "dig-osd6", >>>>>>>>>>>> "drain_started_at": "2025-04-23T11:56:09.864724Z", >>>>>>>>>>>> "drain_stopped_at": null, >>>>>>>>>>>> "drain_done_at": "2025-04-25T06:53:03.678729Z", >>>>>>>>>>>> "process_started_at": "2025-04-23T11:56:05.924923Z" >>>>>>>>>>>> } >>>>>>>>>>>> ] >>>>>>>>>>>> >>>>>>>>>>>> It is not empty; the two stray daemons are listed. Not sure >>>>>>>>>>>> if these >>>>>>>>>>>> entries are expected as I specified --replace... A similar >>>>>>>>>>>> issue was >>>>>>>>>>>> reported in https://tracker.ceph.com/issues/67018, so before >>>>>>>>>>>> Reef, but >>>>>>>>>>>> the cause may be different. It is still not clear to me how to >>>>>>>>>>>> get out of >>>>>>>>>>>> this, except maybe replacing the OSDs, but this will take >>>>>>>>>>>> some time... >>>>>>>>>>>> >>>>>>>>>>>> Best regards, >>>>>>>>>>>> >>>>>>>>>>>> Michel >>>>>>>>>>>> >>>>>>>>>>>> Le 27/04/2025 à 10:21, Eugen Block a écrit : >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> What's the current ceph status? Wasn't there a bug in early >>>>>>>>>>>>> Reef >>>>>>>>>>>>> versions preventing upgrades if there were removed OSDs in the >>>>>>>>>>>>> queue? But IIRC, the cephadm module would crash. Can you check >>>>>>>>>>>>> >>>>>>>>>>>>> ceph config-key get mgr/cephadm/osd_remove_queue >>>>>>>>>>>>> >>>>>>>>>>>>> And then I would check the mgr log, maybe set it to a >>>>>>>>>>>>> higher debug >>>>>>>>>>>>> level to see what's blocking it. >>>>>>>>>>>>> >>>>>>>>>>>>> Zitat von Michel Jouvin <michel.jou...@ijclab.in2p3.fr>: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I tried to restart all the mgrs (we have 3, 1 active, 2 >>>>>>>>>>>>>> standby) >>>>>>>>>>>>>> by executing 'ceph mgr fail' 3 times, with no impact. I don't >>>>>>>>>>>>>> really understand why I get these stray daemons after doing a >>>>>>>>>>>>>> 'ceph orch osd rm --replace', but I think I have always >>>>>>>>>>>>>> seen this. >>>>>>>>>>>>>> I tried to mute rather than disable the stray daemon check, >>>>>>>>>>>>>> but it >>>>>>>>>>>>>> doesn't help either.
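
To see at a glance which removals are still pending and with which flags, the queue shown above can be filtered, for example:

$ ceph config-key get mgr/cephadm/osd_remove_queue | jq '.[] | {osd_id, hostname, replace, zap}'
$ ceph orch osd rm status

Stopping a removal with 'ceph orch osd rm stop <OSD_ID>' is the cleaner way to take an entry out of this queue; editing the config-key by hand is probably best avoided.
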
And I find it strange that this message appears every >>>>>>>>>>>>>> 10s >>>>>>>>>>>>>> about one of the destroyed OSDs, and only one, reporting it >>>>>>>>>>>>>> is down >>>>>>>>>>>>>> and already destroyed and saying it'll zap it (I think I >>>>>>>>>>>>>> didn't >>>>>>>>>>>>>> add --zap when I removed it, as the underlying disk is dead). >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm completely stuck with this upgrade and I don't >>>>>>>>>>>>>> remember having >>>>>>>>>>>>>> this kind of problem in previous upgrades with cephadm... >>>>>>>>>>>>>> Any >>>>>>>>>>>>>> idea where to look for the cause and/or how to fix it? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Best regards, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Michel >>>>>>>>>>>>>> >>>>>>>>>>>>>> Le 24/04/2025 à 23:34, Michel Jouvin a écrit : >>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I'm trying to upgrade a (cephadm) cluster from 18.2.2 to >>>>>>>>>>>>>>> 18.2.6, >>>>>>>>>>>>>>> using 'ceph orch upgrade'. When I enter the command 'ceph >>>>>>>>>>>>>>> orch >>>>>>>>>>>>>>> upgrade start --ceph-version 18.2.6', I receive a message >>>>>>>>>>>>>>> saying >>>>>>>>>>>>>>> that the upgrade has been initiated, with a similar >>>>>>>>>>>>>>> message in >>>>>>>>>>>>>>> the logs, but nothing happens after this. 'ceph orch upgrade >>>>>>>>>>>>>>> status' says: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> ------- >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [root@ijc-mon1 ~]# ceph orch upgrade status >>>>>>>>>>>>>>> { >>>>>>>>>>>>>>> "target_image": "quay.io/ceph/ceph:v18.2.6", >>>>>>>>>>>>>>> "in_progress": true, >>>>>>>>>>>>>>> "which": "Upgrading all daemon types on all hosts", >>>>>>>>>>>>>>> "services_complete": [], >>>>>>>>>>>>>>> "progress": "", >>>>>>>>>>>>>>> "message": "", >>>>>>>>>>>>>>> "is_paused": false >>>>>>>>>>>>>>> } >>>>>>>>>>>>>>> ------- >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The first time I entered the command, the cluster status was >>>>>>>>>>>>>>> HEALTH_WARN because of 2 stray daemons (caused by >>>>>>>>>>>>>>> destroyed OSDs, >>>>>>>>>>>>>>> rm --replace). I set mgr/cephadm/warn_on_stray_daemons to >>>>>>>>>>>>>>> false >>>>>>>>>>>>>>> to ignore these 2 daemons; the cluster is now HEALTH_OK, >>>>>>>>>>>>>>> but it >>>>>>>>>>>>>>> doesn't help. Following a Red Hat KB entry, I tried to >>>>>>>>>>>>>>> fail over >>>>>>>>>>>>>>> the mgr, stopped and restarted the upgrade, but without any >>>>>>>>>>>>>>> improvement. I have not seen anything in the logs, except >>>>>>>>>>>>>>> that >>>>>>>>>>>>>>> there is an INF entry every 10s about the destroyed OSD >>>>>>>>>>>>>>> saying: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2025-04-24T21:30:54.161988+0000 mgr.ijc-mon1.yyfnhz >>>>>>>>>>>>>>> (mgr.55376028) 14079 : cephadm [INF] osd.253 now down >>>>>>>>>>>>>>> 2025-04-24T21:30:54.162601+0000 mgr.ijc-mon1.yyfnhz >>>>>>>>>>>>>>> (mgr.55376028) 14080 : cephadm [INF] Daemon osd.253 on >>>>>>>>>>>>>>> dig-osd4 >>>>>>>>>>>>>>> was already removed >>>>>>>>>>>>>>> 2025-04-24T21:30:54.164440+0000 mgr.ijc-mon1.yyfnhz >>>>>>>>>>>>>>> (mgr.55376028) 14081 : cephadm [INF] Successfully >>>>>>>>>>>>>>> destroyed old >>>>>>>>>>>>>>> osd.253 on dig-osd4; ready for replacement >>>>>>>>>>>>>>> 2025-04-24T21:30:54.164536+0000 mgr.ijc-mon1.yyfnhz >>>>>>>>>>>>>>> (mgr.55376028) 14082 : cephadm [INF] Zapping devices for >>>>>>>>>>>>>>> osd.253 >>>>>>>>>>>>>>> on dig-osd4 >>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The message seems to be only for one of the 2 destroyed OSDs >>>>>>>>>>>>>>> since I restarted the mgr. Could this be the cause of the >>>>>>>>>>>>>>> stuck >>>>>>>>>>>>>>> upgrade?
What can I do to fix this? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks in advance for any hint. Best regards, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Michel
_______________________________________________ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io