Hi Michel,

I've seen this recently on Reef (OSD stuck in the rm queue with the
orchestrator trying to zap a device that had already been zapped).

I could reproduce this a few times by deleting a batch of OSDs running on the 
same node. The whole 'ceph orch osd rm' process would stop progressing when 
trying to remove the ~8th OSD. I suspect that at some point ceph-volume or the
orchestrator fails to register that the device has already been zapped, and
ends up looping over and over, trying to zap a device that no longer exists.
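
Roughly what I mean by a batch removal (OSD IDs and flags purely illustrative,
not the exact command I used):

$ ceph orch osd rm 10 11 12 13 14 15 16 17 --replace --zap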

I think you should now run 'ceph osd destroy <OSD_ID> --yes-i-really-mean-it'
so the OSD is marked as destroyed again and stays ready for replacement.
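
For example (using osd.253 from your earlier mails; adjust the ID to your
case):

$ ceph osd destroy 253 --yes-i-really-mean-it
$ ceph osd tree | grep destroyed   # the OSD should show as 'destroyed' again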

Regards,
Frédéric.

----- On 30 Apr 25, at 10:28, Michel Jouvin michel.jou...@ijclab.in2p3.fr
wrote:

> Eugen,
> 
> Thanks, I forgot that operations started with the orchestrator can be
> stopped. You were right: stopping the 'osd rm' was enough to unblock the
> upgrade. I am not completely sure what the consequence is for the replace
> flag: I have the feeling it has been lost somehow, as the OSD is no
> longer listed by 'ceph orch osd rm status' and 'ceph -s' now reports one
> OSD down and 1 stray daemon instead of 2 stray daemons.
> 
> Michel
> 
> On 30/04/2025 at 09:24, Eugen Block wrote:
>> You can stop the osd removal:
>>
>> ceph orch osd rm stop <OSD_ID>
>>
>> I'm not entirely sure what the orchestrator will do except for
>> clearing the pending state, and since the OSDs are already marked as
>> destroyed in the crush tree, I wouldn't expect anything weird. But
>> it's worth a try, I guess.
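>>
>> For example, using osd.253 from the earlier mails (a sketch, just to show
>> how to check afterwards):
>>
>> ceph orch osd rm stop 253
>> ceph orch osd rm status            # the OSD should no longer be queued
>> ceph osd tree | grep destroyed     # check whether the flag is still set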
>>
>> Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>>
>>> Hi,
>>>
>>> I had no time to investigate our problem further yesterday. But I
>>> realized one issue that may explain the problem with osd.253: the
>>> underlying disk is so dead that it is no longer visible to the OS.
>>> I probably added --zap when I did the 'ceph orch osd rm', and thus it
>>> is trying to do the zapping, fails as it doesn't find the disk, and
>>> retries indefinitely... I am a little surprised that this zapping
>>> error is not reported (without the traceback) at the INFO level and
>>> requires DEBUG to be seen, but that is a detail. I'm surprised
>>> that Ceph does not give up on zapping if it cannot access the device.
>>> Or did I miss something, and is there a way to stop this process?
>>>
>>> Maybe it is a corner case that has been fixed/improved since
>>> 18.2.2... Anyway, the question remains: is there a way out of this
>>> problem (which seems to be the only reason the upgrade does not really
>>> start) apart from getting the replacement device?
>>>
>>> Best regards,
>>>
>>> Michel
>>>
>>> On 28/04/2025 at 18:19, Michel Jouvin wrote:
>>>> Hi Frédéric,
>>>>
>>>> Thanks for the command. I'm always looking at the wrong page of the
>>>> doc! I looked at
>>>> https://docs.ceph.com/en/latest/rados/troubleshooting/log-and-debug/
>>>> which lists the Ceph subsystems and their default log levels, but there
>>>> is no mention of cephadm there... After enabling the cephadm debug log
>>>> level and restarting the upgrade, I got the messages below. The only
>>>> strange thing points to the problem with osd.253, where it tries to
>>>> zap the device that was probably already zapped and thus cannot find
>>>> the LV associated with osd.253. There is no other message really
>>>> stating the impact on the upgrade, but I guess this is the reason.
>>>> What do you think? And is there any way to fix it, other than
>>>> replacing the OSD?
>>>>
>>>> Best regards,
>>>>
>>>> Michel
>>>>
>>>> --------------------- cephadm debug level log -------------------------
>>>>
>>>> 2025-04-28T17:32:12.713746+0200 mgr.dig-mon1.fownxo [INF] Upgrade:
>>>> Started with target quay.io/ceph/ceph:v18.2.6
>>>> 2025-04-28T17:32:14.822030+0200 mgr.dig-mon1.fownxo [DBG] Refreshed
>>>> host dig-osd4 devices (23)
>>>> 2025-04-28T17:32:14.822550+0200 mgr.dig-mon1.fownxo [DBG] Finding
>>>> OSDSpecs for host: <dig-osd4>
>>>> 2025-04-28T17:32:14.822614+0200 mgr.dig-mon1.fownxo [DBG] Generating
>>>> OSDSpec previews for []
>>>> 2025-04-28T17:32:14.822695+0200 mgr.dig-mon1.fownxo [DBG] Loading
>>>> OSDSpec previews to HostCache for host <dig-osd4>
>>>> 2025-04-28T17:32:14.985257+0200 mgr.dig-mon1.fownxo [DBG]
>>>> mon_command: 'config generate-minimal-conf' -> 0 in 0.005s
>>>> 2025-04-28T17:32:15.262102+0200 mgr.dig-mon1.fownxo [DBG]
>>>> mon_command: 'auth get' -> 0 in 0.277s
>>>> 2025-04-28T17:32:15.262751+0200 mgr.dig-mon1.fownxo [DBG] Combine
>>>> hosts with existing daemons [] + new hosts.... (very long line)
>>>>
>>>> 2025-04-28T17:32:15.416491+0200 mgr.dig-mon1.fownxo [DBG]
>>>> _update_paused_health
>>>> 2025-04-28T17:32:17.314607+0200 mgr.dig-mon1.fownxo [DBG]
>>>> mon_command: 'osd df' -> 0 in 0.064s
>>>> 2025-04-28T17:32:17.637526+0200 mgr.dig-mon1.fownxo [DBG]
>>>> mon_command: 'osd df' -> 0 in 0.320s
>>>> 2025-04-28T17:32:17.645703+0200 mgr.dig-mon1.fownxo [DBG] 2 OSDs are
>>>> scheduled for removal: [osd.381, osd.253]
>>>> 2025-04-28T17:32:17.661910+0200 mgr.dig-mon1.fownxo [DBG]
>>>> mon_command: 'osd df' -> 0 in 0.011s
>>>> 2025-04-28T17:32:17.667068+0200 mgr.dig-mon1.fownxo [DBG]
>>>> mon_command: 'osd safe-to-destroy' -> 0 in 0.002s
>>>> 2025-04-28T17:32:17.667117+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd
>>>> safe-to-destroy returns:
>>>> 2025-04-28T17:32:17.667164+0200 mgr.dig-mon1.fownxo [DBG] running
>>>> cmd: osd down on ids [osd.381]
>>>> 2025-04-28T17:32:17.667854+0200 mgr.dig-mon1.fownxo [DBG]
>>>> mon_command: 'osd down' -> 0 in 0.001s
>>>> 2025-04-28T17:32:17.667908+0200 mgr.dig-mon1.fownxo [INF] osd.381
>>>> now down
>>>> 2025-04-28T17:32:17.668446+0200 mgr.dig-mon1.fownxo [INF] Daemon
>>>> osd.381 on dig-osd6 was already removed
>>>> 2025-04-28T17:32:17.669534+0200 mgr.dig-mon1.fownxo [DBG]
>>>> mon_command: 'osd destroy-actual' -> 0 in 0.001s
>>>> 2025-04-28T17:32:17.669675+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd
>>>> destroy-actual returns:
>>>> 2025-04-28T17:32:17.669789+0200 mgr.dig-mon1.fownxo [INF]
>>>> Successfully destroyed old osd.381 on dig-osd6; ready for replacement
>>>> 2025-04-28T17:32:17.669874+0200 mgr.dig-mon1.fownxo [DBG] Removing
>>>> osd.381 from the queue.
>>>> 2025-04-28T17:32:17.680411+0200 mgr.dig-mon1.fownxo [DBG]
>>>> mon_command: 'osd df' -> 0 in 0.010s
>>>> 2025-04-28T17:32:17.685141+0200 mgr.dig-mon1.fownxo [DBG]
>>>> mon_command: 'osd safe-to-destroy' -> 0 in 0.002s
>>>> 2025-04-28T17:32:17.685190+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd
>>>> safe-to-destroy returns:
>>>> 2025-04-28T17:32:17.685234+0200 mgr.dig-mon1.fownxo [DBG] running
>>>> cmd: osd down on ids [osd.253]
>>>> 2025-04-28T17:32:17.685710+0200 mgr.dig-mon1.fownxo [DBG]
>>>> mon_command: 'osd down' -> 0 in 0.000s
>>>> 2025-04-28T17:32:17.685759+0200 mgr.dig-mon1.fownxo [INF] osd.253
>>>> now down
>>>> 2025-04-28T17:32:17.686186+0200 mgr.dig-mon1.fownxo [INF] Daemon
>>>> osd.253 on dig-osd4 was already removed
>>>> 2025-04-28T17:32:17.687068+0200 mgr.dig-mon1.fownxo [DBG]
>>>> mon_command: 'osd destroy-actual' -> 0 in 0.001s
>>>> 2025-04-28T17:32:17.687102+0200 mgr.dig-mon1.fownxo [DBG] cmd: osd
>>>> destroy-actual returns:
>>>> 2025-04-28T17:32:17.687141+0200 mgr.dig-mon1.fownxo [INF]
>>>> Successfully destroyed old osd.253 on dig-osd4; ready for replacement
>>>> 2025-04-28T17:32:17.687176+0200 mgr.dig-mon1.fownxo [INF] Zapping
>>>> devices for osd.253 on dig-osd4
>>>> 2025-04-28T17:32:17.687508+0200 mgr.dig-mon1.fownxo [DBG]
>>>> _run_cephadm : command = ceph-volume
>>>> 2025-04-28T17:32:17.687554+0200 mgr.dig-mon1.fownxo [DBG]
>>>> _run_cephadm : args = ['--', 'lvm', 'zap', '--osd-id', '253',
>>>> '--destroy']
>>>> 2025-04-28T17:32:17.687637+0200 mgr.dig-mon1.fownxo [DBG] osd
>>>> container image
>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
>>>> 2025-04-28T17:32:17.687677+0200 mgr.dig-mon1.fownxo [DBG] args:
>>>> --image
>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
>>>> --timeout 895 ceph-volume --fsid
>>>> f5195e24-158c-11ee-b338-5ced8c61b074 -- lvm zap --osd-id 253 --destroy
>>>> 2025-04-28T17:32:17.687733+0200 mgr.dig-mon1.fownxo [DBG] Running
>>>> command: which python3
>>>> 2025-04-28T17:32:17.731474+0200 mgr.dig-mon1.fownxo [DBG] Running
>>>> command: /usr/bin/python3
>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d
>>>> --image
>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
>>>> --timeout 895 ceph-volume --fsid
>>>> f5195e24-158c-11ee-b338-5ced8c61b074 -- lvm zap --osd-id 253 --destroy
>>>> 2025-04-28T17:32:20.406723+0200 mgr.dig-mon1.fownxo [DBG] code: 1
>>>> 2025-04-28T17:32:20.406764+0200 mgr.dig-mon1.fownxo [DBG] err:
>>>> Inferring config
>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/config/ceph.conf
>>>> Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host
>>>> --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
>>>> --privileged --group-add=disk --init -e
>>>> CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
>>>> -e NODE_NAME=dig-osd4 -e CEPH_USE_RANDOM_NONCE=1 -e
>>>> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
>>>> /var/run/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/run/ceph:z
>>>> -v
>>>> /var/log/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/log/ceph:z
>>>> -v
>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/crash:/var/lib/ceph/crash:z
>>>> -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v
>>>> /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
>>>> /run/lock/lvm:/run/lock/lvm -v
>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/selinux:/sys/fs/selinux:ro
>>>> -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v
>>>> /tmp/ceph-tmpgtvcw4gk:/etc/ceph/ceph.conf:z
>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
>>>> lvm zap --osd-id 253 --destroy
>>>> /usr/bin/podman: stderr Traceback (most recent call last):
>>>> /usr/bin/podman: stderr   File "/usr/sbin/ceph-volume", line 11, in
>>>> <module>
>>>> /usr/bin/podman: stderr load_entry_point('ceph-volume==1.0.0',
>>>> 'console_scripts', 'ceph-volume')()
>>>> /usr/bin/podman: stderr   File
>>>> "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in
>>>> __init__
>>>> /usr/bin/podman: stderr     self.main(self.argv)
>>>> /usr/bin/podman: stderr   File
>>>> "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line
>>>> 59, in newfunc
>>>> /usr/bin/podman: stderr     return f(*a, **kw)
>>>> /usr/bin/podman: stderr   File
>>>> "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in
>>>> main
>>>> /usr/bin/podman: stderr     terminal.dispatch(self.mapper,
>>>> subcommand_args)
>>>> /usr/bin/podman: stderr   File
>>>> "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line
>>>> 194, in dispatch
>>>> /usr/bin/podman: stderr     instance.main()
>>>> /usr/bin/podman: stderr   File
>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py",
>>>> line 46, in main
>>>> /usr/bin/podman: stderr     terminal.dispatch(self.mapper, self.argv)
>>>> /usr/bin/podman: stderr   File
>>>> "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line
>>>> 194, in dispatch
>>>> /usr/bin/podman: stderr     instance.main()
>>>> /usr/bin/podman: stderr   File
>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py",
>>>> line 403, in main
>>>> /usr/bin/podman: stderr     self.zap_osd()
>>>> /usr/bin/podman: stderr   File
>>>> "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line
>>>> 16, in is_root
>>>> /usr/bin/podman: stderr     return func(*a, **kw)
>>>> /usr/bin/podman: stderr   File
>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py",
>>>> line 301, in zap_osd
>>>> /usr/bin/podman: stderr     devices =
>>>> find_associated_devices(self.args.osd_id, self.args.osd_fsid)
>>>> /usr/bin/podman: stderr   File
>>>> "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py",
>>>> line 88, in find_associated_devices
>>>> /usr/bin/podman: stderr     '%s' % osd_id or osd_fsid)
>>>> /usr/bin/podman: stderr RuntimeError: Unable to find any LV for
>>>> zapping OSD: 253
>>>> Traceback (most recent call last):
>>>>   File "/usr/lib64/python3.9/runpy.py", line 197, in
>>>> _run_module_as_main
>>>>     return _run_code(code, main_globals, None,
>>>>   File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
>>>>     exec(code, run_globals)
>>>>   File
>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>> line 10700, in <module>
>>>>   File
>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>> line 10688, in main
>>>>   File
>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>> line 2445, in _infer_config
>>>>   File
>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>> line 2361, in _infer_fsid
>>>>   File
>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>> line 2473, in _infer_image
>>>>   File
>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>> line 2348, in _validate_fsid
>>>>   File
>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>> line 6970, in command_ceph_volume
>>>>   File
>>>> "/var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/cephadm.2b9d7d139a9cb40289f2358faf49a109fc297c0a258bde893227c262c30bca8d/__main__.py",
>>>> line 2136, in call_throws
>>>> RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host
>>>> --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume
>>>> --privileged --group-add=disk --init -e
>>>> CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
>>>> -e NODE_NAME=dig-osd4 -e CEPH_USE_RANDOM_NONCE=1 -e
>>>> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
>>>> /var/run/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/run/ceph:z
>>>> -v
>>>> /var/log/ceph/f5195e24-158c-11ee-b338-5ced8c61b074:/var/log/ceph:z
>>>> -v
>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/crash:/var/lib/ceph/crash:z
>>>> -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v
>>>> /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
>>>> /run/lock/lvm:/run/lock/lvm -v
>>>> /var/lib/ceph/f5195e24-158c-11ee-b338-5ced8c61b074/selinux:/sys/fs/selinux:ro
>>>> -v /:/rootfs -v /etc/hosts:/etc/hosts:ro -v
>>>> /tmp/ceph-tmpgtvcw4gk:/etc/ceph/ceph.conf:z
>>>> quay.io/ceph/ceph@sha256:798f1b1e71ca1bbf76c687d8bcf5cd3e88640f044513ae55a0fb571502ae641f
>>>> lvm zap --osd-id 253 --destroy
>>>> 2025-04-28T17:32:20.409316+0200 mgr.dig-mon1.fownxo [DBG] serve loop
>>>> sleep
>>>>
>>>> -----------------------
>>>>
>>>>
>>>> On 28/04/2025 at 14:00, Frédéric Nass wrote:
>>>>> Hi Michel,
>>>>>
>>>>> You need to turn on cephadm debugging as described here [1] in the
>>>>> documentation
>>>>>
>>>>> $ ceph config set mgr mgr/cephadm/log_to_cluster_level debug
>>>>>
>>>>> and then look for any hints with
>>>>>
>>>>> $ ceph -W cephadm --watch-debug
>>>>>
>>>>> or
>>>>>
>>>>> $ tail -f /var/log/ceph/$(ceph fsid)/ceph.cephadm.log (on the
>>>>> active MGR)
>>>>>
>>>>> when you start/stop the upgrade.
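>>>>>
>>>>> Once done, you should be able to go back to the default verbosity by
>>>>> removing the override (a guess at the cleanest way; adjust if needed):
>>>>>
>>>>> $ ceph config rm mgr mgr/cephadm/log_to_cluster_level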
>>>>>
>>>>> Regards,
>>>>> Frédéric.
>>>>>
>>>>> [1] https://docs.ceph.com/en/reef/cephadm/operations/
>>>>>
>>>>> ----- On 28 Apr 25, at 12:52, Michel Jouvin
>>>>> michel.jou...@ijclab.in2p3.fr wrote:
>>>>>
>>>>>> Eugen,
>>>>>>
>>>>>> Thanks for doing the test. I scanned all the logs and cannot find
>>>>>> anything except the message mentioned above, displayed every 10s about
>>>>>> the removed OSDs, which led me to think that something is not exactly
>>>>>> as expected... No clue what...
>>>>>>
>>>>>> Michel
>>>>>> Sent from my mobile
>>>>>> On 28 April 2025 at 12:43:23, Eugen Block <ebl...@nde.ag> wrote:
>>>>>>
>>>>>>> I just tried this on a single-node virtual test cluster, deployed
>>>>>>> with 18.2.2. Then I removed one OSD with the --replace flag (no --zap,
>>>>>>> otherwise it would redeploy the OSD on that VM). Then I also saw the
>>>>>>> stray daemon warning, but the upgrade from 18.2.2 to 18.2.6 finished
>>>>>>> successfully. That's why I don't think the stray daemon is the root
>>>>>>> cause here. I would suggest scanning the monitor and cephadm logs as
>>>>>>> well.
>>>>>>> After the upgrade to 18.2.6 the stray warning cleared, btw.
>>>>>>>
>>>>>>>
>>>>>>> Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>>>>>>>
>>>>>>>> Eugen,
>>>>>>>>
>>>>>>>> As said in a previous message, I found a tracker issue with a
>>>>>>>> similar problem: https://tracker.ceph.com/issues/67018, even if the
>>>>>>>> cause may be different as it concerns older versions than mine. For
>>>>>>>> some reason the sequence of messages every 10s is now back for the 2
>>>>>>>> OSDs:
>>>>>>>>
>>>>>>>> 2025-04-28T10:00:28.226741+0200 mgr.dig-mon1.fownxo [INF]
>>>>>>>> osd.253 now down
>>>>>>>> 2025-04-28T10:00:28.227249+0200 mgr.dig-mon1.fownxo [INF] Daemon
>>>>>>>> osd.253 on dig-osd4 was already removed
>>>>>>>> 2025-04-28T10:00:28.228929+0200 mgr.dig-mon1.fownxo [INF]
>>>>>>>> Successfully destroyed old osd.253 on dig-osd4; ready for
>>>>>>>> replacement
>>>>>>>> 2025-04-28T10:00:28.228994+0200 mgr.dig-mon1.fownxo [INF] Zapping
>>>>>>>> devices for osd.253 on dig-osd4
>>>>>>>> 2025-04-28T10:00:39.132028+0200 mgr.dig-mon1.fownxo [INF]
>>>>>>>> osd.381 now down
>>>>>>>> 2025-04-28T10:00:39.132599+0200 mgr.dig-mon1.fownxo [INF] Daemon
>>>>>>>> osd.381 on dig-osd6 was already removed
>>>>>>>> 2025-04-28T10:00:39.133424+0200 mgr.dig-mon1.fownxo [INF]
>>>>>>>> Successfully destroyed old osd.381 on dig-osd6; ready for
>>>>>>>> replacement
>>>>>>>>
>>>>>>>> except that the "Zapping..." message is not present for the
>>>>>>>> second OSD...
>>>>>>>>
>>>>>>>> I tried to increase the mgr log verbosity with 'ceph tell
>>>>>>>> mgr.dig-mon1.fownxo config set debug_mgr 20/20' and then
>>>>>>>> stopped/started the upgrade, without any additional message being
>>>>>>>> displayed.
>>>>>>>>
>>>>>>>> Michel
>>>>>>>>
>>>>>>>> On 28/04/2025 at 09:20, Eugen Block wrote:
>>>>>>>>> Have you increased the debug level for the mgr? It would surprise
>>>>>>>>> me if stray daemons would really block an upgrade. But debug logs
>>>>>>>>> might reveal something. And if it can be confirmed that the strays
>>>>>>>>> really block the upgrade, you could either remove the OSDs
>>>>>>>>> entirely
>>>>>>>>> (they are already drained) to continue upgrading, or create a
>>>>>>>>> tracker issue to report this and wait for instructions.
>>>>>>>>>
>>>>>>>>> Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>>>>>>>>>
>>>>>>>>>> Hi Eugen,
>>>>>>>>>>
>>>>>>>>>> Yes, I have already stopped and restarted the upgrade several
>>>>>>>>>> times, in particular after failing over the mgr. And the only
>>>>>>>>>> related messages are the 'upgrade started' and 'upgrade canceled'
>>>>>>>>>> ones. Nothing related to an error or a crash...
>>>>>>>>>>
>>>>>>>>>> For me the question is why I have stray daemons after removing the
>>>>>>>>>> OSDs. IMO it is unexpected, as these daemons are not there anymore.
>>>>>>>>>> I could understand stray daemons preventing the upgrade from
>>>>>>>>>> starting if they were really stray... And it would be nice if
>>>>>>>>>> cephadm gave a message about why the upgrade does not really start
>>>>>>>>>> despite its status being "in progress"...
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>>
>>>>>>>>>> Michel
>>>>>>>>>> Sent from my mobile
>>>>>>>>>> On 28 April 2025 at 07:27:44, Eugen Block <ebl...@nde.ag> wrote:
>>>>>>>>>>
>>>>>>>>>>> Do you see anything in the mgr log? To get fresh logs I would
>>>>>>>>>>> cancel
>>>>>>>>>>> the upgrade (ceph orch upgrade stop) and then try again.
>>>>>>>>>>> A workaround could be to manually upgrade the mgr daemons by
>>>>>>>>>>> changing
>>>>>>>>>>> their unit.run file, but that would be my last resort. Btw, did you
>>>>>>>>>>> stop and start the upgrade after failing the mgr as well?
>>>>>>>>>>>
>>>>>>>>>>> Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>>>>>>>>>>>
>>>>>>>>>>>> Eugen,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the hint. Here is the osd_remove_queue:
>>>>>>>>>>>>
>>>>>>>>>>>> [root@ijc-mon1 ~]# ceph config-key get
>>>>>>>>>>>> mgr/cephadm/osd_remove_queue|jq
>>>>>>>>>>>> [
>>>>>>>>>>>>   {
>>>>>>>>>>>>     "osd_id": 253,
>>>>>>>>>>>>     "started": true,
>>>>>>>>>>>>     "draining": false,
>>>>>>>>>>>>     "stopped": false,
>>>>>>>>>>>>     "replace": true,
>>>>>>>>>>>>     "force": false,
>>>>>>>>>>>>     "zap": true,
>>>>>>>>>>>>     "hostname": "dig-osd4",
>>>>>>>>>>>>     "drain_started_at": null,
>>>>>>>>>>>>     "drain_stopped_at": null,
>>>>>>>>>>>>     "drain_done_at": "2025-04-15T14:09:30.521534Z",
>>>>>>>>>>>>     "process_started_at": "2025-04-15T14:09:14.091592Z"
>>>>>>>>>>>>   },
>>>>>>>>>>>>   {
>>>>>>>>>>>>     "osd_id": 381,
>>>>>>>>>>>>     "started": true,
>>>>>>>>>>>>     "draining": false,
>>>>>>>>>>>>     "stopped": false,
>>>>>>>>>>>>     "replace": true,
>>>>>>>>>>>>     "force": false,
>>>>>>>>>>>>     "zap": false,
>>>>>>>>>>>>     "hostname": "dig-osd6",
>>>>>>>>>>>>     "drain_started_at": "2025-04-23T11:56:09.864724Z",
>>>>>>>>>>>>     "drain_stopped_at": null,
>>>>>>>>>>>>     "drain_done_at": "2025-04-25T06:53:03.678729Z",
>>>>>>>>>>>>     "process_started_at": "2025-04-23T11:56:05.924923Z"
>>>>>>>>>>>>   }
>>>>>>>>>>>> ]
>>>>>>>>>>>>
>>>>>>>>>>>> It is not empty: the two stray daemons are listed. Not sure if
>>>>>>>>>>>> these entries are expected, as I specified --replace... A similar
>>>>>>>>>>>> issue was reported in https://tracker.ceph.com/issues/67018, so
>>>>>>>>>>>> before Reef, but the cause may be different. It is still not
>>>>>>>>>>>> clear to me how to get out of this, except maybe by replacing
>>>>>>>>>>>> the OSDs, but this will take some time...
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Michel
>>>>>>>>>>>>
>>>>>>>>>>>> On 27/04/2025 at 10:21, Eugen Block wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> what's the current ceph status? Wasn't there a bug in early
>>>>>>>>>>>>> Reef
>>>>>>>>>>>>> versions preventing upgrades if there were removed OSDs in the
>>>>>>>>>>>>> queue? But IIRC, the cephadm module would crash. Can you check
>>>>>>>>>>>>>
>>>>>>>>>>>>> ceph config-key get mgr/cephadm/osd_remove_queue
>>>>>>>>>>>>>
>>>>>>>>>>>>> And then I would check the mgr log, maybe set it to a
>>>>>>>>>>>>> higher debug
>>>>>>>>>>>>> level to see what's blocking it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Quoting Michel Jouvin <michel.jou...@ijclab.in2p3.fr>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried to restart all the mgrs (we have 3: 1 active, 2
>>>>>>>>>>>>>> standby) by running `ceph mgr fail` 3 times, with no impact. I
>>>>>>>>>>>>>> don't really understand why I get these stray daemons after
>>>>>>>>>>>>>> doing a 'ceph orch osd rm --replace', but I think I have always
>>>>>>>>>>>>>> seen this. I tried to mute rather than disable the stray daemon
>>>>>>>>>>>>>> check, but it doesn't help either. And I find it strange to get
>>>>>>>>>>>>>> this message every 10s about one of the destroyed OSDs, and
>>>>>>>>>>>>>> only one, reporting that it is down and already destroyed and
>>>>>>>>>>>>>> saying it'll zap it (I think I didn't add --zap when I removed
>>>>>>>>>>>>>> it, as the underlying disk is dead).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm completely stuck with this upgrade and I don't remember
>>>>>>>>>>>>>> having this kind of problem in previous upgrades with
>>>>>>>>>>>>>> cephadm... Any idea where to look for the cause and/or how to
>>>>>>>>>>>>>> fix it?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Michel
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 24/04/2025 at 23:34, Michel Jouvin wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm trying to upgrade a (cephadm) cluster from 18.2.2 to
>>>>>>>>>>>>>>> 18.2.6,
>>>>>>>>>>>>>>> using 'ceph orch upgrade'. When I enter the command 'ceph
>>>>>>>>>>>>>>> orch
>>>>>>>>>>>>>>> upgrade start --ceph-version 18.2.6', I receive a message
>>>>>>>>>>>>>>> saying
>>>>>>>>>>>>>>> that the upgrade has been initiated, with a similar
>>>>>>>>>>>>>>> message in
>>>>>>>>>>>>>>> the logs but nothing happens after this. 'ceph orch upgrade
>>>>>>>>>>>>>>> status' says:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [root@ijc-mon1 ~]# ceph orch upgrade status
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>     "target_image": "quay.io/ceph/ceph:v18.2.6",
>>>>>>>>>>>>>>>     "in_progress": true,
>>>>>>>>>>>>>>>     "which": "Upgrading all daemon types on all hosts",
>>>>>>>>>>>>>>>     "services_complete": [],
>>>>>>>>>>>>>>>     "progress": "",
>>>>>>>>>>>>>>>     "message": "",
>>>>>>>>>>>>>>>     "is_paused": false
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> -------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The first time I entered the command, the cluster status was
>>>>>>>>>>>>>>> HEALTH_WARN because of 2 stray daemons (caused by
>>>>>>>>>>>>>>> destroyed OSDs,
>>>>>>>>>>>>>>> rm --replace). I set mgr/cephadm/warn_on_stray_daemons to
>>>>>>>>>>>>>>> false to ignore these 2 daemons; the cluster is now HEALTH_OK,
>>>>>>>>>>>>>>> but it doesn't help. Following a Red Hat KB entry, I tried to
>>>>>>>>>>>>>>> fail over the mgr and stopped and restarted the upgrade, but
>>>>>>>>>>>>>>> without any improvement. I have not seen anything in the logs,
>>>>>>>>>>>>>>> except that
>>>>>>>>>>>>>>> there is an INF entry every 10s about the destroyed OSD
>>>>>>>>>>>>>>> saying:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2025-04-24T21:30:54.161988+0000 mgr.ijc-mon1.yyfnhz
>>>>>>>>>>>>>>> (mgr.55376028) 14079 : cephadm [INF] osd.253 now down
>>>>>>>>>>>>>>> 2025-04-24T21:30:54.162601+0000 mgr.ijc-mon1.yyfnhz
>>>>>>>>>>>>>>> (mgr.55376028) 14080 : cephadm [INF] Daemon osd.253 on
>>>>>>>>>>>>>>> dig-osd4
>>>>>>>>>>>>>>> was already removed
>>>>>>>>>>>>>>> 2025-04-24T21:30:54.164440+0000 mgr.ijc-mon1.yyfnhz
>>>>>>>>>>>>>>> (mgr.55376028) 14081 : cephadm [INF] Successfully
>>>>>>>>>>>>>>> destroyed old
>>>>>>>>>>>>>>> osd.253 on dig-osd4; ready for replacement
>>>>>>>>>>>>>>> 2025-04-24T21:30:54.164536+0000 mgr.ijc-mon1.yyfnhz
>>>>>>>>>>>>>>> (mgr.55376028) 14082 : cephadm [INF] Zapping devices for
>>>>>>>>>>>>>>> osd.253
>>>>>>>>>>>>>>> on dig-osd4
>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The message seems to be only for one of the 2 destroyed OSDs
>>>>>>>>>>>>>>> since I restarted the mgr. Could this be the cause of the
>>>>>>>>>>>>>>> stuck upgrade? What can I do to fix this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks in advance for any hint. Best regards,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Michel
>>>>>>>>>>>>>>>
>>
>>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
