So the cluster wasn't as clean as I thought it was ;-) I found a couple of legacy OSDs (this lab cluster has been upgraded over the years from Luminous to Squid); maybe those led to this behaviour. Anyway, after draining the hosts again and removing all the orphaned daemons, I started over, and this time draining the hosts worked as expected, except for one thing: I had the norecover flag set so the new OSDs wouldn't receive any data (I wanted to purge them anyway), and the last PG refused to drain until I removed the norecover flag. That was a little unexpected, but I'll leave it alone.
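For reference, the flag handling around the drain looked roughly like this (commands from memory, a sketch rather than an exact transcript, to be run against a live cluster):

```
# Pause recovery so the replacement OSDs don't receive data
# that would be thrown away on purge anyway:
ceph osd set norecover

# ... drain hosts, purge OSDs ...

# The last PG only finished draining once recovery was allowed again:
ceph osd unset norecover
ceph -s    # watch until the cluster reports HEALTH_OK
```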

I think we can consider this thread closed as "invalid" (for now).

But thanks again for your response, Adam!

Quoting Eugen Block <ebl...@nde.ag>:

Thanks, Adam.

Before I purged the nodes again, I looked at the current output of 'ceph orch ps', and indeed, there are still orphaned OSD daemons:

# ceph orch ps --daemon-type osd
NAME   HOST   PORTS  STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID      CONTAINER ID
osd.0  host5         running (17h)  7m ago     17h     201M    4096M  19.2.2     4892a7ef541b  0d3e81e80f8e
osd.0  host6         stopped        7m ago     18h        -    4096M  <unknown>  <unknown>     <unknown>
osd.1  host7         running (17h)  7m ago     18h     246M    4096M  19.2.2     4892a7ef541b  4b81e06eaea7
osd.2  host5         stopped        7m ago     18h        -    4096M  <unknown>  <unknown>     <unknown>


I didn't look at that output before. I'll try to clear that state and then repeat the test.


Quoting Adam King <adk...@redhat.com>:

The daemons cephadm "knows" about are actually just based on the contents of
the /var/lib/ceph/<fsid>/ directory on each host cephadm is managing.
If osd.6 was present, got removed by the host drain process, and its
daemon directory was still on the host (or there was still a container
running for osd.6), that sounds like a bug in the drain process and
worthy of a ticket (I think this is what happened, based on what you said).

If I'm misreading and this was a manual removal of osd.6, then it could be
that cephadm hadn't checked the host for daemons since that removal
happened (you can verify this via the REFRESHED column of `ceph orch ps`;
osd.6 should be listed there if you got this error), or that the removal
process didn't clean up the daemon directory, in which case I wouldn't
consider it to have been a bug. Assuming it's the former case, and you can
show what was actually left on the host for osd.6, or you have a consistent
way to reproduce the failed removal, I can take a look.
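A minimal sketch of that inventory mechanism, using a throwaway directory and a made-up fsid (nothing here touches a real cluster): cephadm's per-host daemon list is essentially the daemon-named subdirectories under /var/lib/ceph/<fsid>/, so a leftover osd.6 directory alone is enough for the daemon to keep showing up.

```shell
# Simulation only: mimic cephadm's view of a host with a temp dir.
# The fsid and daemon names below are made up for illustration.
fsid=12345678-aaaa-bbbb-cccc-1234567890ab
root=$(mktemp -d)

# A leftover daemon directory is all it takes to be "known":
mkdir -p "$root/$fsid/osd.6" "$root/$fsid/mon.host8"

# cephadm enumerates a host's daemons roughly like this:
ls -1 "$root/$fsid"
```

This is why `cephadm rm-daemon` (which deletes that directory and stops the container) clears the stale entry and lets `ceph orch host rm` proceed.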

On Fri, Jul 25, 2025 at 8:01 AM Eugen Block <ebl...@nde.ag> wrote:

Hi *,

an unexpected issue occurred today, at least twice, so it seems kind
of reproducible. I've been preparing a demo in a (virtual) lab cluster
(19.2.2) and wanted to drain multiple hosts. The first time I didn't
pay much attention, but the draining seemed stuck (kind of a common
issue these days), so I intervened and cleaned up until I got into a
healthy state, all good. Then I did my thing: changed the crush tree,
added the removed hosts again, cephadm created the OSDs, and backfill
finished successfully.

Now I wanted to reset the cluster again to my starting point, so I
issued the drain command again for multiple hosts (each host has 2
OSDs):

# for i in {5..8}; do ceph orch host drain host$i; done

This time all OSDs were drained successfully (I watched 'ceph orch osd
rm status'), so I wanted to remove the hosts, but it failed:

# for i in {5..8}; do ceph orch host rm host$i --rm-crush-entry; done
Removed  host 'host5'
Removed  host 'host6'
Removed  host 'host7'
Error EINVAL: Not allowed to remove host8 from cluster. The following
daemons are running in the host:
type                 id
-------------------- ---------------
osd                  6

Please run 'ceph orch host drain host8' to remove daemons from host


But there was nothing left to drain; osd.6 had already been successfully
removed from the crush tree. On host8, however, there was still a daemon
I had to clean up manually:

host8:~ # cephadm rm-daemon --name osd.6 --fsid
543967bc-e586-32b8-bd2c-2d8b8b168f02 --force

I compared the cephadm.log files (3 of the 4 to-be-drained hosts were
drained successfully), and on host8 the rm-daemon command was never
executed (until I ran it manually). Is this a known issue? It doesn't
seem to happen with only one host, at least I didn't notice it in the
past. Should I create a tracker issue for this?

Thanks,
Eugen
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



