[ceph-users] Problems with orchestrator

Carsten Götze via ceph-users Tue, 16 Dec 2025 04:54:53 -0800

Hi!

I'm running a ceph cluster using cephadm and recently upgraded from squid to 
tentacle 20.2.0.
Everything worked fine until I recently started the nfs module. The nfs 
daemons were reported to be running, but after some 10 minutes all of them 
except one were reported to be dead. The nfs service on port 2049 was never 
available on any of the nodes, even while the daemons were supposedly running.
As I found out later, the nfs daemons were never started at all, because the 
setup process required systemd-firewalld to be installed on the system, which 
of course it wasn't.
After some headaches with the newly installed firewalld I decided to roll back, 
remove firewalld and postpone the nfs deployment.
I then tried to stop the nfs daemons with 'ceph orch daemon stop', which did 
nothing, even after waiting some 10 minutes. I had to reissue the command 
several times to make the reportedly dead nfs daemons vanish from the 'ceph 
orch ps' list. The one daemon that was reported to be still running, however, 
would only die after 'ceph orch daemon stop --force'; it was in an 'error' 
state thereafter and could not be removed from the 'ceph orch ps' list by any 
means. So I decided to delete the managing nfs service from the 'ceph orch ls' 
list, in the hope that it would also tear down the remaining nfs daemon.
This was obviously a bad idea, since the service is now stuck in the 'deleting' 
state. It cannot be deleted because there is still the one daemon in the error 
state, which in turn cannot be deleted because it was never running at all.
As a last resort I forcefully removed the docker container on the node with 
the cephadm command, but even though there are no traces left of that nfs 
daemon, it is still listed when running 'ceph orch ps'.
I also noticed that 'ceph orch device ls' is out of sync with reality and that 
'ceph orch ps' is still listing OSDs that I have already shut down and deleted. 
I therefore suspect that the orchestrator has stopped collecting state 
information from the nodes.
Is there a way to force the orchestrator to sync its state information with the 
nodes?
Where do I find meaningful logs for the orchestrator?


With best regards,
Carsten Götze
