Hi,
this sounds a bit like a "classic" quorum loss, resulting in cascading
daemon failures. There seems to be one surviving MON (blade3n2), which
could be a good starting point for disaster recovery: reduce the
monmap to this one MON. That should give you back quorum and a working
cluster. You might need to run 'systemctl reset-failed...' to let
systemd start the containers. Although the product has been
discontinued, the relevant section of the SUSE docs [0] still applies
("Restoring the MONs quorum"). Back up the mon store of each node
before you do this, just in case. The procedure itself has worked many
times for me, but maybe there's an easier way, especially if you're
not too familiar with cephadm or this procedure.
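As a rough illustration of the monmap reduction described above, here is a dry-run sketch that only prints the commands in order and executes nothing. The FSID is taken from the container names in the quoted output. The unit name pattern and the mon store path are the usual cephadm layout and are assumptions here, as is the helper name monmap_plan. The MON names passed at the bottom (blade3n1, blade3n3) are purely illustrative, since the thread does not say which hosts carried MONs. Please compare each step with [0] before running anything.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the recovery steps, does NOT execute them.

FSID="8aad3073-39a1-11f1-bf6e-f2704a1efa9b"  # from the container names in the thread
SURVIVOR="blade3n2"                          # the MON that is still running

# Print the plan for shrinking the monmap to SURVIVOR.
# Arguments: names of the MONs to remove (assumption: known from your own cluster).
monmap_plan() {
  local unit="ceph-${FSID}@mon.${SURVIVOR}.service"   # assumed cephadm unit name
  local store="/var/lib/ceph/${FSID}/mon.${SURVIVOR}" # assumed cephadm mon store path
  local m
  echo "systemctl stop ${unit}"
  echo "cp -a ${store} ${store}.bak"
  # The indented commands are meant to be run inside the cephadm shell.
  echo "cephadm shell --name mon.${SURVIVOR}"
  echo "  ceph-mon -i ${SURVIVOR} --extract-monmap /tmp/monmap"
  for m in "$@"; do
    echo "  monmaptool /tmp/monmap --rm ${m}"
  done
  echo "  monmaptool --print /tmp/monmap"
  echo "  ceph-mon -i ${SURVIVOR} --inject-monmap /tmp/monmap"
  echo "systemctl start ${unit}"
}

# Illustrative call; replace the arguments with the MONs that are really in your monmap.
monmap_plan blade3n1 blade3n3
```

Printing the plan first makes it easy to check the unit name and paths against what is actually on the node before touching the mon store.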
But before you do that: do you have MON logs that explain why
they refuse to start?
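If it helps with collecting those logs, the following sketch prints the commands one would typically try per host. It assumes the usual cephadm unit naming (ceph-<fsid>@mon.<host>.service) and that 'cephadm logs' is available on the node. The helper name mon_log_cmds and the host passed to it are only examples, because the thread does not say which hosts ran a MON.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints log commands for the given MON hosts, runs nothing.

FSID="8aad3073-39a1-11f1-bf6e-f2704a1efa9b"  # from the container names in the thread

# Arguments: host names that are expected to run a MON (assumption).
mon_log_cmds() {
  local host
  for host in "$@"; do
    # Last 200 journal lines of the MON's systemd unit (assumed unit name).
    echo "journalctl -u ceph-${FSID}@mon.${host}.service --no-pager -n 200"
    # cephadm wrapper around the same journal.
    echo "cephadm logs --name mon.${host}"
  done
}

# Illustrative call for a single host.
mon_log_cmds blade3n1
```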
Regarding the Ceph images: your cluster uses af0c5903e901 for the Ceph
services. What does 'docker images | grep af0c5903e901' show? I doubt
it's "ceph-mon:latest"; at least I haven't seen that name in use with
cephadm (I have the impression that this is a "regular" cephadm
cluster).
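To see which images the cephadm containers really reference, a small filter like the one below may help. It is only a sketch: the helper name ceph_images is made up, and it assumes that cephadm-managed containers have names starting with "ceph-", as in the listings quoted below. The sample input is transcribed from the blade3n2 listing, and the docker commands in the comments show how it could be fed on a live node.

```shell
#!/usr/bin/env bash
# Reads lines of "<image> <container name>" on stdin and prints each distinct
# image used by a container whose name starts with "ceph-".
ceph_images() {
  awk '$2 ~ /^ceph-/ { print $1 }' | sort -u
}

# On a node with Docker, this could be fed from the real container list:
#   docker ps -a --format '{{.Image}} {{.Names}}' | ceph_images
# and the repository tags/digests behind an image ID can be inspected with:
#   docker image inspect af0c5903e901 --format '{{json .RepoTags}} {{json .RepoDigests}}'

# Sample input transcribed from the blade3n2 listing in the thread.
printf '%s\n' \
  'quay.io/ceph/grafana:10.4.0 ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-grafana-blade3n2' \
  'af0c5903e901 ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n2' \
  'af0c5903e901 ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-mon-blade3n2' \
  | ceph_images
```

The repository and tag shown by the inspect command would be the name to pull, rather than a bare "ceph-mon".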
Regards,
Eugen
[0] https://documentation.suse.com/pt-br/ses/7.1/html/ses-all/bp-troubleshooting-monitors.html#mons-restoring-quorum
Quoting Jacek Rużyczka via ceph-users <[email protected]>:
Hi,
The network my 4-node cluster uses broke down after a driver issue a week
ago. The network has since resumed normal operation. My Ceph 19 cluster
first reported HEALTH_WARN and informed me of a lengthy recovery process,
but about an hour later, I only found this error message:
mixtile@blade3n1:~$ sudo ceph -s
[sudo] password for mixtile:
2026-05-25T13:35:51.685+0200 ffff9701f180 0 monclient(hunting): authenticate timed out after 300
[errno 110] RADOS timed out (error connecting to the cluster)
A restart of all nodes did *not* help. Even worse: the Docker containers
with the various processes (mon, mgr, crash,…) started disappearing one by
one! Here is what remained:
mixtile@blade3n1:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED          STATUS          PORTS   NAMES
2039b18ba392   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   47 minutes ago   Up 47 minutes           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n1
16cb4a6822f2   af0c5903e901                              "/usr/bin/ceph-crash…"   47 minutes ago   Up 47 minutes           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n1
mixtile@blade3n2:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED          STATUS          PORTS   NAMES
42361a694abf   quay.io/prometheus/prometheus:v2.51.0     "/bin/prometheus --c…"   2 hours ago      Up 2 hours              ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-prometheus-blade3n2
88b085d000a8   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   2 hours ago      Up 2 hours              ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n2
7bb17808bdd8   quay.io/prometheus/alertmanager:v0.25.0   "/bin/alertmanager -…"   2 hours ago      Up 2 hours              ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-alertmanager-blade3n2
a43cfe36da29   quay.io/ceph/grafana:10.4.0               "/run.sh"                2 hours ago      Up 2 hours              ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-grafana-blade3n2
a95140b707ee   af0c5903e901                              "/usr/bin/ceph-crash…"   2 hours ago      Up 2 hours              ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n2
38d4ca4035b0   af0c5903e901                              "/usr/bin/ceph-mon -…"   2 hours ago      Up 2 hours              ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-mon-blade3n2
mixtile@blade3n3:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED          STATUS          PORTS   NAMES
d664dfe30bd8   af0c5903e901                              "/usr/bin/ceph-crash…"   2 hours ago      Up 2 hours              ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n3
a64ac00dc28b   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   2 hours ago      Up 2 hours              ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n3
f7de98403a10   netdata/netdata                           "/usr/sbin/run.sh"
mixtile@blade3n4:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED          STATUS          PORTS   NAMES
d437cec7d6bf   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   54 minutes ago   Up 53 minutes           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n4
c6d5ac595857   af0c5903e901                              "/usr/bin/ceph-crash…"   54 minutes ago   Up 53 minutes           ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n4
As you can see, there are far fewer Ceph-related processes than expected.
The rest haven't only crashed: in fact, the corresponding images have also
disappeared! Pulling the missing containers didn't work:
mixtile@blade3n1:~$ docker run ceph-mon
Unable to find image 'ceph-mon:latest' locally
docker: Error response from daemon: pull access denied for ceph-mon, repository does not exist or may require 'docker login'
This is my exact system and Ceph version BTW:
mixtile@blade3n1:~$ ceph -v
ceph version 19.2.3 (c92aebb279828e9c3c1f5d24613efca272649e62) squid (stable)
mixtile@blade3n1:~$ uname -a
Linux blade3n1 6.1.0-1027-rockchip #27 SMP Sun Apr 27 01:54:34 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
The drives my data is stored on still seem to be there (according to lsblk),
but as almost all OSD processes are gone, I can no longer access them. I've
got four hosts, of which #1 is the admin node. #2 also hosts Ganesha NFS
for external clients.
So: What can I do to bring my cluster back to life without endangering my
data? Thank you.
Kind regards
Jacek Rużyczka
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]