[ceph-users] Re: Cluster Dead After Network Failure. Connection Timeout.

Eugen Block via ceph-users Tue, 26 May 2026 15:13:32 -0700

Hi,

this sounds a bit like a "classic" quorum loss, resulting in cascading daemon failures. There seems to be one surviving MON (blade3n2), which could be a good starting point for disaster recovery: reduce the monmap to this one MON. That should give you back quorum and a working cluster. You might need to run 'systemctl reset-failed...' to let systemd start the containers. Although the product has been discontinued, the section "Restoring the MONs quorum" of the SUSE docs [0] is still relevant. Back up the mon store of each node before you do this, just in case. The procedure itself has worked many times for me, but maybe there's an easier way, especially if you're not too familiar with cephadm or this procedure.
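For orientation, the monmap reduction could be sketched roughly as below. This is a dry run that only prints the commands and executes nothing. The FSID is read off the container names in the quoted 'docker ps' output; the list of MONs to remove and the backup destination are assumptions, so check the real names with 'monmaptool --print' and follow the linked docs [0] for the authoritative steps.

```shell
#!/bin/sh
# Dry run: prints a possible command sequence, executes nothing.
FSID="8aad3073-39a1-11f1-bf6e-f2704a1efa9b"   # taken from the container names in the thread
SURVIVOR="blade3n2"                            # the MON that is still running
DEAD_MONS="blade3n1 blade3n3"                  # ASSUMPTION: verify with 'monmaptool --print'
BACKUP_DIR="/root"                             # ASSUMPTION: hypothetical backup destination

print_plan() {
  unit="ceph-${FSID}@mon.${SURVIVOR}.service"
  # Stop the MON before touching its store.
  echo "systemctl stop ${unit}"
  # Back up the mon store (repeat this on every host that has a MON).
  echo "cp -a /var/lib/ceph/${FSID}/mon.${SURVIVOR} ${BACKUP_DIR}/mon.${SURVIVOR}.bak"
  # The indented commands are meant to be run inside the cephadm shell.
  echo "cephadm shell --name mon.${SURVIVOR}"
  echo "  ceph-mon -i ${SURVIVOR} --extract-monmap /tmp/monmap"
  echo "  monmaptool --print /tmp/monmap"
  for m in ${DEAD_MONS}; do
    echo "  monmaptool /tmp/monmap --rm ${m}"
  done
  echo "  ceph-mon -i ${SURVIVOR} --inject-monmap /tmp/monmap"
  # Back on the host: clear the failed state and start the MON again.
  echo "systemctl reset-failed ${unit}"
  echo "systemctl start ${unit}"
}

print_plan
```

Printing the plan first makes it easy to compare each step against the docs before running anything by hand.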

But before you do that, do you have MON logs that explain why they refuse to start?
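One possible way to collect those logs on a cephadm host is sketched below. It only prints the commands. The FSID comes from the container names in the quoted output; which hosts actually carry a MON is an assumption, and 'cephadm ls' on each host should show what is really deployed there.

```shell
#!/bin/sh
# Dry run: prints log-collection commands, executes nothing.
FSID="8aad3073-39a1-11f1-bf6e-f2704a1efa9b"   # taken from the container names in the thread
MON_HOSTS="blade3n1 blade3n2 blade3n3"         # ASSUMPTION: hosts that carry a MON

# cephadm names its systemd units ceph-<fsid>@<type>.<id>.service
mon_unit() {
  printf 'ceph-%s@mon.%s.service\n' "$1" "$2"
}

print_log_cmds() {
  for h in ${MON_HOSTS}; do
    echo "# on ${h}:"
    echo "cephadm ls"
    echo "journalctl -u $(mon_unit "${FSID}" "${h}") --no-pager -n 200"
    echo "cephadm logs --name mon.${h}"
  done
}

print_log_cmds
```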

Regarding Ceph images: your cluster uses af0c5903e901 for the Ceph services. What does 'docker images | grep af0c5903e901' show? I doubt it's "ceph-mon:latest", at least I haven't seen those in use with cephadm (I have the impression that this is a "regular" cephadm cluster).


Regards,
Eugen

[0] https://documentation.suse.com/pt-br/ses/7.1/html/ses-all/bp-troubleshooting-monitors.html#mons-restoring-quorum


Quoting Jacek Rużyczka via ceph-users <[email protected]>:

Hi,

The network my 4-node cluster uses broke down after a driver issue a week
ago. Once the network had resumed normal operation, my Ceph 19 cluster
first reported HEALTH_WARN and informed me of a lengthy recovery process,
but about an hour later, all I got was this error message:

mixtile@blade3n1:~$ sudo ceph -s
[sudo] password for mixtile:
2026-05-25T13:35:51.685+0200 ffff9701f180  0 monclient(hunting): authenticate timed out after 300
[errno 110] RADOS timed out (error connecting to the cluster)

A restart of all nodes did *not* help. Even worse: The Docker containers
with the various processes (mon, mgr, crash,…) started disappearing one by
one! Here is what remained:

mixtile@blade3n1:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED          STATUS          PORTS     NAMES
2039b18ba392   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   47 minutes ago   Up 47 minutes             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n1
16cb4a6822f2   af0c5903e901                              "/usr/bin/ceph-crash…"   47 minutes ago   Up 47 minutes             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n1

mixtile@blade3n2:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED       STATUS       PORTS     NAMES
42361a694abf   quay.io/prometheus/prometheus:v2.51.0     "/bin/prometheus --c…"   2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-prometheus-blade3n2
88b085d000a8   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n2
7bb17808bdd8   quay.io/prometheus/alertmanager:v0.25.0   "/bin/alertmanager -…"   2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-alertmanager-blade3n2
a43cfe36da29   quay.io/ceph/grafana:10.4.0               "/run.sh"                2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-grafana-blade3n2
a95140b707ee   af0c5903e901                              "/usr/bin/ceph-crash…"   2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n2
38d4ca4035b0   af0c5903e901                              "/usr/bin/ceph-mon -…"   2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-mon-blade3n2

mixtile@blade3n3:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED       STATUS       PORTS     NAMES
d664dfe30bd8   af0c5903e901                              "/usr/bin/ceph-crash…"   2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n3
a64ac00dc28b   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n3
f7de98403a10   netdata/netdata                           "/usr/sbin/run.sh"

mixtile@blade3n4:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED          STATUS          PORTS     NAMES
d437cec7d6bf   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   54 minutes ago   Up 53 minutes             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n4
c6d5ac595857   af0c5903e901                              "/usr/bin/ceph-crash…"   54 minutes ago   Up 53 minutes             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n4

As you can see, there are far fewer Ceph-related processes than expected.
The rest haven't just crashed: the corresponding images have also
disappeared! Pulling the missing images didn't work:

mixtile@blade3n1:~$ docker run ceph-mon
Unable to find image 'ceph-mon:latest' locally
docker: Error response from daemon: pull access denied for ceph-mon, repository does not exist or may require 'docker login'

This is my exact system and Ceph version BTW:

mixtile@blade3n1:~$ ceph -v
ceph version 19.2.3 (c92aebb279828e9c3c1f5d24613efca272649e62) squid (stable)

mixtile@blade3n1:~$ uname -a
Linux blade3n1 6.1.0-1027-rockchip #27 SMP Sun Apr 27 01:54:34 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

The drives my data is stored on still seem to be there (according to
lsblk), but as almost all OSD processes are gone, I can no longer access
them. I've got four hosts, of which #1 is the admin node. #2 also hosts
Ganesha NFS for external clients.

So: What can I do to bring my cluster back to life without endangering my
data? Thank you.

Kind regards
Jacek Rużyczka
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]


