Hi,

The network my 4-node cluster uses broke down after a driver issue a week
ago. The network has since resumed normal operation, and my Ceph 19 cluster
first reported HEALTH_WARN and informed me of a lengthy recovery process.
About one hour later, however, all I got was this error message:

mixtile@blade3n1:~$ sudo ceph -s
[sudo] password for mixtile:
2026-05-25T13:35:51.685+0200 ffff9701f180  0 monclient(hunting): authenticate timed out after 300
[errno 110] RADOS timed out (error connecting to the cluster)

A restart of all nodes did *not* help. Even worse: The Docker containers
with the various processes (mon, mgr, crash, …) started disappearing one by
one! Here is what remained:

mixtile@blade3n1:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED          STATUS          PORTS     NAMES
2039b18ba392   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   47 minutes ago   Up 47 minutes             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n1
16cb4a6822f2   af0c5903e901                              "/usr/bin/ceph-crash…"   47 minutes ago   Up 47 minutes             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n1

mixtile@blade3n2:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED       STATUS       PORTS     NAMES
42361a694abf   quay.io/prometheus/prometheus:v2.51.0     "/bin/prometheus --c…"   2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-prometheus-blade3n2
88b085d000a8   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n2
7bb17808bdd8   quay.io/prometheus/alertmanager:v0.25.0   "/bin/alertmanager -…"   2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-alertmanager-blade3n2
a43cfe36da29   quay.io/ceph/grafana:10.4.0               "/run.sh"                2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-grafana-blade3n2
a95140b707ee   af0c5903e901                              "/usr/bin/ceph-crash…"   2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n2
38d4ca4035b0   af0c5903e901                              "/usr/bin/ceph-mon -…"   2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-mon-blade3n2

mixtile@blade3n3:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED       STATUS       PORTS     NAMES
d664dfe30bd8   af0c5903e901                              "/usr/bin/ceph-crash…"   2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n3
a64ac00dc28b   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   2 hours ago   Up 2 hours             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n3
f7de98403a10   netdata/netdata                           "/usr/sbin/run.sh"

mixtile@blade3n4:~$ docker ps -a
CONTAINER ID   IMAGE                                     COMMAND                  CREATED          STATUS          PORTS     NAMES
d437cec7d6bf   quay.io/prometheus/node-exporter:v1.7.0   "/bin/node_exporter …"   54 minutes ago   Up 53 minutes             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-node-exporter-blade3n4
c6d5ac595857   af0c5903e901                              "/usr/bin/ceph-crash…"   54 minutes ago   Up 53 minutes             ceph-8aad3073-39a1-11f1-bf6e-f2704a1efa9b-crash-blade3n4

As you can see, there are far fewer Ceph-related processes than expected.
The rest haven't just crashed: the corresponding images have also
disappeared! Trying to pull a missing image didn't work:

mixtile@blade3n1:~$ docker run ceph-mon
Unable to find image 'ceph-mon:latest' locally
docker: Error response from daemon: pull access denied for ceph-mon, repository does not exist or may require 'docker login'

This is my exact system and Ceph version BTW:

mixtile@blade3n1:~$ ceph -v
ceph version 19.2.3 (c92aebb279828e9c3c1f5d24613efca272649e62) squid (stable)

mixtile@blade3n1:~$ uname -a
Linux blade3n1 6.1.0-1027-rockchip #27 SMP Sun Apr 27 01:54:34 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

The drives my data is stored on still seem to be there (according to lsblk),
but as almost all OSD processes are gone, I can no longer access them. I've
got four hosts, of which #1 is the admin node. #2 also hosts Ganesha NFS
for external clients.

So: What can I do to bring my cluster back to life without endangering my
data? Thank you.

Kind regards
Jacek Rużyczka