I moved some servers to a new rack, and now I'm having major issues
with Ceph after bringing things back up.
I believe the issue may be related to the Ceph nodes coming back up with
different IPs before the VLANs were set. That's just a guess, because I
can't think of any other reason this would happen.
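To test that theory, my plan is to compare the network Ceph expects against
what the hosts actually have now. These are just the generic checks I know
of, nothing cluster-specific:

  # networks the cluster is configured to use
  ceph config get mon public_network
  ceph config get osd cluster_network
  # addresses the mons are registered with
  ceph mon dump
  # what each host actually has right now
  ip -4 addr show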
Current state:
Every 2.0s: ceph -s                cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022

  cluster:
    id:     bfa2ad58-c049-11eb-9098-3c8cf8ed728d
    health: HEALTH_WARN
            1 filesystem is degraded
            2 MDSs report slow metadata IOs
            2/5 mons down, quorum cn02,cn03,cn01
            9 osds down
            3 hosts (17 osds) down
            Reduced data availability: 97 pgs inactive, 9 pgs down
            Degraded data redundancy: 13860144/30824413 objects degraded (44.965%), 411 pgs degraded, 482 pgs undersized

  services:
    mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: cn05,cn04
    mgr: cn02.arszct(active, since 5m)
    mds: 2/2 daemons up, 2 standby
    osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped pgs

  data:
    volumes: 1/2 healthy, 1 recovering
    pools:   8 pools, 545 pgs
    objects: 7.71M objects, 6.7 TiB
    usage:   15 TiB used, 39 TiB / 54 TiB avail
    pgs:     0.367% pgs unknown
             17.431% pgs not active
             13860144/30824413 objects degraded (44.965%)
             1137693/30824413 objects misplaced (3.691%)
             280 active+undersized+degraded
             67  undersized+degraded+remapped+backfilling+peered
             57  active+undersized+remapped
             45  active+clean+remapped
             44  active+undersized+degraded+remapped+backfilling
             18  undersized+degraded+peered
             10  active+undersized
             9   down
             7   active+clean
             3   active+undersized+remapped+backfilling
             2   active+undersized+degraded+remapped+backfill_wait
             2   unknown
             1   undersized+peered

  io:
    client:   170 B/s rd, 0 op/s rd, 0 op/s wr
    recovery: 168 MiB/s, 158 keys/s, 166 objects/s
I have to disable and re-enable the dashboard just to use it. It seems to
get bogged down after a few moments.
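For what it's worth, the disable/re-enable is just the standard mgr module
toggle, roughly this (sometimes I end up failing over the mgr as well):

  ceph mgr module disable dashboard
  ceph mgr module enable dashboard
  # or bounce the active mgr entirely
  ceph mgr fail cn02.arszct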
Ceph has marked the three servers that were moved to the new rack as
"Down", but if I do a cephadm host-check, they all seem to pass:
************************ ceph ************************
--------- cn01.ceph.---------
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
--------- cn02.ceph.---------
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
--------- cn03.ceph.---------
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
--------- cn04.ceph.---------
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
--------- cn05.ceph.---------
podman|docker (/usr/bin/podman) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
--------- cn06.ceph.---------
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
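Since host-check only validates local prerequisites (podman, chrony, etc.),
I'm also planning to confirm the moved hosts can still reach the mons on the
Ceph ports; something along these lines, with the mon IP as a placeholder:

  # from one of the moved hosts, default msgr2/msgr1 ports
  nc -zv <mon-ip> 3300
  nc -zv <mon-ip> 6789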
It seems to be recovering with what it has left, but a large number of OSDs
are down. When I try to restart one of the down OSDs, I see a huge log dump.
Jul 25 03:19:38 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:19:38.532+0000 7fce14a6c080  0 osd.34 30689 done with init, starting boot process
Jul 25 03:19:38 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:19:38.532+0000 7fce14a6c080  1 osd.34 30689 start_boot
Jul 25 03:20:10 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:20:10.655+0000 7fcdfd12d700  1 osd.34 30689 start_boot
Jul 25 03:20:41 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:20:41.159+0000 7fcdfd12d700  1 osd.34 30689 start_boot
Jul 25 03:21:11 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:21:11.662+0000 7fcdfd12d700  1 osd.34 30689 start_boot
At this point it just keeps printing start_boot, but the dashboard has it
marked as "in" but "down".
On the three hosts that moved, a bunch of OSDs are marked "out" and
"down", and some are "in" but "down".
I'm not sure where to go next. For now I'm going to let the recovery continue
and hope that my 4x replication on these pools saves me.
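To double-check that replication assumption, I'll verify size/min_size on the
pools and see exactly which PGs are stuck:

  ceph osd pool ls detail      # replicated size / min_size per pool
  ceph pg dump_stuck inactive  # PGs that are inactive/down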
Any help is very much appreciated. This Ceph cluster holds all of our
CloudStack images... it would be terrible to lose this data.