Hi all,
one problem solved, another one coming up. For anyone who ends up in the same
situation: the trick seems to be to get all OSDs marked up first and only then
allow recovery. Steps to take:
- set noout, nodown, norebalance, norecover
- wait patiently until all OSDs are shown as up
- unset norebalance, norecover
- wait wait wait, PGs will eventually become active as OSDs become responsive
- unset nodown, noout
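For reference, the steps above as a rough shell sketch. `ceph osd set` and `ceph osd unset` are the standard CLI calls for these flags; the function names and the CEPH variable are only my own illustration (not part of Ceph), and the waiting steps are deliberately left manual, since how long to wait depends on the cluster.

```shell
#!/bin/sh
# Sketch of the flag sequence. The helper names and the CEPH variable are
# hypothetical; only "ceph osd set/unset <flag>" comes from the Ceph CLI.
CEPH="${CEPH:-ceph}"

# Set each given OSD flag cluster-wide, stopping at the first failure.
set_flags() {
    for flag in "$@"; do
        "$CEPH" osd set "$flag" || return 1
    done
}

# Unset each given OSD flag cluster-wide, stopping at the first failure.
unset_flags() {
    for flag in "$@"; do
        "$CEPH" osd unset "$flag" || return 1
    done
}

# Step 1: freeze OSD state and stop data movement.
freeze_cluster() { set_flags noout nodown norebalance norecover; }

# Step 3: once all OSDs are shown as up, allow rebalancing and recovery.
allow_recovery() { unset_flags norebalance norecover; }

# Step 5: once PGs are active, return to normal down/out handling.
release_cluster() { unset_flags nodown noout; }
```

Usage would be roughly: run `freeze_cluster`, watch `ceph osd stat` until all OSDs are up, run `allow_recovery`, watch `ceph status` until the PGs are active, then run `release_cluster`.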
Now the new problem: I have an ever-growing list of OSDs shown as rebalancing
in the progress section, but nothing is actually rebalancing. How can I stop
this growth, and how can I get rid of this list?
[root@gnosis ~]# ceph status
cluster:
id: XXX
health: HEALTH_WARN
noout flag(s) set
Slow OSD heartbeats on back (longest 634775.858ms)
Slow OSD heartbeats on front (longest 635210.412ms)
1 pools nearfull
services:
mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 6m)
mgr: ceph-25(active, since 57m), standbys: ceph-26, ceph-01, ceph-02, ceph-03
mds: con-fs2:8 4 up:standby 8 up:active
osd: 1260 osds: 1258 up (since 24m), 1258 in (since 45m)
flags noout
data:
pools: 14 pools, 25065 pgs
objects: 1.97G objects, 3.5 PiB
usage: 4.4 PiB used, 8.7 PiB / 13 PiB avail
pgs: 25028 active+clean
30 active+clean+scrubbing+deep
7 active+clean+scrubbing
io:
client: 1.3 GiB/s rd, 718 MiB/s wr, 7.71k op/s rd, 2.54k op/s wr
progress:
Rebalancing after osd.135 marked in (1s)
[=====================.......]
Rebalancing after osd.69 marked in (2s)
[========================....]
Rebalancing after osd.75 marked in (2s)
[=======================.....]
Rebalancing after osd.173 marked in (2s)
[========================....]
Rebalancing after osd.42 marked in (1s)
[=============...............] (remaining: 2s)
Rebalancing after osd.104 marked in (2s)
[========================....]
Rebalancing after osd.82 marked in (2s)
[========================....]
Rebalancing after osd.107 marked in (2s)
[=======================.....]
Rebalancing after osd.19 marked in (2s)
[=======================.....]
Rebalancing after osd.67 marked in (2s)
[=====================.......]
Rebalancing after osd.46 marked in (2s)
[===================.........] (remaining: 1s)
Rebalancing after osd.123 marked in (2s)
[=======================.....]
Rebalancing after osd.66 marked in (2s)
[====================........]
Rebalancing after osd.12 marked in (2s)
[==============..............] (remaining: 2s)
Rebalancing after osd.95 marked in (2s)
[=====================.......]
Rebalancing after osd.134 marked in (2s)
[=======================.....]
Rebalancing after osd.14 marked in (1s)
[===================.........]
Rebalancing after osd.56 marked in (2s)
[=====================.......]
Rebalancing after osd.143 marked in (1s)
[========================....]
Rebalancing after osd.118 marked in (2s)
[=======================.....]
Rebalancing after osd.96 marked in (2s)
[========================....]
Rebalancing after osd.105 marked in (2s)
[=======================.....]
Rebalancing after osd.44 marked in (1s)
[=======.....................] (remaining: 5s)
Rebalancing after osd.41 marked in (1s)
[==============..............] (remaining: 1s)
Rebalancing after osd.9 marked in (2s)
[=...........................] (remaining: 37s)
Rebalancing after osd.58 marked in (2s)
[======......................] (remaining: 8s)
Rebalancing after osd.140 marked in (1s)
[=======================.....]
Rebalancing after osd.132 marked in (2s)
[========================....]
Rebalancing after osd.31 marked in (1s)
[=========================...]
Rebalancing after osd.110 marked in (2s)
[========================....]
Rebalancing after osd.21 marked in (2s)
[=========================...]
Rebalancing after osd.114 marked in (2s)
[=======================.....]
Rebalancing after osd.83 marked in (2s)
[=======================.....]
Rebalancing after osd.23 marked in (1s)
[=======================.....]
Rebalancing after osd.25 marked in (1s)
[==========================..]
Rebalancing after osd.147 marked in (2s)
[========================....]
Rebalancing after osd.62 marked in (1s)
[======================......]
Rebalancing after osd.57 marked in (2s)
[======================......]
Rebalancing after osd.61 marked in (2s)
[====================........]
Rebalancing after osd.71 marked in (2s)
[===================.........]
Rebalancing after osd.80 marked in (2s)
[======================......]
Rebalancing after osd.92 marked in (2s)
[=====================.......]
Rebalancing after osd.171 marked in (2s)
[========================....]
Rebalancing after osd.11 marked in (2s)
[===========.................] (remaining: 2s)
Rebalancing after osd.90 marked in (2s)
[====================........]
Rebalancing after osd.54 marked in (2s)
[====================........]
Rebalancing after osd.45 marked in (2s)
[===================.........] (remaining: 1s)
Rebalancing after osd.53 marked in (1s)
[====================........]
Rebalancing after osd.22 marked in (3s)
[=======================.....]
Rebalancing after osd.27 marked in (2s)
[========================....]
Rebalancing after osd.37 marked in (2s)
[===.........................] (remaining: 14s)
Rebalancing after osd.94 marked in (2s)
[=======================.....]
Rebalancing after osd.55 marked in (2s)
[=====.......................] (remaining: 10s)
Rebalancing after osd.35 marked in (2s)
[=...........................] (remaining: 31s)
Rebalancing after osd.43 marked in (2s)
[================............] (remaining: 2s)
Rebalancing after osd.13 marked in (2s)
[=============...............] (remaining: 2s)
Rebalancing after osd.79 marked in (2s)
[=========================...]
Rebalancing after osd.50 marked in (2s)
[======......................] (remaining: 7s)
Rebalancing after osd.33 marked in (1s)
[............................]
Rebalancing after osd.20 marked in (1s)
[=======================.....]
Rebalancing after osd.59 marked in (2s)
[=====================.......]
Rebalancing after osd.101 marked in (2s)
[======================......]
Rebalancing after osd.49 marked in (2s)
[=====.......................] (remaining: 9s)
Rebalancing after osd.36 marked in (2s)
[==..........................] (remaining: 20s)
Rebalancing after osd.133 marked in (2s)
[=======================.....]
Rebalancing after osd.29 marked in (2s)
[======================......]
Rebalancing after osd.8 marked in (2s)
[===.........................] (remaining: 14s)
Rebalancing after osd.16 marked in (2s)
[========================....]
Rebalancing after osd.38 marked in (2s)
[===========.................] (remaining: 2s)
Rebalancing after osd.68 marked in (2s)
[=======================.....]
Rebalancing after osd.130 marked in (2s)
[======================......]
Rebalancing after osd.117 marked in (2s)
[======================......]
Rebalancing after osd.155 marked in (2s)
[========================....]
Rebalancing after osd.10 marked in (2s)
[==============..............] (remaining: 1s)
Rebalancing after osd.141 marked in (1s)
[=======================.....]
Rebalancing after osd.52 marked in (2s)
[====================........] (remaining: 1s)
Rebalancing after osd.177 marked in (1s)
[=======================.....]
Rebalancing after osd.97 marked in (1s)
[=======================.....]
Rebalancing after osd.98 marked in (1s)
[======================......]
Rebalancing after osd.88 marked in (2s)
[=====================.......]
Rebalancing after osd.116 marked in (2s)
[========================....]
Rebalancing after osd.108 marked in (2s)
[======================......]
Rebalancing after osd.17 marked in (1s)
[=====================.......]
Rebalancing after osd.129 marked in (2s)
[====================........]
Rebalancing after osd.167 marked in (2s)
[======================......]
Rebalancing after osd.152 marked in (2s)
[=======================.....]
Rebalancing after osd.77 marked in (2s)
[=======================.....]
Rebalancing after osd.5 marked in (2s)
[========....................] (remaining: 5s)
Rebalancing after osd.121 marked in (1s)
[======================......]
Rebalancing after osd.26 marked in (2s)
[==========================..]
Rebalancing after osd.91 marked in (2s)
[=======================.....]
Rebalancing after osd.81 marked in (2s)
[========================....]
Rebalancing after osd.48 marked in (2s)
[=====.......................] (remaining: 9s)
Rebalancing after osd.32 marked in (2s)
[=====================.......]
Rebalancing after osd.125 marked in (2s)
[========================....]
Rebalancing after osd.111 marked in (2s)
[======================......]
Rebalancing after osd.151 marked in (2s)
[======================......]
Rebalancing after osd.39 marked in (2s)
[============................] (remaining: 2s)
Rebalancing after osd.136 marked in (2s)
[========================....]
Rebalancing after osd.112 marked in (1s)
[=========================...]
Rebalancing after osd.154 marked in (1s)
[=========================...]
Rebalancing after osd.64 marked in (2s)
[===================.........]
Rebalancing after osd.34 marked in (2s)
[............................] (remaining: 90s)
Rebalancing after osd.161 marked in (1s)
[========================....]
Rebalancing after osd.160 marked in (2s)
[=======================.....]
Rebalancing after osd.142 marked in (2s)
[=======================.....]
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <[email protected]>
Sent: Wednesday, July 12, 2023 9:53 AM
To: [email protected]
Subject: [ceph-users] Cluster down after network outage
Hi all,
we had a network outage last night (power loss) and restored the network in
the morning. All OSDs were running during this period. After the network was
restored, peering hell broke loose, and the cluster is having a hard time
coming back up. OSDs get marked down all the time and come back later. Peering
never stops. Below is the current status. I had all OSDs shown as up for a
while, but many were not responsive. Are there any flags that help bring
things up in a sequence that puts less load on the system?
[root@gnosis ~]# ceph status
cluster:
id: XXX
health: HEALTH_WARN
2 clients failing to respond to capability release
6 MDSs report slow metadata IOs
3 MDSs report slow requests
nodown,noout,nobackfill,norecover flag(s) set
176 osds down
Slow OSD heartbeats on back (longest 551718.679ms)
Slow OSD heartbeats on front (longest 549598.330ms)
Reduced data availability: 8069 pgs inactive, 3786 pgs down, 3161 pgs peering, 1341 pgs stale
Degraded data redundancy: 1187354920/16402772667 objects degraded (7.239%), 6222 pgs degraded, 6231 pgs undersized
1 pools nearfull
17386 slow ops, oldest one blocked for 1811 sec, daemons [osd.1128,osd.1152,osd.1154,osd.12,osd.1227,osd.1244,osd.328,osd.354,osd.381,osd.4]... have slow ops.
services:
mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 28m)
mgr: ceph-25(active, since 30m), standbys: ceph-26, ceph-01, ceph-02, ceph-03
mds: con-fs2:8 4 up:standby 8 up:active
osd: 1260 osds: 1082 up (since 6m), 1258 in (since 18m); 266 remapped pgs
flags nodown,noout,nobackfill,norecover
data:
pools: 14 pools, 25065 pgs
objects: 1.91G objects, 3.4 PiB
usage: 3.1 PiB used, 6.0 PiB / 9.0 PiB avail
pgs: 0.626% pgs unknown
31.566% pgs not active
1187354920/16402772667 objects degraded (7.239%)
51/16402772667 objects misplaced (0.000%)
11706 active+clean
4752 active+undersized+degraded
3286 down
2702 peering
799 undersized+degraded+peered
464 stale+down
418 stale+active+undersized+degraded
214 remapped+peering
157 unknown
128 stale+peering
117 stale+remapped+peering
101 stale+undersized+degraded+peered
57 stale+active+undersized+degraded+remapped+backfilling
35 down+remapped
26 stale+undersized+degraded+remapped+backfilling+peered
23 undersized+degraded+remapped+backfilling+peered
14 active+clean+scrubbing+deep
9 stale+active+undersized+degraded+remapped+backfill_wait
7 active+recovering+undersized+degraded
7 stale+active+recovering+undersized+degraded
6 active+undersized+degraded+remapped+backfilling
6 active+undersized
5 active+undersized+degraded+remapped+backfill_wait
5 stale+remapped
4 stale+activating+undersized+degraded
3 active+undersized+remapped
3 stale+undersized+degraded+remapped+backfill_wait+peered
1 activating+undersized+degraded
1 activating+undersized+degraded+remapped
1 undersized+degraded+remapped+backfill_wait+peered
1 stale+active+clean
1 active+recovering
1 stale+down+remapped
1 undersized+peered
1 active+undersized+degraded+remapped
1 active+clean+scrubbing
1 active+clean+remapped
1 active+recovering+degraded
io:
client: 1.8 MiB/s rd, 18 MiB/s wr, 409 op/s rd, 796 op/s wr
Thanks for any hints!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]