Hi all,
we had a network outage last night (power loss) and restored the network this
morning. All OSDs kept running throughout. After the network came back, peering
hell broke loose and the cluster is having a hard time coming back up: OSDs get
marked down all the time and come back later, and peering never stops. Below is
the current status. I had all OSDs shown as up for a while, but many were not
responsive. Are there any flags that would help bring things up in a sequence
that puts less load on the system?
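
For reference, the maintenance flags you'll see in the status below were set
with the standard "ceph osd set" commands, roughly like this (sketch):

    # cluster-wide flags to reduce churn while things settle
    ceph osd set nodown      # don't mark OSDs down on missed heartbeats
    ceph osd set noout       # don't mark down OSDs out (avoids rebalancing)
    ceph osd set nobackfill  # pause backfill
    ceph osd set norecover   # pause recovery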
[root@gnosis ~]# ceph status
  cluster:
    id:     XXX
    health: HEALTH_WARN
            2 clients failing to respond to capability release
            6 MDSs report slow metadata IOs
            3 MDSs report slow requests
            nodown,noout,nobackfill,norecover flag(s) set
            176 osds down
            Slow OSD heartbeats on back (longest 551718.679ms)
            Slow OSD heartbeats on front (longest 549598.330ms)
            Reduced data availability: 8069 pgs inactive, 3786 pgs down, 3161 pgs peering, 1341 pgs stale
            Degraded data redundancy: 1187354920/16402772667 objects degraded (7.239%), 6222 pgs degraded, 6231 pgs undersized
            1 pools nearfull
            17386 slow ops, oldest one blocked for 1811 sec, daemons [osd.1128,osd.1152,osd.1154,osd.12,osd.1227,osd.1244,osd.328,osd.354,osd.381,osd.4]... have slow ops.

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 28m)
    mgr: ceph-25(active, since 30m), standbys: ceph-26, ceph-01, ceph-02, ceph-03
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1082 up (since 6m), 1258 in (since 18m); 266 remapped pgs
         flags nodown,noout,nobackfill,norecover

  data:
    pools:   14 pools, 25065 pgs
    objects: 1.91G objects, 3.4 PiB
    usage:   3.1 PiB used, 6.0 PiB / 9.0 PiB avail
    pgs:     0.626% pgs unknown
             31.566% pgs not active
             1187354920/16402772667 objects degraded (7.239%)
             51/16402772667 objects misplaced (0.000%)
             11706 active+clean
              4752 active+undersized+degraded
              3286 down
              2702 peering
               799 undersized+degraded+peered
               464 stale+down
               418 stale+active+undersized+degraded
               214 remapped+peering
               157 unknown
               128 stale+peering
               117 stale+remapped+peering
               101 stale+undersized+degraded+peered
                57 stale+active+undersized+degraded+remapped+backfilling
                35 down+remapped
                26 stale+undersized+degraded+remapped+backfilling+peered
                23 undersized+degraded+remapped+backfilling+peered
                14 active+clean+scrubbing+deep
                 9 stale+active+undersized+degraded+remapped+backfill_wait
                 7 active+recovering+undersized+degraded
                 7 stale+active+recovering+undersized+degraded
                 6 active+undersized+degraded+remapped+backfilling
                 6 active+undersized
                 5 active+undersized+degraded+remapped+backfill_wait
                 5 stale+remapped
                 4 stale+activating+undersized+degraded
                 3 active+undersized+remapped
                 3 stale+undersized+degraded+remapped+backfill_wait+peered
                 1 activating+undersized+degraded
                 1 activating+undersized+degraded+remapped
                 1 undersized+degraded+remapped+backfill_wait+peered
                 1 stale+active+clean
                 1 active+recovering
                 1 stale+down+remapped
                 1 undersized+peered
                 1 active+undersized+degraded+remapped
                 1 active+clean+scrubbing
                 1 active+clean+remapped
                 1 active+recovering+degraded

  io:
    client: 1.8 MiB/s rd, 18 MiB/s wr, 409 op/s rd, 796 op/s wr
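
In case it matters for your suggestions: once peering settles, my rough plan is
to re-enable recovery gradually and keep it throttled, something like this
(a sketch, assuming the usual config knobs):

    # throttle recovery/backfill before re-enabling it
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1
    # then unset the flags one at a time, watching load between steps
    ceph osd unset norecover
    ceph osd unset nobackfill
    ceph osd unset nodown
    ceph osd unset noout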
Thanks for any hints!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14