Hello, Ceph users,
I wanted to install a recent kernel update on my OSD hosts, which run
CentOS 7 with Ceph 13.2.5 Mimic. So I set the noout flag and ran
"yum -y update" on the first OSD host. This host has 8 BlueStore OSDs
with data on HDDs and the databases on LVs of two SSDs (each SSD carries
4 LVs for OSD metadata).
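For reference, the sequence I used was essentially the following (a rough
sketch of the commands, from memory):

    # set the flag cluster-wide so OSDs are not marked out during the reboot
    ceph osd set noout
    # then, on the first OSD host:
    yum -y update
    reboot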
Everything went OK, so I rebooted this host. After the OSD host
came back online, the cluster went from HEALTH_WARN (noout flag set)
to HEALTH_ERR and started to rebalance itself, with reportedly almost 60%
of the objects misplaced and some of them degraded. And, of course, some
PGs are backfill_toofull:
  cluster:
    health: HEALTH_ERR
            2300616/3975384 objects misplaced (57.872%)
            Degraded data redundancy: 74263/3975384 objects degraded (1.868%), 146 pgs degraded, 122 pgs undersized
            Degraded data redundancy (low space): 44 pgs backfill_toofull

  services:
    mon: 3 daemons, quorum stratus1,stratus2,stratus3
    mgr: stratus3(active), standbys: stratus1, stratus2
    osd: 44 osds: 44 up, 44 in; 2022 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   9 pools, 3360 pgs
    objects: 1.33 M objects, 5.0 TiB
    usage:   25 TiB used, 465 TiB / 490 TiB avail
    pgs:     74263/3975384 objects degraded (1.868%)
             2300616/3975384 objects misplaced (57.872%)
             1739 active+remapped+backfill_wait
             1329 active+clean
             102  active+recovery_wait+remapped
             76   active+undersized+degraded+remapped+backfill_wait
             31   active+remapped+backfill_wait+backfill_toofull
             30   active+recovery_wait+undersized+degraded+remapped
             21   active+recovery_wait+degraded+remapped
             8    active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             6    active+recovery_wait+degraded
             4    active+remapped+backfill_toofull
             3    active+recovery_wait+undersized+degraded
             3    active+remapped+backfilling
             2    active+recovery_wait
             2    active+recovering+undersized
             1    active+clean+remapped
             1    active+undersized+degraded+remapped+backfill_toofull
             1    active+undersized+degraded+remapped+backfilling
             1    active+recovering+undersized+remapped

  io:
    client:   681 B/s rd, 1013 KiB/s wr, 0 op/s rd, 32 op/s wr
    recovery: 142 MiB/s, 93 objects/s
(Note that I cleared the noout flag afterwards.) What is wrong here?
Why did the cluster decide to rebalance itself?
I am holding off on rebooting the rest of the OSD hosts for now.
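If it helps with diagnosing this, I can also post the output of the
following (assuming these are the right things to look at):

    ceph health detail
    ceph osd tree
    ceph osd df tree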
Thanks,
-Yenya
--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
sir_clive> I hope you don't mind if I steal some of your ideas?
laryross> As far as stealing... we call it sharing here. --from rcgroups