Update:
Restarting the other osds on the server that we took the osds from seems to
have reduced the number of unknown pgs to 170 (a short sketch of the restart
commands follows the status output below). However, pgs on these osds seem
to stay in the peering and activating states for a very long time:
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_ERR
            noscrub flag(s) set
            1 nearfull osd(s)
            1 pool(s) nearfull
            Reduced data availability: 1043 pgs inactive, 745 pgs peering
            Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull
            Degraded data redundancy: 2384069/105076299 objects degraded (2.269%), 110 pgs degraded, 156 pgs undersized
            4970 slow requests are blocked > 32 sec
            41 stuck requests are blocked > 4096 sec
            2909 slow ops, oldest one blocked for 2266 sec, daemons [osd.0,osd.1,osd.11,osd.12,osd.14,osd.15,osd.16,osd.18,osd.19,osd.2]... have slow ops.

  services:
    mon: 3 daemons, quorum black1,black2,black3 (age 2h)
    mgr: black2(active, since 2h), standbys: black1, black3
    osd: 85 osds: 85 up, 82 in; 90 remapped pgs
         flags noscrub
    rgw: 1 daemon active (admin)

  data:
    pools:   12 pools, 3000 pgs
    objects: 35.03M objects, 133 TiB
    usage:   401 TiB used, 165 TiB / 566 TiB avail
    pgs:     5.667% pgs unknown
             29.200% pgs not active
             2384069/105076299 objects degraded (2.269%)
             330786/105076299 objects misplaced (0.315%)
             1776 active+clean
             745  peering
             170  unknown
             79   active+remapped+backfill_wait
             68   active+undersized+degraded
             53   activating+undersized
             43   activating
             30   activating+undersized+degraded
             10   active+remapped+backfilling
             6    active+recovery_wait+degraded
             5    activating+degraded
             5    active+recovery_wait
             4    active+undersized
             4    active+clean+scrubbing+deep
             1    active+recovery_wait+undersized+degraded
             1    active+remapped+backfill_toofull

  io:
    client:   44 MiB/s rd, 4.2 MiB/s wr, 991 op/s rd, 389 op/s wr
    recovery: 71 MiB/s, 18 objects/s
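
For reference, the restarts mentioned above were plain systemd restarts of
the osds that stayed on the source host. A minimal sketch of the procedure,
assuming systemd-managed osds; the osd ids below are placeholders for the
actual ids on that host:

  ceph osd set noout                    # don't mark osds out during the restarts
  for id in 0 1 2 3; do
      systemctl restart ceph-osd@${id}  # restart one osd daemon at a time
      sleep 60                          # let it rejoin and start peering
  done
  ceph osd unset noout

  # watch the inactive states drain
  watch -n 10 'ceph -s | grep -E "peering|activating|unknown"'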
Nico Schottelius <[email protected]> writes:
> Hello,
>
> after moving 4 ssds to another host (plus the "ceph tell" hanging issue;
> see the previous mail), we ran into 241 unknown pgs:
>
>   cluster:
>     id:     1ccd84f6-e362-4c50-9ffe-59436745e445
>     health: HEALTH_WARN
>             noscrub flag(s) set
>             2 nearfull osd(s)
>             1 pool(s) nearfull
>             Reduced data availability: 241 pgs inactive
>             1532 slow requests are blocked > 32 sec
>             789 slow ops, oldest one blocked for 1949 sec, daemons [osd.12,osd.14,osd.2,osd.20,osd.23,osd.25,osd.3,osd.33,osd.35,osd.50]... have slow ops.
>
>   services:
>     mon: 3 daemons, quorum black1,black2,black3 (age 97m)
>     mgr: black2(active, since 96m), standbys: black1, black3
>     osd: 85 osds: 85 up, 82 in; 118 remapped pgs
>          flags noscrub
>     rgw: 1 daemon active (admin)
>
>   data:
>     pools:   12 pools, 3000 pgs
>     objects: 33.96M objects, 129 TiB
>     usage:   388 TiB used, 159 TiB / 548 TiB avail
>     pgs:     8.033% pgs unknown
>              409151/101874117 objects misplaced (0.402%)
>              2634 active+clean
>              241  unknown
>              107  active+remapped+backfill_wait
>              11   active+remapped+backfilling
>              7    active+clean+scrubbing+deep
>
>   io:
>     client:   91 MiB/s rd, 28 MiB/s wr, 1.76k op/s rd, 686 op/s wr
>     recovery: 67 MiB/s, 17 objects/s
>
> This used to be around 700+ unknown pgs; however, these 241 have now been
> stuck in this state for more than an hour. Below is a sample of pgs from
> "ceph pg dump all | grep unknown":
>
>
> 2.7f7   0 0 0 0 0 0 0 0 0 0   unknown   2020-09-22 19:03:00.694873   0'0   0:0   []   -1   []   -1   0'0   2020-09-22 19:03:00.694873   0'0   2020-09-22 19:03:00.694873   0
> 2.7c7   0 0 0 0 0 0 0 0 0 0   unknown   2020-09-22 19:03:00.694873   0'0   0:0   []   -1   []   -1   0'0   2020-09-22 19:03:00.694873   0'0   2020-09-22 19:03:00.694873   0
> 2.7c2   0 0 0 0 0 0 0 0 0 0   unknown   2020-09-22 19:03:00.694873   0'0   0:0   []   -1   []   -1   0'0   2020-09-22 19:03:00.694873   0'0   2020-09-22 19:03:00.694873   0
> 2.7ab   0 0 0 0 0 0 0 0 0 0   unknown   2020-09-22 19:03:00.694873   0'0   0:0   []   -1   []   -1   0'0   2020-09-22 19:03:00.694873   0'0   2020-09-22 19:03:00.694873   0
> 2.78b   0 0 0 0 0 0 0 0 0 0   unknown   2020-09-22 19:03:00.694873   0'0   0:0   []   -1   []   -1   0'0   2020-09-22 19:03:00.694873   0'0   2020-09-22 19:03:00.694873   0
> 2.788   0 0 0 0 0 0 0 0 0 0   unknown   2020-09-22 19:03:00.694873   0'0   0:0   []   -1   []   -1   0'0   2020-09-22 19:03:00.694873   0'0   2020-09-22 19:03:00.694873   0
> 2.76e   0
>
> Running "ceph pg 2.7f7 query" hangs.
>
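Update note: "ceph pg <pgid> query" is answered by the pg's primary osd, so
it hangs while the acting set is empty. The mon/mgr side can still be
inspected; a minimal sketch, using 2.7f7 as the example pg:

  ceph pg map 2.7f7            # osdmap view from the mon: up and acting sets
  ceph pg dump_stuck inactive  # pgs stuck inactive, reported by the mgr
  ceph osd blocked-by          # osds that peers are waiting on to finish peering
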
> We checked and one server did have an incorrect MTU setting (9204
> instead of the correct 9000), but that was fixed some hours ago.
>
> Does anyone have a hint on how to track down those unknown pgs?
>
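Update note: since "unknown" only means the active mgr has not received
current stats for a pg, failing over the mgr is a cheap thing to try before
anything invasive; a sketch using the mgr names from the status output above:

  ceph mgr fail black2   # hand the active role to a standby (black1 or black3)
  ceph -s                # then re-check whether the unknown count drops
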
> Version-wise, this is 14.2.9:
>
> [20:42:20] black2.place6:~# ceph versions
> {
>     "mon": {
>         "ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)": 3
>     },
>     "mgr": {
>         "ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)": 3
>     },
>     "osd": {
>         "ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)": 85
>     },
>     "mds": {},
>     "rgw": {
>         "ceph version 20200428-923-g4004f081ec (4004f081ec047d60e84d76c2dad6f31e2ac44484) nautilus (stable)": 1
>     },
>     "overall": {
>         "ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)": 91,
>         "ceph version 20200428-923-g4004f081ec (4004f081ec047d60e84d76c2dad6f31e2ac44484) nautilus (stable)": 1
>     }
> }
>
> From ceph health detail:
>
> [20:42:58] black2.place6:~# ceph health detail
> HEALTH_WARN noscrub flag(s) set; 2 nearfull osd(s); 1 pool(s) nearfull; Reduced data availability: 241 pgs inactive; 1575 slow requests are blocked > 32 sec; 751 slow ops, oldest one blocked for 1986 sec, daemons [osd.12,osd.14,osd.2,osd.20,osd.23,osd.25,osd.3,osd.31,osd.33,osd.35]... have slow ops.
> OSDMAP_FLAGS noscrub flag(s) set
> OSD_NEARFULL 2 nearfull osd(s)
>     osd.36 is near full
>     osd.54 is near full
> POOL_NEARFULL 1 pool(s) nearfull
>     pool 'ssd' is nearfull
> PG_AVAILABILITY Reduced data availability: 241 pgs inactive
>     pg 2.82 is stuck inactive for 6027.042489, current state unknown, last acting []
>     pg 2.88 is stuck inactive for 6027.042489, current state unknown, last acting []
>     ...
>     pg 19.6e is stuck inactive for 6027.042489, current state unknown, last acting []
>     pg 20.69 is stuck inactive for 6027.042489, current state unknown, last acting []
>
>
> As can be seen, multiple pools are affected (the pool id is the number
> before the dot in each pg id), even though most of the missing pgs are
> from pool 2.
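
Update note: to map those pool ids (2, 19 and 20 above) to pool names, the
pool listing can be consulted; a minimal sketch:

  ceph osd pool ls detail | grep -E '^pool (2|19|20) '  # pool id -> name, size, flags
  ceph pg ls-by-pool ssd | head                         # sample pgs of one pool, by name
                                                        # ('ssd' is the nearfull pool above)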
>
> Best regards,
>
> Nico
--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]