We have upgraded from Hammer to Jewel, and then to Luminous 12.2.2 as of
today. During the Hammer-to-Jewel upgrade we lost two host servers and let
the cluster rebalance/recover; it ran out of space and stalled. We then
added three new host servers and again let the cluster rebalance/recover.

At some point during that process we ended up with 4 pgs that cannot be
repaired using "ceph pg repair xx.xx". I ran "ceph pg 11.720 query" against
one of them, and from what I can tell the info on the remaining copies
matches, but the pg is blocked from being marked clean.
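
For reference, this is how I have been poking at the stuck pgs. The jq
paths are my reading of the 12.2.2 query JSON, so please correct me if I am
checking the wrong fields:

# current peering state and anything it reports as blocking
ceph pg 11.720 query | jq '.recovery_state'

# compare the primary's object count against each peer's
ceph pg 11.720 query | jq '.info.stats.stat_sum.num_objects,
    .peer_info[].stats.stat_sum.num_objects'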

I keep seeing references to the ceph-objectstore-tool export/import method,
but I cannot find a step-by-step procedure for our particular predicament
(my best reconstruction is below). It may also be acceptable for us to
simply lose the data, if it can't be extracted, so that we can at least
return the cluster to a healthy state. Any thoughts?
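
Here is what I have pieced together from list archives so far. This is
untested; the pgid and OSD ids are from one of our stuck pgs (acting
[21,10]), and the paths assume the default FileStore layout left over from
our Hammer days:

# stop the OSD that I believe holds the most complete copy of the pg
systemctl stop ceph-osd@21

# export that copy to a file
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 \
    --journal-path /var/lib/ceph/osd/ceph-21/journal \
    --pgid 11.720 --op export --file /root/pg.11.720.export

# stop the peer, remove its copy, and import the export in its place
systemctl stop ceph-osd@10
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10 \
    --journal-path /var/lib/ceph/osd/ceph-10/journal \
    --pgid 11.720 --op remove
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-10 \
    --journal-path /var/lib/ceph/osd/ceph-10/journal \
    --pgid 11.720 --op import --file /root/pg.11.720.export

# restart both OSDs and let peering retry
systemctl start ceph-osd@21
systemctl start ceph-osd@10

I have also seen "--op mark-complete" mentioned for exactly this
incomplete-but-data-looks-intact situation, but I am not sure which OSD it
should be run against. And if we do write the data off, am I right that
"ceph pg force_create_pg 11.720" is the way to recreate an empty pg on this
release?
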
ceph -s output:
cluster:
health: HEALTH_ERR
Reduced data availability: 4 pgs inactive, 4 pgs incomplete
Degraded data redundancy: 4 pgs unclean
4 stuck requests are blocked > 4096 sec
too many PGs per OSD (2549 > max 200)
services:
mon: 3 daemons, quorum ukpixmon1,ukpixmon2,ukpixmon3
mgr: ukpixmon1(active), standbys: ukpixmon3, ukpixmon2
osd: 43 osds: 43 up, 43 in
rgw: 3 daemons active
data:
pools: 12 pools, 37904 pgs
objects: 8148k objects, 10486 GB
usage: 21530 GB used, 135 TB / 156 TB avail
pgs: 0.011% pgs not active
37900 active+clean
4 incomplete
ceph osd tree output:
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 156.10268 root default
-2 32.57996 host osdhost1
0 3.62000 osd.0 up 1.00000 1.00000
1 3.62000 osd.1 up 1.00000 1.00000
2 3.62000 osd.2 up 1.00000 1.00000
3 3.62000 osd.3 up 1.00000 1.00000
4 3.62000 osd.4 up 1.00000 1.00000
5 3.62000 osd.5 up 1.00000 1.00000
6 3.62000 osd.6 up 1.00000 1.00000
7 3.62000 osd.7 up 1.00000 1.00000
8 3.62000 osd.8 up 1.00000 1.00000
-3 25.33997 host osdhost2
9 3.62000 osd.9 up 1.00000 1.00000
10 3.62000 osd.10 up 1.00000 1.00000
11 3.62000 osd.11 up 1.00000 1.00000
12 3.62000 osd.12 up 1.00000 1.00000
15 3.62000 osd.15 up 1.00000 1.00000
16 3.62000 osd.16 up 1.00000 1.00000
17 3.62000 osd.17 up 1.00000 1.00000
-8 32.72758 host osdhost6
14 3.63640 osd.14 up 1.00000 1.00000
21 3.63640 osd.21 up 1.00000 1.00000
23 3.63640 osd.23 up 1.00000 1.00000
26 3.63640 osd.26 up 1.00000 1.00000
32 3.63640 osd.32 up 1.00000 1.00000
33 3.63640 osd.33 up 1.00000 1.00000
34 3.63640 osd.34 up 1.00000 1.00000
35 3.63640 osd.35 up 1.00000 1.00000
36 3.63640 osd.36 up 1.00000 1.00000
-9 32.72758 host osdhost7
19 3.63640 osd.19 up 1.00000 1.00000
37 3.63640 osd.37 up 1.00000 1.00000
38 3.63640 osd.38 up 1.00000 1.00000
39 3.63640 osd.39 up 1.00000 1.00000
40 3.63640 osd.40 up 1.00000 1.00000
41 3.63640 osd.41 up 1.00000 1.00000
42 3.63640 osd.42 up 1.00000 1.00000
43 3.63640 osd.43 up 1.00000 1.00000
44 3.63640 osd.44 up 1.00000 1.00000
-7 32.72758 host osdhost8
20 3.63640 osd.20 up 1.00000 1.00000
45 3.63640 osd.45 up 1.00000 1.00000
46 3.63640 osd.46 up 1.00000 1.00000
47 3.63640 osd.47 up 1.00000 1.00000
48 3.63640 osd.48 up 1.00000 1.00000
49 3.63640 osd.49 up 1.00000 1.00000
50 3.63640 osd.50 up 1.00000 1.00000
51 3.63640 osd.51 up 1.00000 1.00000
52 3.63640 osd.52 up 1.00000 1.00000
ceph health detail output:
HEALTH_ERR Reduced data availability: 4 pgs inactive, 4 pgs incomplete;
Degraded data redundancy: 4 pgs unclean; 4 stuck requests are blocked > 4096
sec; too many PGs per OSD (2549 > max 200)
PG_AVAILABILITY Reduced data availability: 4 pgs inactive, 4 pgs incomplete
pg 11.720 is incomplete, acting [21,10]
pg 11.9ab is incomplete, acting [14,2]
pg 11.9fb is incomplete, acting [32,43]
pg 11.c13 is incomplete, acting [42,26]
PG_DEGRADED Degraded data redundancy: 4 pgs unclean
pg 11.720 is stuck unclean since forever, current state incomplete, last
acting [21,10]
pg 11.9ab is stuck unclean since forever, current state incomplete, last
acting [14,2]
pg 11.9fb is stuck unclean since forever, current state incomplete, last
acting [32,43]
pg 11.c13 is stuck unclean since forever, current state incomplete, last
acting [42,26]
REQUEST_STUCK 4 stuck requests are blocked > 4096 sec
4 ops are blocked > 33554.4 sec
osds 21,26,32,42 have stuck requests > 33554.4 sec
TOO_MANY_PGS too many PGs per OSD (2549 > max 200)
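
(Separately: I assume the TOO_MANY_PGS error is just fallout from our old
pool layout, since pg_num cannot be reduced in this release. Unless there
is a better option, my plan is to raise the limit to quiet it, something
like:

ceph tell mon.* injectargs '--mon_max_pg_per_osd 3000'

though I understand the real fix is migrating data into pools with saner
pg counts.)
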
-Brent