Hi,
It seems that my Ceph cluster is in an erroneous state, and I cannot
see right now how to get out of it.
The status is the following:
     health HEALTH_WARN
            25 pgs degraded
            1 pgs stale
            26 pgs stuck unclean
            25 pgs undersized
            recovery 23578/9450442 objects degraded (0.249%)
            recovery 45/9450442 objects misplaced (0.000%)
            crush map has legacy tunables (require bobtail, min is firefly)
     monmap e17: 3 mons at x
            election epoch 8550, quorum 0,1,2 store1,store3,store2
     osdmap e66602: 68 osds: 68 up, 68 in; 1 remapped pgs
            flags require_jewel_osds
      pgmap v31433805: 4388 pgs, 8 pools, 18329 GB data, 4614 kobjects
            36750 GB used, 61947 GB / 98697 GB avail
            23578/9450442 objects degraded (0.249%)
            45/9450442 objects misplaced (0.000%)
                4362 active+clean
                  24 active+undersized+degraded
                   1 stale+active+undersized+degraded+remapped
                   1 active+remapped
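For reference, this is roughly how I've been enumerating the stuck pgs and
mapping them to OSDs (apart from 6.245, mentioned below, the pg ids are just
examples):

    ceph pg dump_stuck unclean   # list pgs stuck unclean
    ceph pg dump_stuck stale     # and the stale one
    ceph pg map 6.245            # up/acting sets for a given pg
    ceph pg 6.245 query          # per-pg details (this is what times out)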
I tried restarting all OSDs, to no avail - it actually made things a bit
worse.
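I restarted them one at a time with noout set, roughly like this on our
systemd-based nodes (osd id 12 is just an example):

    ceph osd set noout
    systemctl restart ceph-osd@12   # repeated for each osd
    ceph osd unset noout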
From a user point of view the cluster works perfectly (apart from that
stale pg, which fortunately hit the pool on which I only keep swap
images).
A little background: I made the mistake of creating the cluster with
size=2 pools, which I'm now in the process of rectifying, but that
requires some fiddling around. I also tried moving to the more optimal
tunables (firefly), but the documentation is a bit optimistic with its
'up to 10%' data movement estimate - it was over 50% in my case, so I
reverted to bobtail immediately after I saw that number. I then started
reweighting the OSDs in anticipation of the size=3 bump, and I think
that's when this bug hit me.
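In case it matters, the size bump itself is just the usual per-pool setting
change, and before retrying the tunables switch I was planning to preview
the data movement offline with crushtool rather than on the live map.
Something like this (the pool name 'rbd', rule 0, and using
chooseleaf_vary_r=1 as the bobtail-to-firefly delta are my assumptions):

    ceph osd pool set rbd size 3      # repeated per pool
    ceph osd pool set rbd min_size 2
    # preview how the tunables change remaps pgs, offline
    ceph osd getcrushmap -o crushmap.bin
    crushtool -i crushmap.bin --test --show-mappings --rule 0 --num-rep 3 > before
    crushtool -i crushmap.bin --set-chooseleaf-vary-r 1 -o crushmap.firefly
    crushtool -i crushmap.firefly --test --show-mappings --rule 0 --num-rep 3 > after
    diff before after | wc -l         # rough count of remapped inputs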
Right now I have a pg (6.245) that cannot even be queried - the command
times out, or gives this output: https://atw.hu/~koszik/ceph/pg6.245
I queried a few other pgs that are acting up, but I cannot see anything
suspicious, other than the fact that they do not have a working peer:
https://atw.hu/~koszik/ceph/pg4.2ca
https://atw.hu/~koszik/ceph/pg4.2e4
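What I can still do for 6.245 is look up its acting set and poke the
primary's admin socket directly on that host, e.g. (osd.12 standing in for
whatever the primary turns out to be):

    ceph pg map 6.245                      # shows up/acting sets and the primary
    ceph daemon osd.12 status              # run on the primary's host
    ceph daemon osd.12 dump_ops_in_flight  # anything blocked on this osd?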
Health details can be found here: https://atw.hu/~koszik/ceph/health
OSD tree: https://atw.hu/~koszik/ceph/tree (here the weight sum of
ssd/store3_ssd seems to be off, but that has been the case for quite some
time - not sure if it's related to any of this)
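To double-check that weight sum I decompiled the crush map, which is easy
enough for anyone to re-verify:

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    grep -A 20 store3_ssd crushmap.txt   # compare item weights with the parent's entry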
I tried setting debugging to 20/20 on some of the affected osds, but there
was nothing there that gave me any ideas on solving this. How should I
continue debugging this issue?
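For completeness, this is how I raised the debug levels, via injectargs on
the running daemons (osd.12 again just an example):

    ceph tell osd.12 injectargs '--debug-osd 20/20 --debug-ms 1/1'
    # and back to the defaults afterwards:
    ceph tell osd.12 injectargs '--debug-osd 0/5 --debug-ms 0/5'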
BTW, I'm running 10.2.5 (jewel) on all of my osd/mon nodes.
Thanks,
Matyas