I'm presuming this is the correct list (rather than the -devel list); please
correct me if I'm wrong there.

I set up Ceph (0.56.4) a few months ago with two disk servers and one
dedicated monitor host. The disk servers also run monitors, so there are
three monitors in total for the cluster. Each disk server has 8 OSDs.
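For reference, the ceph.conf is roughly the following shape; the dedicated
monitor's hostname and the addresses below are placeholders rather than the
real values:

[mon.a]
        host = mon1               ; the dedicated monitor (placeholder name)
        mon addr = 192.0.2.1:6789 ; placeholder address
[mon.b]
        host = leviathan
        mon addr = 192.0.2.2:6789
[mon.c]
        host = minotaur
        mon addr = 192.0.2.3:6789
[osd.100]
        host = leviathan
        ; ... osd.101 through osd.107 likewise, all host = leviathan
[osd.200]
        host = minotaur
        ; ... osd.201 through osd.207 likewise, all host = minotaur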

I didn't actually capture the 'ceph osd tree' output back then, but cutting
and pasting the relevant parts from what I have now, it probably looked like
this:

# id weight type name up/down reweight
-1 16 root default
-3 16 rack unknownrack
-2 0 host leviathan
100 1 osd.100 up 1
101 1 osd.101 up 1
102 1 osd.102 up 1
103 1 osd.103 up 1
104 1 osd.104 up 1
105 1 osd.105 up 1
106 1 osd.106 up 1
107 1 osd.107 up 1
-4 8 host minotaur
200 1 osd.200 up 1
201 1 osd.201 up 1
202 1 osd.202 up 1
203 1 osd.203 up 1
204 1 osd.204 up 1
205 1 osd.205 up 1
206 1 osd.206 up 1
207 1 osd.207 up 1

A couple of weeks ago, for valid reasons that aren't relevant here, we
decided to repurpose one of the disk servers (leviathan) and replace it in
the Ceph cluster with some other hardware. I created a new server (aergia).
That changed the 'ceph osd tree' to this:

# id weight type name up/down reweight
-1 16 root default
-3 16 rack unknownrack
-2 0 host leviathan
100 1 osd.100 up 1
101 1 osd.101 up 1
102 1 osd.102 up 1
103 1 osd.103 up 1
104 1 osd.104 up 1
105 1 osd.105 up 1
106 1 osd.106 up 1
107 1 osd.107 up 1
-4 8 host minotaur
200 1 osd.200 up 1
201 1 osd.201 up 1
202 1 osd.202 up 1
203 1 osd.203 up 1
204 1 osd.204 up 1
205 1 osd.205 up 1
206 1 osd.206 up 1
207 1 osd.207 up 1
0 1 osd.0 up 1
1 1 osd.1 up 1
2 1 osd.2 up 1
3 1 osd.3 up 1
4 1 osd.4 up 1
5 1 osd.5 up 1
6 1 osd.6 up 1
7 1 osd.7 up 1
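
For completeness, the OSDs on aergia were added with more or less the usual
sequence; I'm writing this from memory, so the exact crush arguments may not
be quite right:

    ceph osd create
    ceph-osd -i 0 --mkfs --mkkey
    ceph auth add osd.0 osd 'allow *' mon 'allow rwx' \
        -i /var/lib/ceph/osd/ceph-0/keyring
    ceph osd crush set 0 osd.0 1.0 root=default rack=unknownrack
    service ceph start osd.0

...and the same for osd.1 through osd.7.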

Everything was looking happy, so I began removing the OSDs on leviathan.
That's when the problems started. 'ceph health detail' shows that there are
several PGs that either existed only on that disk server, e.g.

pg 0.312 is stuck unclean since forever, current state
stale+active+degraded+remapped, last acting [103]

or PGs whose replicas both ended up on that same host, e.g.

pg 0.2f4 is stuck unclean since forever, current state
stale+active+remapped, last acting [106,101]
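
The per-OSD removal steps I was following were roughly the standard ones
(again, from memory):

    ceph osd out 103
    service ceph stop osd.103      # on leviathan itself
    ceph osd crush remove osd.103
    ceph auth del osd.103
    ceph osd rm 103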

I brought leviathan back up, and I *think* everything is at least
responding now. But 'ceph health' still shows
HEALTH_WARN 302 pgs degraded; 810 pgs stale; 810 pgs stuck stale; 3562 pgs
stuck unclean; recovery 44951/2289634 degraded (1.963%)
...and it's been stuck there for a long time.
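
I'm happy to post more detail if it helps, e.g. the output of

    ceph pg dump_stuck stale
    ceph pg dump_stuck unclean
    ceph pg 0.312 query

or the decompiled crush map (ceph osd getcrushmap -o /tmp/cm && crushtool -d
/tmp/cm -o /tmp/cm.txt).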

So my question is: how do I safely force the data off the
to-be-decommissioned server and get back to "HEALTH_OK"?
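
My best guess is to drain the leviathan OSDs one at a time by dropping their
crush weights, e.g.

    ceph osd crush reweight osd.100 0

and then wait for everything to go active+clean before touching the next one,
only removing the OSDs for good at the end. But given the state above I'd
rather check here first before I make things worse.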