For each of those pgs, you'll need to identify the pg copy you want to be the winner and then either:

1) Remove all of the other copies using ceph-objectstore-tool, and hopefully the winner you left alone will allow the pg to recover and go active.

2) Export the winner using ceph-objectstore-tool, use ceph-objectstore-tool to delete *all* copies of the pg, use force_create_pg to recreate the pg empty, and then use ceph-objectstore-tool to import the exported pg copy back in.
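
A rough sketch of option 2 (not a tested recipe; pgid, osd ids, and paths are placeholders, and the osd must be stopped before running ceph-objectstore-tool against it):

    # export the copy you want to keep, from the osd that holds it
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
        --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
        --pgid <pgid> --op export --file /root/<pgid>.export

    # remove *every* copy of the pg, on every osd that has one
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
        --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
        --pgid <pgid> --op remove

    # recreate the pg empty, then import the exported copy back in
    ceph pg force_create_pg <pgid>
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
        --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
        --op import --file /root/<pgid>.export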

Also, the pgs which are still down have replicas which need to be brought back or marked lost.
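
To see which osds a down pg is still waiting on, the pg query output is the place to look; if memory serves, the recovery_state section lists them under down_osds_we_would_probe (pgid is a placeholder):

    ceph pg <pgid> query | grep -A10 down_osds_we_would_probe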
-Sam

On 03/11/2015 07:29 AM, joel.merr...@gmail.com wrote:
I'd like to not have to null them if possible; there's nothing
outlandishly valuable, it's more the time to reprovision (users have
stuff on there, mainly testing, but I have a nasty feeling some users
won't have backed up their test instances). When you say complicated
and fragile, could you expand?

Thanks again!
Joel

On Wed, Mar 11, 2015 at 1:21 PM, Samuel Just <sj...@redhat.com> wrote:
Ok, you lost all copies from an interval where the pgs went active. The
recovery from this is going to be complicated and fragile.  Are the pools
valuable?
-Sam


On 03/11/2015 03:35 AM, joel.merr...@gmail.com wrote:
For clarity too, I've tried dropping the min_size before as suggested;
unfortunately it doesn't make a difference.
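
(Presumably the usual pool-level setting, something like the following, per affected pool, with <pool> as a placeholder:

    ceph osd pool set <pool> min_size 1
)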

On Wed, Mar 11, 2015 at 9:50 AM, joel.merr...@gmail.com
<joel.merr...@gmail.com> wrote:
Sure thing. N.b. I increased the pg count to see if it would help. Alas not. :)

Thanks again!

health_detail
https://gist.github.com/199bab6d3a9fe30fbcae

osd_dump
https://gist.github.com/499178c542fa08cc33bb

osd_tree
https://gist.github.com/02b62b2501cbd684f9b2

Randomly selected queries:
queries/0.19.query
https://gist.github.com/f45fea7c85d6e665edf8
queries/1.a1.query
https://gist.github.com/dd68fbd5e862f94eb3be
queries/7.100.query
https://gist.github.com/d4fd1fb030c6f2b5e678
queries/7.467.query
https://gist.github.com/05dbcdc9ee089bd52d0c

On Tue, Mar 10, 2015 at 2:49 PM, Samuel Just <sj...@redhat.com> wrote:
Yeah, get a ceph pg query on one of the stuck ones.
-Sam
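
For example, for one of the pgs named in health detail:

    ceph pg 0.19 query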

On Tue, 2015-03-10 at 14:41 +0000, joel.merr...@gmail.com wrote:
Stuck unclean and stuck inactive. I can fire up a full query and
health dump somewhere useful if you want (full pg query info on the ones
listed in health detail, plus tree, osd dump, etc.). There were blocked_by
entries that no longer exist after the OSD addition.

Side note: I spent some time yesterday writing some bash to do this
programmatically (might be useful to others, will throw it on github).
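
A minimal sketch of that kind of loop (assuming the stuck pg ids are parsed out of 'ceph health detail'; the awk pattern is a guess at the exact output format):

    mkdir -p queries
    for pg in $(ceph health detail | awk '/^pg .* is stuck/ {print $2}'); do
        ceph pg "$pg" query > "queries/${pg}.query"
    done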

On Tue, Mar 10, 2015 at 1:41 PM, Samuel Just <sj...@redhat.com> wrote:
What do you mean by "unblocked" but still "stuck"?
-Sam

On Mon, 2015-03-09 at 22:54 +0000, joel.merr...@gmail.com wrote:
On Mon, Mar 9, 2015 at 2:28 PM, Samuel Just <sj...@redhat.com> wrote:
You'll probably have to recreate osds with the same ids (empty ones),
let them boot, stop them, and mark them lost.  There is a feature in the
tracker to improve this behavior: http://tracker.ceph.com/issues/10976
-Sam
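
The boot/stop/mark-lost part of that is roughly the following per osd id (osd id is a placeholder; the start/stop commands vary by distro and init system):

    start ceph-osd id=<id>       # let it boot and register
    stop ceph-osd id=<id>
    ceph osd lost <id> --yes-i-really-mean-it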
Thanks Sam, I've re-added the OSDs and they became unblocked, but there
are still the same number of pgs stuck. I looked at them in some more
detail and it seems they all have num_bytes='0'. Tried a repair too,
for good measure. Still nothing, I'm afraid.
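
(Repair here presumably meaning the usual per-pg

    ceph pg repair <pgid>
)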

Does this mean some underlying catastrophe has happened and they are
never going to recover? Following on, would that cause data loss?
There are no missing objects and I'm hoping there's appropriate
checksumming / replicas to balance that out, but now I'm not so sure.

Thanks again,
Joel



--
$ echo "kpfmAdpoofdufevq/dp/vl" | perl -pe 's/(.)/chr(ord($1)-1)/ge'
