Howdy — I’ve had a failure on a small Dumpling (0.67.4) cluster running on Ubuntu 13.10 machines. I had three OSD nodes (running 6 OSDs each) and lost two of them in a beautiful failure. One of the nodes even went so far as to scramble the XFS filesystems of my OSD disks (I suspect it has some bad DIMMs).
Anyway, the thing is: I’m okay with losing the data. This was a test setup, and I want to take the opportunity to learn from the recovery process. I’m now stuck in ‘HEALTH_ERR’ and want to get back to ‘HEALTH_OK’ without just reinitializing the cluster. My OSD map seems correct, and I’ve run scrubs (deep and normal) at the PG and OSD levels. ‘ceph -s’ shows that I still have 47 unfound objects after telling Ceph to ‘mark_unfound_lost’. The 47 affected PGs tell me they "haven't probed all sources, not marking lost”. Two days have passed at this point, and I’d just like to get my cluster working again and deal with the object loss (which seems limited to a single pool). How do I move forward from here, if at all? Do I ‘force_create_pg’ the PGs containing my unfound objects?

> # ceph health detail | grep "unfound" | grep "^pg"
> pg 4.ffe is active+recovering, acting [7,26], 3 unfound
> pg 4.feb is active+recovering, acting [10,23], 1 unfound
> pg 4.fa6 is active+recovery_wait, acting [11,25], 2 unfound
> pg 4.f61 is active+recovering, acting [9,26], 1 unfound
> pg 4.f2d is active+recovering, acting [8,22], 1 unfound
> pg 4.ef5 is active+recovering, acting [6,22], 1 unfound
> pg 4.e9c is active+recovering, acting [7,24], 1 unfound
> pg 4.e12 is active+recovering, acting [7,22], 1 unfound
> pg 4.e0e is active+recovering, acting [9,24], 1 unfound
> pg 4.ddc is active+recovering, acting [10,26], 1 unfound
> pg 4.d95 is active+recovering, acting [10,25], 1 unfound
> pg 4.ccf is active+recovering, acting [10,24], 1 unfound
> pg 4.c84 is active+recovering, acting [6,22], 2 unfound
> pg 4.c4e is active+recovering, acting [10,23], 1 unfound
> pg 4.bca is active+recovering, acting [6,26], 1 unfound
> pg 4.bbf is active+recovering, acting [8,26], 1 unfound
> pg 4.b5e is active+recovering, acting [6,26], 1 unfound
> pg 4.ae1 is active+recovering, acting [8,26], 1 unfound
> pg 4.a9c is active+recovering, acting [7,23], 1 unfound
> pg 4.a39 is active+recovering, acting [10,24], 1 unfound
> pg 4.85f is active+recovering, acting [9,25], 1 unfound
> pg 4.83b is active+recovering, acting [10,25], 1 unfound
> pg 4.7b8 is active+recovering, acting [7,26], 1 unfound
> pg 4.758 is active+recovering, acting [8,23], 1 unfound
> pg 4.740 is active+recovery_wait, acting [11,21], 1 unfound
> pg 4.6f6 is active+recovering, acting [10,26], 2 unfound
> pg 4.68d is active+recovering, acting [8,24], 2 unfound
> pg 4.635 is active+recovery_wait, acting [11,22], 1 unfound
> pg 4.60f is active+recovering, acting [6,22], 1 unfound
> pg 4.603 is active+recovering, acting [9,24], 1 unfound
> pg 4.5e1 is active+recovering, acting [10,25], 1 unfound
> pg 4.579 is active+recovering, acting [9,21], 1 unfound
> pg 4.56a is active+recovering, acting [7,24], 1 unfound
> pg 4.519 is active+recovering, acting [9,21], 1 unfound
> pg 4.435 is active+recovering, acting [6,26], 1 unfound
> pg 4.42e is active+recovering, acting [7,22], 1 unfound
> pg 4.30b is active+recovering, acting [7,25], 2 unfound
> pg 4.1bb is active+recovery_wait, acting [11,22], 1 unfound
> pg 4.178 is active+recovering, acting [8,23], 1 unfound
> pg 4.43 is active+recovering, acting [9,23], 1 unfound

Any help would be greatly appreciated.

./JRH
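In case it helps anyone following along: a rough sketch of how the unfound PG ids can be pulled out of the ‘ceph health detail’ output above, so they can be fed one at a time to ‘ceph pg <pgid> query’ (to inspect which sources haven't been probed) or to ‘ceph pg <pgid> mark_unfound_lost’. This assumes the exact output format shown above and uses a captured sample rather than a live cluster:

```shell
# Rough sketch: extract the PG id (field 2) from each "pg ... unfound" line.
# Sample captured from 'ceph health detail' output; on a live cluster one
# would pipe 'ceph health detail' directly into the awk command instead.
sample='pg 4.ffe is active+recovering, acting [7,26], 3 unfound
pg 4.feb is active+recovering, acting [10,23], 1 unfound
pg 4.fa6 is active+recovery_wait, acting [11,25], 2 unfound'

# Keep only lines that start with "pg" and mention unfound objects,
# then print the second whitespace-separated field (the PG id).
printf '%s\n' "$sample" | awk '/^pg/ && /unfound/ {print $2}'
```

Each id printed could then be used in a loop, e.g. ‘for pg in $(…); do ceph pg "$pg" query; done’, to check each PG's state before deciding whether marking objects lost is safe.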
