Howdy — I’ve had a failure on a small Dumpling (0.67.4) cluster running on Ubuntu 13.10 machines. I had three OSD nodes (running 6 OSDs each) and lost two of them in a beautiful failure. One of the nodes even went so far as to scramble the XFS filesystems of my OSD disks (I suspect it has some bad DIMMs).
Anyway, the thing is: I’m okay with losing the data. This was a test setup, and I want to take the opportunity to learn from the recovery process. I’m now stuck in ‘HEALTH_ERR’ and want to get back to ‘HEALTH_OK’ without just reinitializing the cluster. My OSD map seems correct, and I’ve run scrubs (deep and normal) at the PG and OSD levels. ‘ceph -s’ shows that I still have 47 unfound objects after telling Ceph to ‘mark_unfound_lost’. The 47 affected PGs tell me they "haven't probed all sources, not marking lost”. Two days have passed at this point, and I’d just like to get my cluster working again and deal with the object loss (which seems limited to a single pool). How do I move forward from here, if at all? Do I ‘force_create_pg’ the PGs containing my unfound objects?

> # ceph health detail | grep "unfound" | grep "^pg"
> pg 4.ffe is active+recovering, acting [7,26], 3 unfound
> pg 4.feb is active+recovering, acting [10,23], 1 unfound
> pg 4.fa6 is active+recovery_wait, acting [11,25], 2 unfound
> pg 4.f61 is active+recovering, acting [9,26], 1 unfound
> pg 4.f2d is active+recovering, acting [8,22], 1 unfound
> pg 4.ef5 is active+recovering, acting [6,22], 1 unfound
> pg 4.e9c is active+recovering, acting [7,24], 1 unfound
> pg 4.e12 is active+recovering, acting [7,22], 1 unfound
> pg 4.e0e is active+recovering, acting [9,24], 1 unfound
> pg 4.ddc is active+recovering, acting [10,26], 1 unfound
> pg 4.d95 is active+recovering, acting [10,25], 1 unfound
> pg 4.ccf is active+recovering, acting [10,24], 1 unfound
> pg 4.c84 is active+recovering, acting [6,22], 2 unfound
> pg 4.c4e is active+recovering, acting [10,23], 1 unfound
> pg 4.bca is active+recovering, acting [6,26], 1 unfound
> pg 4.bbf is active+recovering, acting [8,26], 1 unfound
> pg 4.b5e is active+recovering, acting [6,26], 1 unfound
> pg 4.ae1 is active+recovering, acting [8,26], 1 unfound
> pg 4.a9c is active+recovering, acting [7,23], 1 unfound
> pg 4.a39 is active+recovering, acting [10,24], 1 unfound
> pg 4.85f is active+recovering, acting [9,25], 1 unfound
> pg 4.83b is active+recovering, acting [10,25], 1 unfound
> pg 4.7b8 is active+recovering, acting [7,26], 1 unfound
> pg 4.758 is active+recovering, acting [8,23], 1 unfound
> pg 4.740 is active+recovery_wait, acting [11,21], 1 unfound
> pg 4.6f6 is active+recovering, acting [10,26], 2 unfound
> pg 4.68d is active+recovering, acting [8,24], 2 unfound
> pg 4.635 is active+recovery_wait, acting [11,22], 1 unfound
> pg 4.60f is active+recovering, acting [6,22], 1 unfound
> pg 4.603 is active+recovering, acting [9,24], 1 unfound
> pg 4.5e1 is active+recovering, acting [10,25], 1 unfound
> pg 4.579 is active+recovering, acting [9,21], 1 unfound
> pg 4.56a is active+recovering, acting [7,24], 1 unfound
> pg 4.519 is active+recovering, acting [9,21], 1 unfound
> pg 4.435 is active+recovering, acting [6,26], 1 unfound
> pg 4.42e is active+recovering, acting [7,22], 1 unfound
> pg 4.30b is active+recovering, acting [7,25], 2 unfound
> pg 4.1bb is active+recovery_wait, acting [11,22], 1 unfound
> pg 4.178 is active+recovering, acting [8,23], 1 unfound
> pg 4.43 is active+recovering, acting [9,23], 1 unfound

Any help would be greatly appreciated.

./JRH
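In case it helps anyone following along: a rough sketch of how the unfound PG ids can be pulled out of the ‘ceph health detail’ output above, so they can be fed one at a time to ‘ceph pg <pgid> query’ (to inspect which sources haven't been probed) or to ‘ceph pg <pgid> mark_unfound_lost’. This assumes the exact output format shown above and uses a captured sample rather than a live cluster:

```shell
# Rough sketch: extract the PG id (field 2) from each "pg ... unfound" line.
# Sample captured from 'ceph health detail' output; on a live cluster one
# would pipe 'ceph health detail' directly into the awk command instead.
sample='pg 4.ffe is active+recovering, acting [7,26], 3 unfound
pg 4.feb is active+recovering, acting [10,23], 1 unfound
pg 4.fa6 is active+recovery_wait, acting [11,25], 2 unfound'

# Keep only lines that start with "pg" and mention unfound objects,
# then print the second whitespace-separated field (the PG id).
printf '%s\n' "$sample" | awk '/^pg/ && /unfound/ {print $2}'
```

Each id printed could then be used in a loop, e.g. ‘for pg in $(…); do ceph pg "$pg" query; done’, to check each PG's state before deciding whether marking objects lost is safe.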
