Hi,
--
First, let me start with the bonus...
I migrated from hammer => jewel and followed the migration instructions... but
the migration instructions are missing this :
#chown -R ceph:ceph /var/log/ceph
I just discovered this was the reason I could find no logs anywhere about my
current issue :/
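(For completeness: the full set of chowns on each node, with the daemons
stopped, seems to be something like
# chown -R ceph:ceph /var/lib/ceph
# chown -R ceph:ceph /var/log/ceph
the /var/lib/ceph one is in the upgrade notes, the /var/log/ceph one apparently
is not.)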
--
This is maybe the 3rd time this has happened to me... This time I'd like to try
to understand what is going on.
So: ceph-10.2.0-0.el7.x86_64 + CentOS 7.2 here.
Ceph health was happy, but any rbd operation was hanging - hence ceph was
effectively hung, and so were the test VMs running on it.
I placed my VM disks in an EC pool, on top of which I overlaid an SSD-backed
RBD cache pool. The EC pool is defined as a 3+1 pool, with 5 hosts hosting the
OSDs (and the failure domain set to host).
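(For reference, this is the usual cache-tier-over-EC arrangement; the pool and
profile names below are just placeholders, not my real ones:
# ceph osd erasure-code-profile get myprofile   <- shows k=3, m=1, ruleset-failure-domain=host
# ceph osd tier add ecpool ssd-cache
# ceph osd tier cache-mode ssd-cache writeback
# ceph osd tier set-overlay ecpool ssd-cache
)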
"Ceph -w" wasn't displaying new status lines as usual, but ceph health (detail)
wasn't saying anything would be wrong.
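(In hindsight, I suppose the OSD admin sockets would have shown the stuck
requests, with something like
# ceph daemon osd.<id> dump_ops_in_flight
# ceph daemon osd.<id> dump_historic_ops
but with an empty /var/log/ceph and a happy "ceph health", I didn't think of it
at the time.)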
After looking around, I found that the ceph logs were empty on one node, so I
decided to restart the OSDs on that node using : systemctl restart ceph-osd@*
After I did that, ceph -w came back to life, but told me there was a dead
MON - which I restarted too.
I watched some kind of recovery happening, and a few seconds/minutes later,
this is what I now see :
[root@ceph0 ~]# ceph health detail
HEALTH_WARN 4 pgs degraded; 3 pgs recovering; 1 pgs recovery_wait; 4 pgs stuck
unclean; recovery 57/373846 objects degraded (0.015%); recovery 57/110920
unfound (0.051%)
pg 691.65 is stuck unclean for 310704.556119, current state
active+recovery_wait+degraded, last acting [44,99,69,9]
pg 691.1e5 is stuck unclean for 493631.370697, current state
active+recovering+degraded, last acting [77,43,20,99]
pg 691.12a is stuck unclean for 14521.475478, current state
active+recovering+degraded, last acting [42,56,7,106]
pg 691.165 is stuck unclean for 14521.474525, current state
active+recovering+degraded, last acting [21,71,24,117]
pg 691.165 is active+recovering+degraded, acting [21,71,24,117], 15 unfound
pg 691.12a is active+recovering+degraded, acting [42,56,7,106], 1 unfound
pg 691.1e5 is active+recovering+degraded, acting [77,43,20,99], 2 unfound
pg 691.65 is active+recovery_wait+degraded, acting [44,99,69,9], 39 unfound
recovery 57/373846 objects degraded (0.015%)
recovery 57/110920 unfound (0.051%)
Damn.
Last time this happened, I was forced to declare the PGs lost in order to get
back to a "healthy" ceph, because ceph does not want to revert PGs in EC pools.
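(If I remember correctly, what I had to run last time was something like
# ceph pg <pgid> mark_unfound_lost revert   <- refused on the EC pool
# ceph pg <pgid> mark_unfound_lost delete   <- the only variant that was accepted
)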
But one of the VMs then started hanging randomly on disk IOs...
This same VM is now down, and I can't remove its disk from rbd: the "rbd rm" is
hanging at 99%. I could work around that by renaming the image and
re-installing the VM on a new disk, but anyway, I'd like to
understand + fix + make sure this does not happen again.
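(The workaround being roughly
# rbd -p <pool> rename <image> <image>.broken
and then creating a fresh image for the reinstalled VM.)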
We sometimes suffer power cuts here : if merely restarting daemons kills ceph
data, I don't want to imagine what would happen in case of a real power cut...
Back to the unfound objects. There is no down OSD that is still in the cluster
(only one is down - osd.46 - and I took it down myself, after having set its
weight to 0 last week).
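(As far as I can tell from
# ceph osd tree
only osd.46 is down, with weight 0, and everything else is up and in.)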
I can query the PGs, but I don't understand what I see in there.
For instance :
#ceph pg 691.65 query
(...)
"num_objects_missing": 0,
"num_objects_degraded": 39,
"num_objects_misplaced": 0,
"num_objects_unfound": 39,
"num_objects_dirty": 138,
And then for 2 peers I see :
"state": "active+undersized+degraded", ## undersized ???
(...)
"num_objects_missing": 0,
"num_objects_degraded": 138,
"num_objects_misplaced": 138,
"num_objects_unfound": 0,
"num_objects_dirty": 138,
"blocked_by": [],
"up_primary": 44,
"acting_primary": 44
If I look at the "missing" objects, I can see something on some OSDs :
# ceph pg 691.165 list_missing
(...)
{
    "oid": {
        "oid": "rbd_data.8de32431bd7b7.0000000000000ea7",
        "key": "",
        "snapid": -2,
        "hash": 971513189,
        "max": 0,
        "pool": 691,
        "namespace": ""
    },
    "need": "26521'22595",
    "have": "25922'22575",
    "locations": []
}
All of the missing objects show this kind of "need"/"have" discrepancy - if I
read it correctly, these are object versions, and the PG needs a newer version
of the object than the one it can find.
I can see such objects in a "691.165" directory on secondary OSDs, but I do not
see any 691.165 directory on the primary OSD (44)... ?
For instance :
[root@ceph0 ~]# ll
/var/lib/ceph/osd/ceph-21/current/691.165s0_head/*8de32431bd7b7.0000000000000ea7*
-rw-r--r-- 1 ceph ceph 1399392 May 15 13:18
/var/lib/ceph/osd/ceph-21/current/691.165s0_head/rbd\udata.8de32431bd7b7.0000000000000ea7__head_39E81D65__2b3_5843_0
-rw-r--r-- 1 ceph ceph 1399392 May 27 11:07
/var/lib/ceph/osd/ceph-21/current/691.165s0_head/rbd\udata.8de32431bd7b7.0000000000000ea7__head_39E81D65__2b3_ffffffffffffffff_0
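(I guess the next step would be to stop the involved OSDs one at a time and
look at what each shard really contains with ceph-objectstore-tool, something
like
# systemctl stop ceph-osd@21
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 \
  --journal-path /var/lib/ceph/osd/ceph-21/journal --pgid 691.165s0 --op list
# systemctl start ceph-osd@21
but I'd rather get some advice before poking at the shards directly.)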
Even so : assuming I had somehow lost data on that OSD 44 (how ??), I would
expect ceph to be able to reconstruct the missing data/PG from the erasure
coding (a 3+1 pool should survive the loss of any single shard) or from the RBD
replicas - yet it looks like it's not willing to ??
I already know that telling ceph to forget about the lost PGs is not a good
idea, as it causes the VMs using them to hang afterwards... and I'd prefer to
see ceph as a rock-solid solution that lets one recover from such "usual"
operations... ?
If anyone has ideas, I'd be happy to hear them... should I kill osd.44 for good
and recreate it ?
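(By "kill and recreate" I mean the usual procedure, something along the lines
of
# ceph osd out 44
# systemctl stop ceph-osd@44
# ceph osd crush remove osd.44
# ceph auth del osd.44
# ceph osd rm 44
and then re-adding the disk with ceph-disk/ceph-deploy - unless there is a
smarter way to handle this situation?)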
Thanks
P.S. : I already tried
# ceph tell osd.44 injectargs '--debug-osd 0/5 --debug-filestore 0/5'
or
# ceph tell osd.44 injectargs '--debug-osd 20/20 --debug-filestore 20/20'
but I tried this before I found the bonus at the start of this email, so there
were no logs anywhere anyway...