I do…

In my case, I have colocated the MONs with some OSDs. As recently as Saturday, 
when I lost data again, I found that one of the MON+OSD nodes had run out of 
memory and started killing ceph-mon on that node.
At the same moment, all OSDs started complaining that they could not see OSDs 
on other machines.
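
To confirm which processes were actually killed for lack of memory, the kernel log can be scanned for the OOM killer's "Killed process" lines. Below is a minimal Python sketch, assuming the log text has already been collected (for example from `dmesg`). The function name `find_oom_kills` is hypothetical, and the exact wording of kernel messages varies between kernel versions, so the pattern is an assumption rather than a guaranteed match.

```python
import re

# Assumed message form: "Killed process <pid> (<name>) ...".
# Kernel wording differs between versions; adjust the pattern if needed.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return a list of (pid, process_name) tuples for OOM kills in log_text."""
    kills = []
    for line in log_text.splitlines():
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

Feeding it the kernel log of each MON+OSD node and looking for ceph-mon or ceph-osd in the result would show whether the kills line up in time with the OSDs losing sight of each other.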

I suspect that when a node runs out of memory, other things break as well, for 
instance networking (no memory, so no network buffers?). But I can’t explain the 
unfound objects: in my case, as in yours, the nodes did not crash, and 
ceph-osd did not crash either. Hence I’m assuming no data was lost to a sudden 
disk power-off, for instance, or to a kernel or RAID controller cache.

For now, I’m considering moving the MONs onto dedicated nodes, hoping that 
running out of memory was my issue.

From: ceph-users [mailto:[email protected]] On behalf of Diego 
Castro
Sent: Wednesday, 1 June 2016 10:25
To: ceph-users <[email protected]>
Subject: [ceph-users] OSD Restart results in "unfound objects"

Hello, I have a cluster running Jewel 10.2.0, with 25 OSDs + 4 MONs.
Today my cluster suddenly went unhealthy, with lots of stuck PGs due to unfound 
objects. There were no disk failures or node crashes; it just went bad.

I managed to bring the cluster back to a healthy state by marking the lost 
objects for deletion with "ceph pg <id> mark_unfound_lost delete".
I still have no idea why the cluster went bad, but I noticed that restarting 
the OSD daemons to unlock stuck clients makes the cluster unhealthy again, with 
PGs stuck once more due to unfound objects.
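
Before marking objects as lost, it can help to see which PGs report unfound objects and how many. Below is a minimal Python sketch, assuming `ceph health detail` prints per-PG lines of the form `pg 2.4 is active+degraded, 78 unfound`; the output format may differ between releases. The function name `unfound_pgs` is hypothetical.

```python
import re

# Assumed line format: "pg <pgid> is <state>, <N> unfound"
UNFOUND_RE = re.compile(r"^pg (\S+) is (\S+), (\d+) unfound")


def unfound_pgs(health_detail):
    """Map PG id -> number of unfound objects, parsed from health detail text."""
    result = {}
    for line in health_detail.splitlines():
        match = UNFOUND_RE.match(line.strip())
        if match:
            result[match.group(1)] = int(match.group(3))
    return result
```

Each PG id returned can then be inspected with "ceph pg <id> query" before deciding whether mark_unfound_lost is really the only option.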

Has anyone else run into this issue?

---
Diego Castro / The CloudFather
GetupCloud.com - Eliminamos a Gravidade
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
