On 04 Dec 2017 10:22, Gonzalo Aguilar Delgado wrote:
Hello,

Things are getting worse every day.


ceph -w
     cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
      health HEALTH_ERR
             1 pgs are stuck inactive for more than 300 seconds
             8 pgs inconsistent
             1 pgs repair
             1 pgs stale
             1 pgs stuck stale
            recovery 20266198323167232/288980 objects degraded (7013010700798.405%)
             37154696925806624 scrub errors
             no legacy OSD present but 'sortbitwise' flag is not set


But I'm finally finding time to recover. The disk seems to be fine: no SMART errors and everything looks good, it's just Ceph that won't start. Today I started looking into the ceph-objectstore-tool, which I don't know much about.

It just works fine. It doesn't crash the way the OSD does.
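For the record, the kind of read-only invocation I mean is something like this (XX is the OSD id, default filestore paths assumed):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-XX \
    --journal-path /var/lib/ceph/osd/ceph-XX/journal --op list-pgs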

So I'm lost. Since both the OSD and ceph-objectstore-tool use the same backend, how is this possible?

Can someone help me fix this, please?



this line seems quite insane:
recovery 20266198323167232/288980 objects degraded (7013010700798.405%)

there is obviously something wrong in your cluster. once the defective osd is down/out, does the cluster eventually heal to HEALTH_OK?
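if it is not already marked out, you can do that yourself and watch whether the cluster heals, something like this (XX being the osd id):

ceph osd out XX
ceph -w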

you should start by reading and understanding this page:
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

also, in order to get assistance you need to provide a lot more detail:
how many nodes, how many osd's per node, what kind of nodes (cpu/ram), and what kind of networking setup.

show the output from
ceph -s
ceph osd tree
ceph osd pool ls detail
ceph health detail




since you are systematically losing osd's i would start by checking the timestamp in the defective osd's log for when it died. double-check your clock sync settings so that all servers are time synchronized, and then check all logs for the time in question.
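to verify the time sync, something like this on every node should do (assuming ntpd; with chrony it would be 'chronyc sources' instead):

timedatectl status
ntpq -p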

especially dmesg: did the OOM killer do something? was networking flaky?
mon logs? did they complain about the osd in some fashion?
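something along these lines should show whether the oom killer was involved and what the mons logged about that osd (XX = osd id, assuming the default log location):

dmesg -T | grep -iE 'out of memory|oom|killed process'
grep 'osd.XX' /var/log/ceph/ceph-mon.*.log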


also, since you fail to start the osd again, there is probably some corruption going on. bump the log level for that osd in the node's ceph.conf, something like

[osd.XX]
debug osd = 20

rename the log for the osd so you have a fresh file, and try to start the osd once. put the log on some pastebin and send the url. read http://ceph.com/planet/how-to-increase-debug-levels-and-harvest-a-detailed-osd-log/ for details.
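in practice that is something like this on the osd node (XX = osd id, assuming a systemd based install and the default log path):

mv /var/log/ceph/ceph-osd.XX.log /var/log/ceph/ceph-osd.XX.log.old
systemctl start ceph-osd@XX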



generally: try to make it easy for people to help you without having to drag details out of you. If you can collect all of the above on a pastebin like http://paste.debian.net/ instead of piecing it together from 3-4 different email threads, you will find a lot more eyeballs willing to give it a look.



good luck and kind regards
Ronny Aasen



_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
