On 04 Dec 2017 10:22, Gonzalo Aguilar Delgado wrote:
Hello,

Things are getting worse every day.


ceph -w
     cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
      health HEALTH_ERR
             1 pgs are stuck inactive for more than 300 seconds
             8 pgs inconsistent
             1 pgs repair
             1 pgs stale
             1 pgs stuck stale
            recovery 20266198323167232/288980 objects degraded (7013010700798.405%)
             37154696925806624 scrub errors
             no legacy OSD present but 'sortbitwise' flag is not set


But I'm finally finding time to recover. The disk seems to be fine: no SMART errors and everything looks good, it's just Ceph that won't start. Today I started looking into the ceph-objectstore-tool, which I don't know much about.

It just works fine. It doesn't crash the way the OSD does.
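For the record, the kind of read-only invocation I mean is something like this (XX is the OSD id, default filestore paths assumed):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-XX \
    --journal-path /var/lib/ceph/osd/ceph-XX/journal --op list-pgs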

So I'm lost. Since both the OSD and ceph-objectstore-tool use the same backend, how is this possible?

Can someone help me fix this, please?



this line seems quite insane:
recovery 20266198323167232/288980 objects degraded (7013010700798.405%)

there is obviously something wrong in your cluster. once the defective osd is down/out, does the cluster eventually heal to HEALTH_OK?
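if it is not already marked out, you can do that yourself and watch whether the cluster heals, something like this (XX being the osd id):

ceph osd out XX
ceph -w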

you should start by reading and understanding this page:
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

also, in order to get assistance you need to provide a lot more detail:
how many nodes, how many osd's per node, what kind of nodes (cpu/ram), and what kind of networking setup.

show the output from
ceph -s
ceph osd tree
ceph osd pool ls detail
ceph health detail




since you are systematically losing osd's i would start by checking the timestamp in the defective osd's log for when it died. double-check your clock sync settings so that all servers are time synchronized, and then check all logs for the time in question.
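to verify the time sync, something like this on every node should do (assuming ntpd; with chrony it would be 'chronyc sources' instead):

timedatectl status
ntpq -p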

especially dmesg: did the OOM killer do something? was networking flaky?
mon logs? did they complain about the osd in some fashion?
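something along these lines should show whether the oom killer was involved and what the mons logged about that osd (XX = osd id, assuming the default log location):

dmesg -T | grep -iE 'out of memory|oom|killed process'
grep 'osd.XX' /var/log/ceph/ceph-mon.*.log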


also, since you fail to start the osd again, there is probably some corruption going on. bump the log level for that osd in the node's ceph.conf, something like

[osd.XX]
debug osd = 20

rename the log for the osd so you have a fresh file, and try to start the osd once. put the log on some pastebin and send the url. read http://ceph.com/planet/how-to-increase-debug-levels-and-harvest-a-detailed-osd-log/ for details.
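in practice that is something like this on the osd node (XX = osd id, assuming a systemd based install and the default log path):

mv /var/log/ceph/ceph-osd.XX.log /var/log/ceph/ceph-osd.XX.log.old
systemctl start ceph-osd@XX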



generally: try to make it easy for people to help you without having to drag details out of you. If you can collect all of the above on a pastebin like http://paste.debian.net/ instead of piecing it together from 3-4 different email threads, you will find a lot more eyeballs willing to give it a look.



good luck and kind regards
Ronny Aasen



_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
