I'd stop that OSD daemon and run xfs_check/xfs_repair on that partition.
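Something like this, for example (osd.70, /dev/sdl1, and the mount point
are taken from your mail below; the upstart command assumes Ubuntu, so
adjust for your init system):

    # stop the daemon and unmount before touching the filesystem
    stop ceph-osd id=70              # or: service ceph stop osd.70
    umount /var/lib/ceph/osd/ceph-70

    # dry run first; -n reports problems without modifying anything
    xfs_repair -n /dev/sdl1
    xfs_repair /dev/sdl1

If xfs_repair complains about a dirty log, mount and unmount the
filesystem once to replay it; the -L option zeroes the log and can lose
data, so treat it as a last resort.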
If you repair anything, you should probably force a deep-scrub on all the
PGs on that disk. I think ceph osd deep-scrub <osdid> will do that, but
you might have to manually grep the output of ceph pg dump. Or you could
just treat it like a failed disk, but re-use the disk; ceph-disk-prepare
--zap-disk should take care of you. (A rough sketch of both approaches
follows the quoted thread below.)

On Thu, Nov 6, 2014 at 5:06 PM, Shain Miley <[email protected]> wrote:
> I tried restarting all the OSDs on that node; osd.70 was the only ceph
> process that did not come back online.
>
> There is nothing in the ceph-osd log for osd.70.
>
> However, I do see over 13,000 of these messages in the kern.log:
>
> Nov  6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1):
> xfs_log_force: error 5 returned.
>
> Does anyone have any suggestions on how I might be able to get this HD
> back in the cluster (or whether or not it is even worth trying)?
>
> Thanks,
>
> Shain
>
> Shain Miley | Manager of Systems and Infrastructure, Digital Media |
> [email protected] | 202.513.3649
>
> ________________________________________
> From: Shain Miley [[email protected]]
> Sent: Tuesday, November 04, 2014 3:55 PM
> To: [email protected]
> Subject: osd down
>
> Hello,
>
> We are running ceph version 0.80.5 with 108 OSDs.
>
> Today I noticed that one of the OSDs is down:
>
> root@hqceph1:/var/log/ceph# ceph -s
>     cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
>      health HEALTH_WARN crush map has legacy tunables
>      monmap e1: 3 mons at
> {hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0},
> election epoch 146, quorum 0,1,2 hqceph1,hqceph2,hqceph3
>      osdmap e7119: 108 osds: 107 up, 107 in
>       pgmap v6729985: 3208 pgs, 17 pools, 81193 GB data, 21631 kobjects
>             216 TB used, 171 TB / 388 TB avail
>                 3204 active+clean
>                    4 active+clean+scrubbing
>   client io 4079 kB/s wr, 8 op/s
>
> Using osd dump I determined that it is osd number 70:
>
> osd.70 down out weight 0 up_from 2668 up_thru 6886 down_at 6913
> last_clean_interval [488,2665) 10.35.1.217:6814/22440
> 10.35.1.217:6820/22440 10.35.1.217:6824/22440 10.35.1.217:6830/22440
> autoout,exists 5dbd4a14-5045-490e-859b-15533cd67568
>
> Looking at that node, the drive is still mounted, I did not see any
> errors in any of the system logs, and the RAID status shows the drive
> as up and healthy, etc.
>
> root@hqosd6:~# df -h | grep 70
> /dev/sdl1       3.7T  1.9T  1.9T  51% /var/lib/ceph/osd/ceph-70
>
> I was hoping that someone might be able to advise me on the next course
> of action (can I add the OSD back in, should I replace the drive
> altogether, etc.).
>
> I have attached the osd log to this email.
>
> Any suggestions would be great.
>
> Thanks,
>
> Shain
>
> --
> Shain Miley | Manager of Systems and Infrastructure, Digital Media |
> [email protected] | 202.513.3649
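In case it helps, here is the sketch of both options mentioned above,
assuming osd.70 and /dev/sdl from the thread (commands are from memory
against 0.80-era tooling, so double-check them against your version):

    # option 1: after a successful repair, deep-scrub what's on that OSD
    ceph osd deep-scrub 70

    # or list the PGs mapped to osd.70 and scrub them one by one
    # (the grep is crude and will also match other fields containing 70)
    ceph pg dump pgs_brief | grep -w 70
    ceph pg deep-scrub <pgid>          # for each matching PG

    # option 2: treat it as a failed disk and rebuild the OSD on the
    # same drive
    ceph osd out 70                    # then wait for recovery to finish
    ceph osd crush remove osd.70
    ceph auth del osd.70
    ceph osd rm 70
    ceph-disk-prepare --zap-disk /dev/sdl

Given the xfs_log_force error 5 (EIO) messages in your kern.log, I'd lean
toward option 2, and keep an eye on the drive's SMART data afterwards.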
_______________________________________________ ceph-users mailing list [email protected] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
