I'd stop that OSD daemon and run xfs_check / xfs_repair on that partition.
(Error 5 from xfs_log_force is EIO, i.e. XFS is getting I/O errors back from
the device, which is why the daemon won't start.)
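Something like this, with the device and mount point taken from your df
output below (the stop command is upstart syntax, so substitute your init
system's equivalent if you're not on Ubuntu):

  # stop the daemon, unmount, then repair
  stop ceph-osd id=70
  umount /var/lib/ceph/osd/ceph-70
  xfs_repair -n /dev/sdl1    # -n is a dry run; drop it to actually repair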

If you repair anything, you should probably force a deep-scrub on all the
PGs on that disk.  I think ceph osd deep-scrub <osdid> will do that, but
you might have to grep the output of ceph pg dump manually.
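For example (osd id 70 assumed; the grep pattern is a rough sketch, so
check it against your actual pg dump output before trusting it):

  ceph osd deep-scrub 70
  # or find the PGs whose acting set includes osd.70 and scrub them one by one:
  ceph pg dump | grep '\[70,' | awk '{print $1}' | \
    while read pg; do ceph pg deep-scrub $pg; done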


Or you could just treat it like a failed disk, but re-use the disk.
ceph-disk-prepare --zap-disk should take care of you.
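Roughly, assuming osd.70 lives on /dev/sdl and you're OK with the
rebalancing this triggers (double-check the device before zapping):

  ceph osd out 70              # if it isn't already marked out
  ceph osd crush remove osd.70
  ceph auth del osd.70
  ceph osd rm 70
  # destroys everything on /dev/sdl, then re-creates the OSD from scratch
  ceph-disk-prepare --zap-disk /dev/sdl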


On Thu, Nov 6, 2014 at 5:06 PM, Shain Miley <[email protected]> wrote:

> I tried restarting all the OSDs on that node; osd.70 was the only ceph
> process that did not come back online.
>
> There is nothing in the ceph-osd log for osd.70.
>
> However I do see over 13,000 of these messages in the kern.log:
>
> Nov  6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1): xfs_log_force: error 5 returned.
>
> Does anyone have any suggestions on how I might be able to get this HD
> back in the cluster (or whether or not it is worth even trying)?
>
> Thanks,
>
> Shain
>
> Shain Miley | Manager of Systems and Infrastructure, Digital Media |
> [email protected] | 202.513.3649
>
> ________________________________________
> From: Shain Miley [[email protected]]
> Sent: Tuesday, November 04, 2014 3:55 PM
> To: [email protected]
> Subject: osd down
>
> Hello,
>
> We are running ceph version 0.80.5 with 108 OSDs.
>
> Today I noticed that one of the OSDs is down:
>
> root@hqceph1:/var/log/ceph# ceph -s
>      cluster 504b5794-34bd-44e7-a8c3-0494cf800c23
>       health HEALTH_WARN crush map has legacy tunables
>       monmap e1: 3 mons at
> {hqceph1=10.35.1.201:6789/0,hqceph2=10.35.1.203:6789/0,hqceph3=10.35.1.205:6789/0},
> election epoch 146, quorum 0,1,2 hqceph1,hqceph2,hqceph3
>       osdmap e7119: 108 osds: 107 up, 107 in
>        pgmap v6729985: 3208 pgs, 17 pools, 81193 GB data, 21631 kobjects
>              216 TB used, 171 TB / 388 TB avail
>                  3204 active+clean
>                     4 active+clean+scrubbing
>    client io 4079 kB/s wr, 8 op/s
>
>
> Using osd dump I determined that it is osd number 70:
>
> osd.70 down out weight 0 up_from 2668 up_thru 6886 down_at 6913
> last_clean_interval [488,2665) 10.35.1.217:6814/22440
> 10.35.1.217:6820/22440 10.35.1.217:6824/22440 10.35.1.217:6830/22440
> autoout,exists 5dbd4a14-5045-490e-859b-15533cd67568
>
>
> Looking at that node, the drive is still mounted and I did not see any
> errors in any of the system logs, and the raid level status shows the
> drive as up and healthy, etc.
>
>
> root@hqosd6:~# df -h |grep 70
> /dev/sdl1       3.7T  1.9T  1.9T  51% /var/lib/ceph/osd/ceph-70
>
>
> I was hoping that someone might be able to advise me on the next course
> of action (can I add the OSD back in, should I replace the drive
> altogether, etc.).
>
> I have attached the osd log to this email.
>
> Any suggestions would be great.
>
> Thanks,
>
> Shain
>
> --
> Shain Miley | Manager of Systems and Infrastructure, Digital Media |
> [email protected] | 202.513.3649
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
