Dear experts,
Recently, a disk for one of our OSDs failed and took the OSD down. After I recovered the disk and the filesystem, I noticed two problems:

1. Journal corruption, which prevents the OSD from starting:

-2> 2014-05-28 22:21:19.592034 7f5c6ff437a0 1 journal _open /var/lib/ceph/osd/ceph-1/journal fd 20: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
-1> 2014-05-28 22:21:19.606611 7f5c6ff437a0 -1 journal Unable to read past sequence 595649608 but header indicates the journal has committed up through 595649647, journal is corrupt
0> 2014-05-28 22:21:19.608234 7f5c6ff437a0 -1 os/FileJournal.cc: In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, bool*)' thread 7f5c6ff437a0 time 2014-05-28 22:21:19.606625
os/FileJournal.cc: 1697: FAILED assert(0)


2. I guess I could use ceph-osd with the "--mkjournal" option to fix the journal corruption (a rough sketch of what I have in mind follows below). But another thing bothers me: the previous osd daemon is stuck in the "D" state, so it cannot be terminated. Usually, once the filesystem is recovered, a process should be able to leave the D state, so I am not sure what is causing this or whether I can ignore it without any bad consequence. The ps output after the sketch shows the stuck process:
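
For clarity, this is roughly the sequence I have in mind for recreating the journal. It is only a sketch; I am assuming the filestore data on osd.1 is intact and that recreating the journal simply discards whatever was not yet committed, so please correct me if that is wrong:

# keep the cluster from rebalancing while osd.1 is out
ceph osd set noout

# once the old ceph-osd process is finally gone (the stuck one shown below),
# recreate the journal for osd.1 and bring the osd back
ceph-osd -i 1 --mkjournal -c /etc/ceph/ceph.conf
service ceph start osd.1

# let recovery proceed again
ceph osd unset noout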


USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root 22465 11.1 1.3 1516668 343624 ? Dsl Feb03 18441:31 /usr/bin/ceph-osd -i 1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf
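
If it helps to dig further, I assume generic Linux checks like the following would show where the process is blocked in the kernel (nothing Ceph-specific; 22465 is the stuck ceph-osd above):

# wait channel and state of the stuck process
ps -o pid,stat,wchan:32,cmd -p 22465

# kernel stack of the stuck task, if /proc/<pid>/stack is available
cat /proc/22465/stack

# or dump all blocked (D-state) tasks to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 100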



BTW, looking at lsof, the process is stuck on several Input/output errors, and the OSD's client connections are staying in CLOSE_WAIT, e.g.:


ceph-osd 22465 root 202u unknown /current/13.66_head/DIR_6/DIR_E/DIR_A/DIR_7/rbd\udata.112bf2dbc003c.00000000000b385b__head_5A7B7AE6__d (stat: Input/output error)
ceph-osd 22465 root 203u unknown /current/3.5a_head/DIR_A/DIR_D/DIR_8/rb.0.13a5.2ae8944a.000000140899__head_13C9C8DA__3 (stat: Input/output error)
ceph-osd 22465 root 204u unknown /current/3.4c_head/DIR_C/DIR_4/DIR_9/rb.0.13a5.2ae8944a.0000000809f1__head_3E44C94C__3 (stat: Input/output error)
ceph-osd 22465 root 205u unknown /current/13.52_head/DIR_2/DIR_D/DIR_1/DIR_7/rbd\udata.112bf2dbc003c.00000000000b3920__head_C41071D2__d (stat: Input/output error)
ceph-osd 22465 root 206u unknown /current/13.5c_head/DIR_C/DIR_D/DIR_B/DIR_6/rbd\udata.112bf2dbc003c.00000000000b3922__head_EE946BDC__d (stat: Input/output error)
ceph-osd 22465 root 207u unknown /current/13.5a_head/DIR_A/DIR_D/DIR_C/DIR_B/rbd\udata.112bf2dbc003c.00000000000b3934__head_031BBCDA__d (stat: Input/output error)
ceph-osd 22465 root 208u unknown /current/13.27_head/DIR_7/DIR_A/DIR_3/DIR_2/rbd\udata.112bf2dbc003c.00000000000b3928__head_A2CF23A7__d (stat: Input/output error)
ceph-osd 22465 root 209u unknown /current/13.6f_head/DIR_F/DIR_6/DIR_8/DIR_8/rbd\udata.112bf2dbc003c.00000000000b392a__head_71AA886F__d (stat: Input/output error)
ceph-osd 22465 root 210u unknown /current/13.66_head/DIR_6/DIR_E/DIR_E/DIR_2/rbd\udata.112bf2dbc003c.00000000000b392c__head_30B22EE6__d (stat: Input/output error)
ceph-osd 22465 root 211u unknown /current/13.69_head/DIR_9/DIR_E/DIR_9/DIR_C/rbd\udata.112bf2dbc003c.00000000000b3932__head_9C85C9E9__d (stat: Input/output error)
ceph-osd 22465 root 212u unknown /current/13.51_head/DIR_1/DIR_5/DIR_7/DIR_F/rbd\udata.112bf2dbc003c.00000000000b3700__head_9BE9F751__d (stat: Input/output error)
ceph-osd 22465 root 213u unknown /current/13.33_head/DIR_3/DIR_3/DIR_5/DIR_D/rbd\udata.112bf2dbc003c.00000000000b372e__head_1033D533__d (stat: Input/output error)
ceph-osd 22465 root 214u unknown /current/13.2b_head/DIR_B/DIR_2/DIR_0/DIR_8/rbd\udata.117042a014b22.0000000000004c31__head_1E6A802B__d (stat: Input/output error)
ceph-osd 22465 root 215u unknown /current/13.41_head/DIR_1/DIR_4/DIR_A/DIR_7/rbd\udata.3353793a09e6.0000000000194810__head_ADA57A41__d (stat: Input/output error)
ceph-osd 22465 root 216u unknown /current/13.5b_head/DIR_B/DIR_5/DIR_A/DIR_D/rbd\udata.3353793a09e6.00000000001936c6__head_C01BDA5B__d (stat: Input/output error)
ceph-osd 22465 root 217u unknown /current/13.4b_head/DIR_B/DIR_4/DIR_4/DIR_C/rbd\udata.3353793a09e6.0000000000193773__head_014DC44B__d (stat: Input/output error)
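
For the CLOSE_WAIT connections mentioned above, I assume something like this would enumerate them (again a generic check, not Ceph-specific; 22465 is the stuck osd):

# TCP sockets of the stuck osd that are sitting in CLOSE_WAIT
ss -tnp state close-wait | grep 22465

# older netstat equivalent
netstat -tnp | grep 22465 | grep CLOSE_WAIT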


In any case, I would be very grateful if you experts could shed some light on this.

Our current Ceph version is ceph-0.72.2-0.el6.x86_64.
The filesystem backend is XFS on fibre direct-attached storage.



Thanks in advance
&
Best regards,
Felix Lee ~

--
Felix Lee                               Academia Sinica Grid & Cloud.
Tel: +886-2-27898308
Office: Room P111, Institute of Physics, 128 Academia Road, Section 2, Nankang, Taipei 115, Taiwan
