Hi, I think you might hit this: http://jira.whamcloud.com/browse/LU-952 , you can find the patch from this ticket
Regards Liang On May 30, 2012, at 11:21 AM, huangql wrote: > Dear all, > > Recently we found the problem in OSS that some threads might be hung when the > server got heavy IO load. In this case, some clients will be evicted or > refused by some OSTs and got the error messages as following: > > Server side: > > May 30 11:06:31 boss07 kernel: Lustre: Service thread pid 8011 was inactive > for 200.00s. The thread might be hung, or it might only be slow and will > resume later. D > umping the stack trace for debugging purposes: May 30 11:06:31 boss07 kernel: > Lustre: Skipped 1 previous similar message > May 30 11:06:31 boss07 kernel: Pid: 8011, comm: ll_ost_71 > May 30 11:06:31 boss07 kernel: > May 30 11:06:31 boss07 kernel: Call Trace: > May 30 11:06:31 boss07 kernel: [<ffffffff886f5d0e>] > start_this_handle+0x301/0x3cb [jbd2] > May 30 11:06:31 boss07 kernel: [<ffffffff800a09ca>] > autoremove_wake_function+0x0/0x2e > May 30 11:06:31 boss07 kernel: [<ffffffff886f5e83>] > jbd2_journal_start+0xab/0xdf [jbd2] > May 30 11:06:31 boss07 kernel: [<ffffffff888ce9b2>] > fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs] > May 30 11:06:31 boss07 kernel: [<ffffffff88920551>] > filter_version_get_check+0x91/0x2a0 [obdfilter] > May 30 11:06:31 boss07 kernel: [<ffffffff80036cf4>] __lookup_hash+0x61/0x12f > May 30 11:06:31 boss07 kernel: [<ffffffff8893108d>] > filter_setattr_internal+0x90d/0x1de0 [obdfilter] > May 30 11:06:31 boss07 kernel: [<ffffffff800e859b>] lookup_one_len+0x53/0x61 > May 30 11:06:31 boss07 kernel: [<ffffffff88925452>] > filter_fid2dentry+0x512/0x740 [obdfilter] > May 30 11:06:31 boss07 kernel: [<ffffffff88924e27>] > filter_fmd_get+0x2b7/0x320 [obdfilter] > May 30 11:06:31 boss07 kernel: [<ffffffff8003027b>] __up_write+0x27/0xf2 > May 30 11:06:31 boss07 kernel: [<ffffffff88932721>] > filter_setattr+0x1c1/0x3b0 [obdfilter] > May 30 11:06:31 boss07 kernel: [<ffffffff8882677a>] > lustre_pack_reply_flags+0x86a/0x950 [ptlrpc] > May 30 11:06:31 boss07 kernel: [<ffffffff8881e658>] > ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc] > May 30 11:06:31 boss07 kernel: [<ffffffff88822b05>] > lustre_msg_get_version+0x35/0xf0 [ptlrpc] > May 30 11:06:31 boss07 kernel: [<ffffffff888b0abb>] ost_handle+0x25db/0x55b0 > [ost] > May 30 11:06:31 boss07 kernel: [<ffffffff80150d56>] __next_cpu+0x19/0x28 > May 30 11:06:31 boss07 kernel: [<ffffffff800767ae>] > smp_send_reschedule+0x4e/0x53 > May 30 11:06:31 boss07 kernel: [<ffffffff8883215a>] > ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc] > May 30 11:06:31 boss07 kernel: [<ffffffff888328a8>] > ptlrpc_wait_event+0x2d8/0x310 [ptlrpc] > May 30 11:06:31 boss07 kernel: [<ffffffff8008b3bd>] > __wake_up_common+0x3e/0x68 > May 30 11:06:31 boss07 kernel: [<ffffffff88833817>] ptlrpc_main+0xf37/0x10f0 > [ptlrpc] > May 30 11:06:31 boss07 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11 > May 30 11:06:31 boss07 kernel: [<ffffffff888328e0>] ptlrpc_main+0x0/0x10f0 > [ptlrpc] > May 30 11:06:31 boss07 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11 > May 30 11:06:31 boss07 kernel: > May 30 11:06:31 boss07 kernel: LustreError: dumping log to > /tmp/lustre-log.1338347191.8011 > > > Client side: > > May 30 09:58:36 ccopt kernel: LustreError: 11-0: an error occurred while > communicating with 192.168.50.123@tcp. The ost_connect operation failed with > -16 > > When you got this error message, you failed to run "ls", "df" ,"vi", "touch" > and so on, which affect us to do anything in the file system. > I think the ost_connect failure could report some error messages to users > instead of causing any interactive actions stuck. > > Could someone give us some advice or any suggestions to solve this problem? > > Thank you very much in advance. > > > Best Regards > Qiulan Huang > 2012-05-30 > ==================================================================== > Computing center,the Institute of High Energy Physics, China > Huang, Qiulan Tel: (+86) 10 8823 6010-105 > P.O. Box 918-7 Fax: (+86) 10 8823 6839 > Beijing 100049 P.R. China Email: [email protected] > =================================================================== > > >
_______________________________________________ Lustre-discuss mailing list [email protected] http://lists.lustre.org/mailman/listinfo/lustre-discuss
