Dear all,

We recently found a problem on our OSS nodes: some service threads may hang when the server is under heavy I/O load. When this happens, some clients are evicted or refused by some OSTs, and we see the following error messages:

Server side:

May 30 11:06:31 boss07 kernel: Lustre: Service thread pid 8011 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
May 30 11:06:31 boss07 kernel: Lustre: Skipped 1 previous similar message
May 30 11:06:31 boss07 kernel: Pid: 8011, comm: ll_ost_71
May 30 11:06:31 boss07 kernel:
May 30 11:06:31 boss07 kernel: Call Trace:
May 30 11:06:31 boss07 kernel: [<ffffffff886f5d0e>] start_this_handle+0x301/0x3cb [jbd2]
May 30 11:06:31 boss07 kernel: [<ffffffff800a09ca>] autoremove_wake_function+0x0/0x2e
May 30 11:06:31 boss07 kernel: [<ffffffff886f5e83>] jbd2_journal_start+0xab/0xdf [jbd2]
May 30 11:06:31 boss07 kernel: [<ffffffff888ce9b2>] fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs]
May 30 11:06:31 boss07 kernel: [<ffffffff88920551>] filter_version_get_check+0x91/0x2a0 [obdfilter]
May 30 11:06:31 boss07 kernel: [<ffffffff80036cf4>] __lookup_hash+0x61/0x12f
May 30 11:06:31 boss07 kernel: [<ffffffff8893108d>] filter_setattr_internal+0x90d/0x1de0 [obdfilter]
May 30 11:06:31 boss07 kernel: [<ffffffff800e859b>] lookup_one_len+0x53/0x61
May 30 11:06:31 boss07 kernel: [<ffffffff88925452>] filter_fid2dentry+0x512/0x740 [obdfilter]
May 30 11:06:31 boss07 kernel: [<ffffffff88924e27>] filter_fmd_get+0x2b7/0x320 [obdfilter]
May 30 11:06:31 boss07 kernel: [<ffffffff8003027b>] __up_write+0x27/0xf2
May 30 11:06:31 boss07 kernel: [<ffffffff88932721>] filter_setattr+0x1c1/0x3b0 [obdfilter]
May 30 11:06:31 boss07 kernel: [<ffffffff8882677a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff8881e658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff88822b05>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff888b0abb>] ost_handle+0x25db/0x55b0 [ost]
May 30 11:06:31 boss07 kernel: [<ffffffff80150d56>] __next_cpu+0x19/0x28
May 30 11:06:31 boss07 kernel: [<ffffffff800767ae>] smp_send_reschedule+0x4e/0x53
May 30 11:06:31 boss07 kernel: [<ffffffff8883215a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff888328a8>] ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff8008b3bd>] __wake_up_common+0x3e/0x68
May 30 11:06:31 boss07 kernel: [<ffffffff88833817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
May 30 11:06:31 boss07 kernel: [<ffffffff888328e0>] ptlrpc_main+0x0/0x10f0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
May 30 11:06:31 boss07 kernel:
May 30 11:06:31 boss07 kernel: LustreError: dumping log to /tmp/lustre-log.1338347191.8011

Client side:

May 30 09:58:36 ccopt kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.123@tcp. The ost_connect operation failed with -16

Once this error appears, we cannot run commands such as "ls", "df", "vi", and "touch", so we are unable to do anything in the file system. I think an ost_connect failure should report an error message to the user instead of leaving interactive commands stuck.

Could someone give us advice or suggestions on how to solve this problem? Thank you very much in advance.

Best Regards,
Qiulan Huang
2012-05-30

====================================================================
Computing Center, the Institute of High Energy Physics, China
Huang, Qiulan
Tel: (+86) 10 8823 6010-105
Fax: (+86) 10 8823 6839
P.O. Box 918-7, Beijing 100049, P.R. China
Email: [email protected]
====================================================================
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
