Dear all,

We recently found a problem on our OSS: some service threads may hang when the
server is under heavy I/O load. When this happens, some clients are evicted or
refused by some OSTs, and we get the following error messages:

Server side:

May 30 11:06:31 boss07 kernel: Lustre: Service thread pid 8011 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
May 30 11:06:31 boss07 kernel: Lustre: Skipped 1 previous similar message
May 30 11:06:31 boss07 kernel: Pid: 8011, comm: ll_ost_71 
May 30 11:06:31 boss07 kernel: 
May 30 11:06:31 boss07 kernel: Call Trace:
May 30 11:06:31 boss07 kernel:  [<ffffffff886f5d0e>] start_this_handle+0x301/0x3cb [jbd2]
May 30 11:06:31 boss07 kernel:  [<ffffffff800a09ca>] autoremove_wake_function+0x0/0x2e
May 30 11:06:31 boss07 kernel:  [<ffffffff886f5e83>] jbd2_journal_start+0xab/0xdf [jbd2]
May 30 11:06:31 boss07 kernel:  [<ffffffff888ce9b2>] fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs]
May 30 11:06:31 boss07 kernel:  [<ffffffff88920551>] filter_version_get_check+0x91/0x2a0 [obdfilter]
May 30 11:06:31 boss07 kernel:  [<ffffffff80036cf4>] __lookup_hash+0x61/0x12f
May 30 11:06:31 boss07 kernel:  [<ffffffff8893108d>] filter_setattr_internal+0x90d/0x1de0 [obdfilter]
May 30 11:06:31 boss07 kernel:  [<ffffffff800e859b>] lookup_one_len+0x53/0x61
May 30 11:06:31 boss07 kernel:  [<ffffffff88925452>] filter_fid2dentry+0x512/0x740 [obdfilter]
May 30 11:06:31 boss07 kernel:  [<ffffffff88924e27>] filter_fmd_get+0x2b7/0x320 [obdfilter]
May 30 11:06:31 boss07 kernel:  [<ffffffff8003027b>] __up_write+0x27/0xf2
May 30 11:06:31 boss07 kernel:  [<ffffffff88932721>] filter_setattr+0x1c1/0x3b0 [obdfilter]
May 30 11:06:31 boss07 kernel:  [<ffffffff8882677a>] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
May 30 11:06:31 boss07 kernel:  [<ffffffff8881e658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
May 30 11:06:31 boss07 kernel:  [<ffffffff88822b05>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
May 30 11:06:31 boss07 kernel:  [<ffffffff888b0abb>] ost_handle+0x25db/0x55b0 [ost]
May 30 11:06:31 boss07 kernel:  [<ffffffff80150d56>] __next_cpu+0x19/0x28
May 30 11:06:31 boss07 kernel:  [<ffffffff800767ae>] smp_send_reschedule+0x4e/0x53
May 30 11:06:31 boss07 kernel:  [<ffffffff8883215a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
May 30 11:06:31 boss07 kernel:  [<ffffffff888328a8>] ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
May 30 11:06:31 boss07 kernel:  [<ffffffff8008b3bd>] __wake_up_common+0x3e/0x68
May 30 11:06:31 boss07 kernel:  [<ffffffff88833817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
May 30 11:06:31 boss07 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
May 30 11:06:31 boss07 kernel:  [<ffffffff888328e0>] ptlrpc_main+0x0/0x10f0 [ptlrpc]
May 30 11:06:31 boss07 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
May 30 11:06:31 boss07 kernel:
May 30 11:06:31 boss07 kernel: LustreError: dumping log to /tmp/lustre-log.1338347191.8011
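
A note on the dump file name: the two numeric fields look like a Unix timestamp followed by the pid of the stuck thread. This is an inference from the values in the log above, not something taken from the Lustre documentation. A minimal Python sketch (the helper name parse_dump_name is hypothetical) that splits the name on that assumption:

```python
from datetime import datetime, timezone

def parse_dump_name(path):
    """Split '<prefix>.<epoch seconds>.<pid>' into (UTC datetime, pid).

    Assumes the last two dot-separated fields are an epoch timestamp
    and a pid, as the log lines above suggest.
    """
    _prefix, stamp, pid = path.rsplit(".", 2)
    when = datetime.fromtimestamp(int(stamp), tz=timezone.utc)
    return when, int(pid)

when, pid = parse_dump_name("/tmp/lustre-log.1338347191.8011")
print(when.isoformat(), pid)
```

The decoded time is 03:06:31 UTC on 2012-05-30, which corresponds to the 11:06:31 syslog timestamp if the server clock is at UTC+8, and the pid field matches the 8011 reported in the first message.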


Client side:

May 30 09:58:36 ccopt kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.123@tcp. The ost_connect operation failed with -16
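
For reference, the -16 in this message appears to be a negated Linux errno value. A minimal Python sketch, assuming standard Linux errno numbering (the helper name describe_rc is hypothetical), that decodes it:

```python
import errno
import os

def describe_rc(rc):
    """Return (symbolic name, message) for a negated errno such as -16."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

name, message = describe_rc(-16)
print(f"-16 -> {name}: {message}")
```

On Linux, 16 is EBUSY, so the OST seems to be answering the connection request with "busy" rather than dropping it.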

While this error persists, commands such as "ls", "df", "vi" and "touch"
cannot be run on the client, so we are unable to do anything in the file system.
I think an ost_connect failure should report an error message to the user
instead of leaving interactive commands stuck.

Could someone give us some advice or suggestions on how to solve this problem?

Thank you very much in advance.


Best Regards
Qiulan Huang
2012-05-30
====================================================================
Computing Center, the Institute of High Energy Physics, China
Huang, Qiulan                        Tel: (+86) 10 8823 6010-105
P.O. Box 918-7                       Fax: (+86) 10 8823 6839
Beijing 100049  P.R. China           Email: [email protected]
===================================================================     
                          


_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss