Hi, Qiulan LU-952 is about a deadlock issue, was the quota enabled? You could try to disable quota and see if the problem is gone.
Thanks - Niu > Hi Zhen, > Many thanks to your prompt reply. I have disabled the > writhetrhough_cache and read_cache to see the problem but it still > hung thread when there is heavy IO. > May 31 08:38:43 boss33 kernel: LustreError: dumping log to > /tmp/lustre-log.1338424722.5303 > May 31 08:38:43 boss33 kernel: Pid: 5262, comm: ll_ost_io_48 > May 31 08:38:43 boss33 kernel: > May 31 08:38:43 boss33 kernel: Call Trace: > May 31 08:38:43 boss33 kernel: [<ffffffff887bd451>] > ksocknal_queue_tx_locked+0x451/0x490 [ksocklnd] > May 31 08:38:43 boss33 kernel: [<ffffffff800646ac>] __down_read+0x7a/0x92 > May 31 08:38:43 boss33 kernel: [<ffffffff889843df>] > ldiskfs_get_blocks+0x5f/0x2e0 [ldiskfs] > May 31 08:38:43 boss33 kernel: [<ffffffff889851a0>] > ldiskfs_get_block+0xc0/0x120 [ldiskfs] > May 31 08:38:43 boss33 kernel: [<ffffffff88981f60>] > ldiskfs_bmap+0x0/0xf0 [ldiskfs] > May 31 08:38:43 boss33 kernel: [<ffffffff80033615>] > generic_block_bmap+0x37/0x41 > May 31 08:38:43 boss33 kernel: [<ffffffff800341ad>] > mapping_tagged+0x3c/0x47 > May 31 08:38:43 boss33 kernel: [<ffffffff88981f88>] > ldiskfs_bmap+0x28/0xf0 [ldiskfs] > May 31 08:38:43 boss33 kernel: [<ffffffff88981f60>] > ldiskfs_bmap+0x0/0xf0 [ldiskfs] > May 31 08:38:43 boss33 kernel: [<ffffffff88a4a288>] > filter_commitrw_write+0x398/0x2be0 [obdfilter] > May 31 08:38:43 boss33 kernel: [<ffffffff889e6e5c>] > ost_checksum_bulk+0x30c/0x5b0 [ost] > May 31 08:38:43 boss33 kernel: [<ffffffff889e6c38>] > ost_checksum_bulk+0xe8/0x5b0 [ost] > May 31 08:38:43 boss33 kernel: [<ffffffff889edcf9>] > ost_brw_write+0x1c99/0x2480 [ost] > May 31 08:38:43 boss33 kernel: [<ffffffff8872e658>] > ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc] > May 31 08:38:43 boss33 kernel: [<ffffffff886f98b0>] > target_committed_to_req+0x40/0x120 [ptlrpc] > May 31 08:38:43 boss33 kernel: [<ffffffff8008cf93>] > default_wake_function+0x0/0xe > May 31 08:38:43 boss33 kernel: [<ffffffff88732bc8>] > lustre_msg_check_version_v2+0x8/0x20 [ptlrpc] > May 31 08:38:43 boss33 kernel: [<ffffffff889f108e>] > ost_handle+0x2bae/0x55b0 [ost] > May 31 08:38:43 boss33 kernel: [<ffffffff80150d56>] __next_cpu+0x19/0x28 > May 31 08:38:43 boss33 kernel: [<ffffffff800767ae>] > smp_send_reschedule+0x4e/0x53 > May 31 08:38:43 boss33 kernel: [<ffffffff8874215a>] > ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc] > May 31 08:38:43 boss33 kernel: [<ffffffff887428a8>] > ptlrpc_wait_event+0x2d8/0x310 [ptlrpc] > May 31 08:38:43 boss33 kernel: [<ffffffff8008b3bd>] > __wake_up_common+0x3e/0x68 > May 31 08:38:43 boss33 kernel: [<ffffffff88743817>] > ptlrpc_main+0xf37/0x10f0 [ptlrpc] > May 31 08:38:43 boss33 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11 > May 31 08:38:43 boss33 kernel: [<ffffffff887428e0>] > ptlrpc_main+0x0/0x10f0 [ptlrpc] > May 31 08:38:43 boss33 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11 > May 31 08:38:43 boss33 kernel: > May 31 08:38:43 boss33 kernel: LustreError: dumping log to > /tmp/lustre-log.1338424722.5262 > May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow journal > start 48s due to heavy IO load > May 31 08:38:43 boss33 kernel: Lustre: Skipped 1 previous similar message > May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow brw_start > 48s due to heavy IO load > May 31 08:38:43 boss33 kernel: Lustre: Skipped 1 previous similar message > May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow journal > start 187s due to heavy IO load > May 31 08:38:43 boss33 kernel: Lustre: bes3fs-OST0064: slow brw_start > 187s due to heavy IO load > I have not patched the bug because the all servers is online. Could > you know how to deal with it without affecting users? > Thank you very much. > Cheers, > Qiulan > ==================================================================== > Computing center,the Institute of High Energy Physics, China > Huang, Qiulan Tel: (+86) 10 8823 6010-105 > P.O. Box 918-7 Fax: (+86) 10 8823 6839 > Beijing 100049 P.R. China Email: [email protected] > <mailto:[email protected]> > =================================================================== > 2012-05-31 > ------------------------------------------------------------------------ > huangql > ------------------------------------------------------------------------ > *发件人:* Liang Zhen > *发送时间:* 2012-05-30 19:12:15 > *收件人:* huangql > *抄送:* lustre-discuss; wc-discuss > *主题:* [SPAM] Re: [wc-discuss] The ost_connect operation failedwith -16 > Hi, I think you might hit this: > http://jira.whamcloud.com/browse/LU-952 , you can find the patch from > this ticket > > Regards > Liang > > On May 30, 2012, at 11:21 AM, huangql wrote: > >> Dear all, >> >> Recently we found the problem in OSS that some threads might be hung >> when the server got heavy IO load. In this case, some clients will be >> evicted or refused by some OSTs and got the error messages as following: >> >> Server side: >> >> May 30 11:06:31 boss07 kernel: Lustre: Service thread pid 8011 was >> inactive for 200.00s. The thread might be hung, or it might only be >> slow and will resume later. D >> umping the stack trace for debugging purposes: May 30 11:06:31 boss07 >> kernel: Lustre: Skipped 1 previous similar message >> May 30 11:06:31 boss07 kernel: Pid: 8011, comm: ll_ost_71 >> May 30 11:06:31 boss07 kernel: >> May 30 11:06:31 boss07 kernel: Call Trace: >> May 30 11:06:31 boss07 kernel: [<ffffffff886f5d0e>] >> start_this_handle+0x301/0x3cb [jbd2] >> May 30 11:06:31 boss07 kernel: [<ffffffff800a09ca>] >> autoremove_wake_function+0x0/0x2e >> May 30 11:06:31 boss07 kernel: [<ffffffff886f5e83>] >> jbd2_journal_start+0xab/0xdf [jbd2] >> May 30 11:06:31 boss07 kernel: [<ffffffff888ce9b2>] >> fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs] >> May 30 11:06:31 boss07 kernel: [<ffffffff88920551>] >> filter_version_get_check+0x91/0x2a0 [obdfilter] >> May 30 11:06:31 boss07 kernel: [<ffffffff80036cf4>] >> __lookup_hash+0x61/0x12f >> May 30 11:06:31 boss07 kernel: [<ffffffff8893108d>] >> filter_setattr_internal+0x90d/0x1de0 [obdfilter] >> May 30 11:06:31 boss07 kernel: [<ffffffff800e859b>] >> lookup_one_len+0x53/0x61 >> May 30 11:06:31 boss07 kernel: [<ffffffff88925452>] >> filter_fid2dentry+0x512/0x740 [obdfilter] >> May 30 11:06:31 boss07 kernel: [<ffffffff88924e27>] >> filter_fmd_get+0x2b7/0x320 [obdfilter] >> May 30 11:06:31 boss07 kernel: [<ffffffff8003027b>] __up_write+0x27/0xf2 >> May 30 11:06:31 boss07 kernel: [<ffffffff88932721>] >> filter_setattr+0x1c1/0x3b0 [obdfilter] >> May 30 11:06:31 boss07 kernel: [<ffffffff8882677a>] >> lustre_pack_reply_flags+0x86a/0x950 [ptlrpc] >> May 30 11:06:31 boss07 kernel: [<ffffffff8881e658>] >> ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc] >> May 30 11:06:31 boss07 kernel: [<ffffffff88822b05>] >> lustre_msg_get_version+0x35/0xf0 [ptlrpc] >> May 30 11:06:31 boss07 kernel: [<ffffffff888b0abb>] >> ost_handle+0x25db/0x55b0 [ost] >> May 30 11:06:31 boss07 kernel: [<ffffffff80150d56>] __next_cpu+0x19/0x28 >> May 30 11:06:31 boss07 kernel: [<ffffffff800767ae>] >> smp_send_reschedule+0x4e/0x53 >> May 30 11:06:31 boss07 kernel: [<ffffffff8883215a>] >> ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc] >> May 30 11:06:31 boss07 kernel: [<ffffffff888328a8>] >> ptlrpc_wait_event+0x2d8/0x310 [ptlrpc] >> May 30 11:06:31 boss07 kernel: [<ffffffff8008b3bd>] >> __wake_up_common+0x3e/0x68 >> May 30 11:06:31 boss07 kernel: [<ffffffff88833817>] >> ptlrpc_main+0xf37/0x10f0 [ptlrpc] >> May 30 11:06:31 boss07 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11 >> May 30 11:06:31 boss07 kernel: [<ffffffff888328e0>] >> ptlrpc_main+0x0/0x10f0 [ptlrpc] >> May 30 11:06:31 boss07 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11 >> May 30 11:06:31 boss07 kernel: >> May 30 11:06:31 boss07 kernel: LustreError: dumping log to >> /tmp/lustre-log.1338347191.8011 >> >> >> Client side: >> >> May 30 09:58:36 ccopt kernel: LustreError: 11-0: an error occurred >> while communicating with 192.168.50.123@tcp. The ost_connect >> operation failed with -16 >> >> When you got this error message, you failed to run "ls", "df" ,"vi", >> "touch" and so on, which affect us to do anything in the file system. >> I think the ost_connect failure could report some error messages to >> users instead of causing any interactive actions stuck. >> >> Could someone give us some advice or any suggestions to solve this >> problem? >> >> Thank you very much in advance. >> >> >> Best Regards >> Qiulan Huang >> 2012-05-30 >> ==================================================================== >> Computing center,the Institute of High Energy Physics, China >> Huang, Qiulan Tel: (+86) 10 8823 6010-105 >> P.O. Box 918-7 Fax: (+86) 10 8823 6839 >> Beijing 100049 P.R. China Email: [email protected] >> <mailto:[email protected]> >> =================================================================== >> >> >> >
_______________________________________________ Lustre-discuss mailing list [email protected] http://lists.lustre.org/mailman/listinfo/lustre-discuss
