Hello Ashok,

Is the cluster hanging or otherwise behaving badly? The logs below show that the client lost its connection to 10.148.0.106 for 10 seconds or so. It should have recovered OK.

If you want further help from the list, you need to add more detail about the cluster, i.e. a general description of the number of OSSs/OSTs and clients, the version of Lustre, etc., and a description
of what is actually going wrong, e.g. hanging, offline, etc.

The first thing is to check the infrastructure: in this case, you should check your IB network for errors.
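
Before digging into the network, it can help to confirm that every problem in the client log points at the same server NID. The following is only a rough sketch, assuming the message wording matches the log quoted below; the function name summarize and the regular expressions are my own and are not part of any Lustre tool.

```python
import re
from collections import Counter

# A NID as it appears in the quoted log, e.g. 10.148.0.106@o2ib
NID = r"(\d+\.\d+\.\d+\.\d+@\w+)"

# Patterns are assumptions derived from the wording of the quoted messages.
PATTERNS = {
    "lost": re.compile(r"Connection to service \S+ via nid " + NID + r" was lost"),
    "restored": re.compile(r"Connection restored to service \S+ using nid " + NID),
    "timeout": re.compile(r"to NID " + NID + r" \d+s ago has timed out"),
}
EVICTED = re.compile(r"This client was evicted by (\S+);")


def summarize(lines):
    """Count lost/restored/timeout events per NID and list evicting targets."""
    events = Counter()
    evicted_by = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                events[(name, match.group(1))] += 1
        match = EVICTED.search(line)
        if match:
            evicted_by.append(match.group(1))
    return events, evicted_by
```

Feed it the lines of the client's syslog file (wherever your distribution keeps kernel messages). If all of the events name one NID, that server and the IB path to it are the first things to look at.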



On 30-September-2011 2:39 PM, Ashok nulguda wrote:
Dear All,

I am getting Lustre errors on my HPC cluster, as shown below. Can anyone please help me resolve this problem?
Thanks in advance.
Sep 30 08:40:23 service0 kernel: [343138.837222] Lustre: 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 30 08:40:23 service0 kernel: [343138.837233] Lustre: lustre-OST0008-osc-ffff880b272cf800: Connection to service lustre-OST0008 via nid 10.148.0.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 30 08:40:24 service0 kernel: [343139.837260] Lustre: 8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380984193067288 sent from lustre-OST0006-osc-ffff880b272cf800 to NID 10.148.0.106@o2ib 7s ago has timed out (7s prior to deadline).
Sep 30 08:40:24 service0 kernel: [343139.837263] req@ffff880a5f800c00 x1380984193067288/t0 o3->[email protected]@o2ib:6/4 lens 448/592 e 0 to 1 dl 1317352224 ref 2 fl Rpc:/0/0 rc 0/0
Sep 30 08:40:24 service0 kernel: [343139.837269] Lustre: 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 38 previous similar messages
Sep 30 08:40:24 service0 kernel: [343140.129284] LustreError: 9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Sep 30 08:40:24 service0 kernel: [343140.129290] LustreError: 9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Skipped 1 previous similar message
Sep 30 08:40:24 service0 kernel: [343140.129295] LustreError: 9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Sep 30 08:40:24 service0 kernel: [343140.129299] LustreError: 9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) Skipped 1 previous similar message
Sep 30 08:40:25 service0 kernel: [343140.837308] Lustre: 8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380984193067299 sent from lustre-OST0010-osc-ffff880b272cf800 to NID 10.148.0.106@o2ib 7s ago has timed out (7s prior to deadline).
Sep 30 08:40:25 service0 kernel: [343140.837311] req@ffff880a557c4400 x1380984193067299/t0 o3->[email protected]@o2ib:6/4 lens 448/592 e 0 to 1 dl 1317352225 ref 2 fl Rpc:/0/0 rc 0/0
Sep 30 08:40:25 service0 kernel: [343140.837316] Lustre: 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Sep 30 08:40:26 service0 kernel: [343141.245365] LustreError: 30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Sep 30 08:40:26 service0 kernel: [343141.245371] LustreError: 22729:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Sep 30 08:40:26 service0 kernel: [343141.245378] LustreError: 30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Skipped 1 previous similar message
Sep 30 08:40:33 service0 kernel: [343148.245683] Lustre: 22725:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380984193067302 sent from lustre-OST0004-osc-ffff880b272cf800 to NID 10.148.0.106@o2ib 14s ago has timed out (14s prior to deadline).
Sep 30 08:40:33 service0 kernel: [343148.245686] req@ffff8805c879e800 x1380984193067302/t0 o103->[email protected]@o2ib:17/18 lens 296/384 e 0 to 1 dl 1317352233 ref 1 fl Rpc:N/0/0 rc 0/0
Sep 30 08:40:33 service0 kernel: [343148.245692] Lustre: 22725:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Sep 30 08:40:33 service0 kernel: [343148.245708] LustreError: 22725:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Sep 30 08:40:33 service0 kernel: [343148.245714] LustreError: 22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Sep 30 08:40:33 service0 kernel: [343148.245717] LustreError: 22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) Skipped 1 previous similar message
Sep 30 08:40:36 service0 kernel: [343151.548005] LustreError: 11-0: an error occurred while communicating with 10.148.0.106@o2ib. The ost_connect operation failed with -16
Sep 30 08:40:36 service0 kernel: [343151.548008] LustreError: Skipped 1 previous similar message
Sep 30 08:40:36 service0 kernel: [343151.548024] LustreError: 167-0: This client was evicted by lustre-OST000b; in progress operations using this service will fail.
Sep 30 08:40:36 service0 kernel: [343151.548250] LustreError: 30452:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
Sep 30 08:40:36 service0 kernel: [343151.550210] LustreError: 8300:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff88049528c400 x1380984193067406/t0 o3->[email protected]@o2ib:6/4 lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
Sep 30 08:40:36 service0 kernel: [343151.594742] Lustre: lustre-OST0000-osc-ffff880b272cf800: Connection restored to service lustre-OST0000 using nid 10.148.0.106@o2ib.
Sep 30 08:40:36 service0 kernel: [343151.837203] Lustre: lustre-OST0006-osc-ffff880b272cf800: Connection restored to service lustre-OST0006 using nid 10.148.0.106@o2ib.
Sep 30 08:40:37 service0 kernel: [343152.842631] Lustre: lustre-OST0003-osc-ffff880b272cf800: Connection restored to service lustre-OST0003 using nid 10.148.0.106@o2ib.
Sep 30 08:40:37 service0 kernel: [343152.842636] Lustre: Skipped 3 previous similar messages


Thanks and Regards
Ashok

--
Ashok Nulguda
TATA ELXSI LTD
Mb : +91 9689945767
Email : [email protected]



_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss


--
Brian O'Connor
-------------------------------------------------
SGI Consulting
Email: [email protected], Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124
AUSTRALIA http://www.sgi.com/support/services
-------------------------------------------------


