Hello Ashok,
Is the cluster hanging or otherwise behaving badly? The logs below show
that the client lost its connection to 10.148.0.106 for 10 seconds or so.
It should have recovered OK.
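To put a number on how long the connection was down, something like the
following rough Python sketch could be run over the client's syslog. It
assumes each kernel message is on a single unwrapped line (the log quoted
below was wrapped by the mail client) and that the bracketed kernel
timestamp is present. The function name outage_window is only
illustrative; the two sample lines are copied from the quoted log.

```python
import re

# Bracketed kernel timestamp, e.g. "[343138.837233]" (seconds since boot).
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]")
# Message texts as they appear in the quoted client log.
LOST = re.compile(r"Connection to service \S+ via nid \S+ was lost")
RESTORED = re.compile(r"Connection restored to service \S+")


def outage_window(lines):
    """Return (first_lost, last_restored) kernel timestamps in seconds.

    Either value is None if no matching message was seen.
    """
    first_lost = None
    last_restored = None
    for line in lines:
        stamp = STAMP.search(line)
        if stamp is None:
            continue  # not a kernel line with a timestamp
        seconds = float(stamp.group(1))
        if LOST.search(line) and first_lost is None:
            first_lost = seconds
        elif RESTORED.search(line):
            last_restored = seconds
    return first_lost, last_restored


# Two messages from the quoted log, joined back onto single lines.
SAMPLE = [
    "Sep 30 08:40:23 service0 kernel: [343138.837233] Lustre: "
    "lustre-OST0008-osc-ffff880b272cf800: Connection to service "
    "lustre-OST0008 via nid 10.148.0.106@o2ib was lost; in progress "
    "operations using this service will wait for recovery to complete.",
    "Sep 30 08:40:37 service0 kernel: [343152.842631] Lustre: "
    "lustre-OST0003-osc-ffff880b272cf800: Connection restored to service "
    "lustre-OST0003 using nid 10.148.0.106@o2ib.",
]

lost, restored = outage_window(SAMPLE)
print(f"outage of roughly {restored - lost:.1f} s")  # about 14 s here
```

This only measures the gap between the first "was lost" and the last
"Connection restored" message, so it is a lower bound: the timed-out
requests were sent several seconds before the first message was logged.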
If you want further help from the list, you need to add more detail about
the cluster, i.e. a general description of the number of OSSs/OSTs and
clients, the Lustre version, etc., and a description of what is actually
going wrong (hanging, offline, etc.).
The first thing to do is check the infrastructure; in this case, you
should check your IB network for errors.
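As one possible starting point for the IB check, the sketch below reads
the per-port error counters that the Linux InfiniBand stack normally
exposes under /sys/class/infiniband. It assumes that sysfs layout and the
counter file names listed in ERROR_COUNTERS are available on your kernel
and HCA driver; the function name nonzero_ib_errors is only illustrative.
It is worth running on both the client and the OSS at 10.148.0.106.

```python
from pathlib import Path

# Counter file names assumed to exist under .../ports/<n>/counters/.
# Files that are missing or unreadable are skipped rather than treated
# as errors.
ERROR_COUNTERS = (
    "symbol_error",
    "link_error_recovery",
    "link_downed",
    "port_rcv_errors",
    "port_xmit_discards",
)


def nonzero_ib_errors(root="/sys/class/infiniband"):
    """Return {(device, port, counter): value} for counters above zero."""
    found = {}
    for counters_dir in sorted(Path(root).glob("*/ports/*/counters")):
        port = counters_dir.parent.name
        device = counters_dir.parent.parent.parent.name
        for name in ERROR_COUNTERS:
            try:
                value = int((counters_dir / name).read_text().strip())
            except (OSError, ValueError):
                continue  # counter not present or not a plain integer
            if value > 0:
                found[(device, port, name)] = value
    return found


if __name__ == "__main__":
    for (device, port, name), value in nonzero_ib_errors().items():
        print(f"{device} port {port}: {name} = {value}")
```

These counters are cumulative since they were last reset, so a non-zero
value is a hint rather than proof of a current fault. Comparing two
readings taken a few minutes apart shows whether errors are still
accumulating.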
On 30-September-2011 2:39 PM, Ashok nulguda wrote:
Dear All,
I am getting Lustre errors on my HPC cluster, as shown below. Can anyone
please help me resolve this problem?
Thanks in advance.
Sep 30 08:40:23 service0 kernel: [343138.837222] Lustre:
8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 1 previous
similar message
Sep 30 08:40:23 service0 kernel: [343138.837233] Lustre:
lustre-OST0008-osc-ffff880b272cf800: Connection to service
lustre-OST0008 via nid 10.148.0.106@o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Sep 30 08:40:24 service0 kernel: [343139.837260] Lustre:
8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1380984193067288 sent from lustre-OST0006-osc-ffff880b272cf800 to NID
10.148.0.106@o2ib 7s ago has timed out (7s prior to deadline).
Sep 30 08:40:24 service0 kernel: [343139.837263]
req@ffff880a5f800c00 x1380984193067288/t0
o3->[email protected]@o2ib:6/4 lens 448/592 e 0 to 1 dl
1317352224 ref 2 fl Rpc:/0/0 rc 0/0
Sep 30 08:40:24 service0 kernel: [343139.837269] Lustre:
8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 38 previous
similar messages
Sep 30 08:40:24 service0 kernel: [343140.129284] LustreError:
9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from
cancel RPC: canceling anyway
Sep 30 08:40:24 service0 kernel: [343140.129290] LustreError:
9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Skipped 1 previous
similar message
Sep 30 08:40:24 service0 kernel: [343140.129295] LustreError:
9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list())
ldlm_cli_cancel_list: -11
Sep 30 08:40:24 service0 kernel: [343140.129299] LustreError:
9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) Skipped 1 previous
similar message
Sep 30 08:40:25 service0 kernel: [343140.837308] Lustre:
8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1380984193067299 sent from lustre-OST0010-osc-ffff880b272cf800 to NID
10.148.0.106@o2ib 7s ago has timed out (7s prior to deadline).
Sep 30 08:40:25 service0 kernel: [343140.837311]
req@ffff880a557c4400 x1380984193067299/t0
o3->[email protected]@o2ib:6/4 lens 448/592 e 0 to 1 dl
1317352225 ref 2 fl Rpc:/0/0 rc 0/0
Sep 30 08:40:25 service0 kernel: [343140.837316] Lustre:
8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 4 previous
similar messages
Sep 30 08:40:26 service0 kernel: [343141.245365] LustreError:
30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from
cancel RPC: canceling anyway
Sep 30 08:40:26 service0 kernel: [343141.245371] LustreError:
22729:0:(ldlm_request.c:1587:ldlm_cli_cancel_list())
ldlm_cli_cancel_list: -11
Sep 30 08:40:26 service0 kernel: [343141.245378] LustreError:
30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Skipped 1 previous
similar message
Sep 30 08:40:33 service0 kernel: [343148.245683] Lustre:
22725:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1380984193067302 sent from lustre-OST0004-osc-ffff880b272cf800 to NID
10.148.0.106@o2ib 14s ago has timed out (14s prior to deadline).
Sep 30 08:40:33 service0 kernel: [343148.245686]
req@ffff8805c879e800 x1380984193067302/t0
o103->[email protected]@o2ib:17/18 lens 296/384 e 0 to
1 dl 1317352233 ref 1 fl Rpc:N/0/0 rc 0/0
Sep 30 08:40:33 service0 kernel: [343148.245692] Lustre:
22725:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 2 previous
similar messages
Sep 30 08:40:33 service0 kernel: [343148.245708] LustreError:
22725:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from
cancel RPC: canceling anyway
Sep 30 08:40:33 service0 kernel: [343148.245714] LustreError:
22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list())
ldlm_cli_cancel_list: -11
Sep 30 08:40:33 service0 kernel: [343148.245717] LustreError:
22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) Skipped 1
previous similar message
Sep 30 08:40:36 service0 kernel: [343151.548005] LustreError: 11-0: an
error occurred while communicating with 10.148.0.106@o2ib. The
ost_connect operation failed with -16
Sep 30 08:40:36 service0 kernel: [343151.548008] LustreError: Skipped
1 previous similar message
Sep 30 08:40:36 service0 kernel: [343151.548024] LustreError: 167-0:
This client was evicted by lustre-OST000b; in progress operations
using this service will fail.
Sep 30 08:40:36 service0 kernel: [343151.548250] LustreError:
30452:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
Sep 30 08:40:36 service0 kernel: [343151.550210] LustreError:
8300:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID
req@ffff88049528c400 x1380984193067406/t0
o3->[email protected]@o2ib:6/4 lens 448/592 e 0 to 1 dl
0 ref 2 fl Rpc:/0/0 rc 0/0
Sep 30 08:40:36 service0 kernel: [343151.594742] Lustre:
lustre-OST0000-osc-ffff880b272cf800: Connection restored to service
lustre-OST0000 using nid 10.148.0.106@o2ib.
Sep 30 08:40:36 service0 kernel: [343151.837203] Lustre:
lustre-OST0006-osc-ffff880b272cf800: Connection restored to service
lustre-OST0006 using nid 10.148.0.106@o2ib.
Sep 30 08:40:37 service0 kernel: [343152.842631] Lustre:
lustre-OST0003-osc-ffff880b272cf800: Connection restored to service
lustre-OST0003 using nid 10.148.0.106@o2ib.
Sep 30 08:40:37 service0 kernel: [343152.842636] Lustre: Skipped 3
previous similar messages
Thanks and Regards
Ashok
--
Ashok Nulguda
TATA ELXSI LTD
Mb: +91 9689945767
Email: [email protected]
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
--
Brian O'Connor
-------------------------------------------------
SGI Consulting
Email: [email protected], Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124
AUSTRALIA http://www.sgi.com/support/services
-------------------------------------------------