Hi, dear list,

Our login nodes have been crashing frequently these days. The symptom is that 
we cannot access the nodes remotely, although we can still access them from 
the terminal. On an affected node we ran the command:
ps -ef | grep ldlm

root      3823     1  0 10:57 ?        00:00:00 [ldlm_bl_00]
root      3824     1  0 10:57 ?        00:00:00 [ldlm_bl_01]
root      3825     1  0 10:57 ?        00:00:00 [ldlm_bl_02]
root      3826     1  0 10:57 ?        00:00:00 [ldlm_bl_03]
root      3827     1  0 10:57 ?        00:00:00 [ldlm_bl_04]
root      3828     1  0 10:57 ?        00:00:00 [ldlm_bl_05]
root      3829     1  0 10:57 ?        00:00:00 [ldlm_bl_06]
root      3830     1  0 10:57 ?        00:00:00 [ldlm_bl_07]
root      3831     1  0 10:57 ?        00:00:00 [ldlm_cn_00]
root      3832     1  0 10:57 ?        00:00:00 [ldlm_cn_01]
root      3834     1  0 10:57 ?        00:00:00 [ldlm_cn_02]
root      3835     1  0 10:57 ?        00:00:00 [ldlm_cn_03]
root      3836     1  0 10:57 ?        00:00:00 [ldlm_cn_04]
root      3837     1  0 10:57 ?        00:00:00 [ldlm_cn_05]
root      3838     1  0 10:57 ?        00:00:00 [ldlm_cn_06]
root      3839     1  0 10:57 ?        00:00:00 [ldlm_cn_07]
root      3840     1  0 10:57 ?        00:00:00 [ldlm_cb_00]
root      3841     1  0 10:57 ?        00:00:00 [ldlm_cb_01]
root      3842     1  0 10:57 ?        00:00:00 [ldlm_cb_02]
root      3843     1  0 10:57 ?        00:00:00 [ldlm_cb_03]
root      3844     1  0 10:57 ?        00:00:00 [ldlm_cb_04]
root      3845     1  0 10:57 ?        00:00:00 [ldlm_cb_05]
root      3846     1  0 10:57 ?        00:00:00 [ldlm_cb_06]
root      3847     1  0 10:57 ?        00:00:00 [ldlm_cb_07]
.
.
.
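
Since the full listing is long, a per-pool tally may be easier to compare 
between nodes. The following is only a sketch: count_ldlm_threads is a 
hypothetical helper name (not part of Lustre), and it assumes the thread names 
keep the [ldlm_XX_NN] form shown in the listing above.

```shell
# Tally Lustre ldlm kernel threads by pool prefix (ldlm_bl, ldlm_cn, ldlm_cb, ...).
# Reads `ps -ef`-style lines on stdin.
# count_ldlm_threads is a hypothetical helper name, not a Lustre tool.
count_ldlm_threads() {
    # Pull out the bracketed thread name, drop the trailing _NN index,
    # then count how many threads share each prefix.
    grep -o '\[ldlm_[a-z]*_[0-9]*\]' \
        | sed -e 's/^\[//' -e 's/_[0-9]*\]$//' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Usage on a live node, with the load average for comparison:
#   ps -ef | count_ldlm_threads
#   cat /proc/loadavg
```

Each output line is a pool prefix followed by its thread count, so a pool that 
has grown to hundreds of threads stands out immediately.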


We see many ldlm processes like these; there are hundreds of them. As a 
result, the load average is too high (155.0 165.0 145.0) for the node to work 
normally. We have no idea what causes this, and so far we have had to restart 
the nodes. The log from the same period is included at the end of this message.

Filesystem versions:

Server: Lustre 1.6.6
Client: Lustre 1.6.5

Has anyone else seen the same problem? 
I would appreciate any help!


Nov  6 09:24:10 lxslc22 kernel: LustreError: 
30586:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: 
canceling anyway
Nov  6 09:24:10 lxslc22 kernel: LustreError: 
30586:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 10 previous similar 
messages
Nov  6 09:24:10 lxslc22 kernel: LustreError: 
30586:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
                                                                                
                                                12745,1       92%
Nov  6 09:24:10 lxslc22 kernel: LustreError: 
30586:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 10 previous similar 
messages
Nov  6 09:24:10 lxslc22 kernel: LustreError: 
30586:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Nov  6 09:24:10 lxslc22 kernel: LustreError: 
30586:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 10 previous 
similar messages
Nov  6 09:24:14 lxslc22 hm[4390]: Server went down, finding new server.
Nov  6 09:24:49 lxslc22 last message repeated 7 times
Nov  6 09:25:04 lxslc22 last message repeated 3 times
Nov  6 09:25:08 lxslc22 kernel: Lustre: Request x1111842 sent from 
mgc192.168.50...@tcp to NID 192.168.50...@tcp 500s ago has timed out (limit 
500s).
Nov  6 09:25:08 lxslc22 kernel: Lustre: Skipped 29 previous similar messages
Nov  6 09:25:09 lxslc22 hm[4390]: Server went down, finding new server.
Nov  6 09:25:44 lxslc22 last message repeated 7 times
Nov  6 09:26:49 lxslc22 last message repeated 13 times
Nov  6 09:27:09 lxslc22 last message repeated 4 times
Nov  6 09:27:13 lxslc22 kernel: Lustre: 
3728:0:(import.c:395:import_select_connection()) besfs-MDT0000-mdc-f7e14200: 
tried all connections, increasing latency to 51s
Nov  6 09:27:13 lxslc22 kernel: Lustre: 
3728:0:(import.c:395:import_select_connection()) Skipped 17 previous similar 
messages
Nov  6 09:27:14 lxslc22 hm[4390]: Server went down, finding new server.
Nov  6 09:27:49 lxslc22 last message repeated 7 times
Nov  6 09:28:54 lxslc22 last message repeated 13 times
Nov  6 09:29:59 lxslc22 last message repeated 13 times
Nov  6 09:31:04 lxslc22 last message repeated 13 times
Nov  6 09:32:09 lxslc22 last message repeated 13 times
Nov  6 09:33:14 lxslc22 last message repeated 13 times
Nov  6 09:34:19 lxslc22 last message repeated 13 times
Nov  6 09:34:54 lxslc22 last message repeated 7 times
Nov  6 09:34:55 lxslc22 kernel: LustreError: 
30521:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: 
canceling anyway
Nov  6 09:34:55 lxslc22 kernel: LustreError: 
30521:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 10 previous similar 
messages
Nov  6 09:34:55 lxslc22 kernel: LustreError: 
30521:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Nov  6 09:34:55 lxslc22 kernel: LustreError: 
30521:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 10 previous 
similar messages
Nov  6 10:02:38 lxslc22 kernel: Lustre: Request x1113069 sent from 
besfs-OST0008-osc-f7e14200 to NID 192.168.50...@tcp 500s ago has timed out 
(limit 500s).
Nov  6 10:02:38 lxslc22 kernel: Lustre: Skipped 18 previous similar messages
Nov  6 10:02:39 lxslc22 hm[4390]: Server went down, finding new server.
Nov  6 10:03:14 lxslc22 last message repeated 7 times
Nov  6 10:04:19 lxslc22 last message repeated 13 times
Nov  6 10:04:39 lxslc22 last message repeated 4 times
Nov  6 10:04:43 lxslc22 kernel: Lustre: 
3728:0:(import.c:395:import_select_connection()) besfs-MDT0000-mdc-f7e14200: 
tried all connections, increasing latency to 51s
Nov  6 10:04:43 lxslc22 kernel: Lustre: 
3728:0:(import.c:395:import_select_connection()) Skipped 34 previous similar 
messages
Nov  6 10:04:44 lxslc22 hm[4390]: Server went down, finding new server.
Nov  6 10:05:19 lxslc22 last message repeated 7 times
Nov  6 10:06:24 lxslc22 last message repeated 13 times
Nov  6 10:06:44 lxslc22 last message repeated 4 times
Nov  6 10:06:46 lxslc22 kernel: LustreError: 
30582:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: 
canceling anyway
Nov  6 10:06:46 lxslc22 kernel: LustreError: 
30582:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 9 previous similar 
messages
Nov  6 10:06:46 lxslc22 kernel: LustreError: 
30582:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Nov  6 10:06:46 lxslc22 kernel: LustreError: 
30582:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 9 previous similar 
messages
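
To see which Lustre code paths dominate an excerpt like the one above, the 
"(file.c:line:function())" tags can be counted. This is a rough sketch only: 
tally_lustre_sites is a hypothetical helper name, and the counts cover logged 
lines only (including the "Skipped N previous similar messages" lines), not 
the suppressed messages themselves.

```shell
# Count Lustre kernel messages per source location in a syslog excerpt.
# Reads log text on stdin.
# tally_lustre_sites is a hypothetical helper name, not a Lustre tool.
tally_lustre_sites() {
    # Match the "(file.c:line:function())" tag that appears in each message,
    # count identical tags, and list the most frequent first.
    grep -o '([a-z_]*\.c:[0-9]*:[a-z_]*())' \
        | sort \
        | uniq -c \
        | sort -rn \
        | awk '{ print $1, $2 }'
}

# Usage (the log path is an assumption; adjust it for your syslog setup):
#   tally_lustre_sites < /var/log/messages
```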


Thanks,
Sarea

2009-11-06 



huangql 
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
