On Sep 25, 2007  15:21 +0200, Niklas Edmundsson wrote:
> Both OSS's were really locking up intermittently after this, it all 
> ended with having to do a hard reboot. Taking it slow and leisurely 
> mounting one OST, then wait, another one, then wait until we had all 
> four mounted resulted in a working lustre backend.
> 
> We got tons of soft lockups and call traces; here is the first one 
> from each OSS (with some added interesting log lines from the first OSS):
> -------------------8<------------------------
> [ 1376.614228] Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for 
> pid 6802: it was inactive for 100s
> 
> [ 1388.238268] LustreError: 6877:0:(events.c:55:request_out_callback()) @@@ 
> type 4, status -5  [EMAIL PROTECTED] x38/t0 
> o104->@NET_0x2000082ef4e80_UUID:15 lens 232/128 ref 2 fl Rpc:/0/0 rc 0/-22
> [ 1388.256244] LustreError: 6877:0:(events.c:55:request_out_callback()) 
> Skipped 1158333 previous similar messages
> [ 1394.804079] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request()) 
> @@@ network error (sent at 1190712998, 0s ago)  [EMAIL PROTECTED] x177/t0 
> o104->@NET_0x2000082ef4e8e_UUID:15 lens 232/128 ref 1 fl Rpc:/0/0 rc 0/-22
> [ 1394.825111] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request()) 
> Skipped 1959868 previous similar messages
> [ 1395.921898] BUG: soft lockup detected on CPU#0!

I can imagine that if you are getting ~2M requests timing out, the CPU
will be busy.

> [ 1396.080527]  [<ffffffff884bf4a9>] 
> :ptlrpc:ptlrpc_set_next_timeout+0x19/0x170
> [ 1396.105652]  [<ffffffff884980fb>] 
> :ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
> [ 1396.113405]  [<ffffffff88499133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290
> [ 1396.120369]  [<ffffffff884ac5db>] 
> :ptlrpc:ldlm_extent_compat_queue+0x5ab/0x720

> [ 1362.348687]  [<ffffffff884c25ed>] 
> :ptlrpc:ptlrpc_expire_one_request+0xad/0x390
> [ 1362.389284]  [<ffffffff884960fb>] 
> :ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
> [ 1362.397318]  [<ffffffff88497133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290

This looks related to the parallel blocking-callback code
(ldlm_run_bl_ast_work in the traces above).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
