On Sep 25, 2007 15:21 +0200, Niklas Edmundsson wrote:
> Both OSSes were locking up intermittently after this; it all ended
> with having to do a hard reboot. Taking it slowly and mounting one
> OST, waiting, then another one, and waiting again until we had all
> four mounted resulted in a working Lustre backend.
>
> We got tons of soft lockups and call traces. Here is the first one
> from each OSS (with some interesting log lines from the first OSS added):
> -------------------8<------------------------
> [ 1376.614228] Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 6802: it was inactive for 100s
>
> [ 1388.238268] LustreError: 6877:0:(events.c:55:request_out_callback()) @@@ type 4, status -5 [EMAIL PROTECTED] x38/t0 o104->@NET_0x2000082ef4e80_UUID:15 lens 232/128 ref 2 fl Rpc:/0/0 rc 0/-22
> [ 1388.256244] LustreError: 6877:0:(events.c:55:request_out_callback()) Skipped 1158333 previous similar messages
> [ 1394.804079] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request()) @@@ network error (sent at 1190712998, 0s ago) [EMAIL PROTECTED] x177/t0 o104->@NET_0x2000082ef4e8e_UUID:15 lens 232/128 ref 1 fl Rpc:/0/0 rc 0/-22
> [ 1394.825111] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request()) Skipped 1959868 previous similar messages
> [ 1395.921898] BUG: soft lockup detected on CPU#0!

I can imagine that if you are getting 2M requests timing out, the CPU
will be busy.
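For anyone wondering how an expiry pass turns into that "BUG: soft
lockup detected on CPU#0!" line: the sketch below is a rough
kernel-style illustration, with invented names, not the actual ptlrpc
request-set code. If one kernel thread walks a set with ~2M expired
requests and never gives the scheduler a chance, it can hold the CPU
past the soft-lockup threshold and the detector fires.

        #include <linux/list.h>
        #include <linux/sched.h>        /* cond_resched() */

        /* Hypothetical illustration only -- not the real ptlrpc
         * request set.  Walking millions of expired requests inside
         * one kernel thread keeps that CPU busy the whole time. */
        struct fake_request {
                struct list_head fr_link;
                int              fr_expired;
        };

        static void expire_set(struct list_head *set)
        {
                struct fake_request *req;

                list_for_each_entry(req, set, fr_link) {
                        if (req->fr_expired)
                                continue;
                        /* Per-request expiry work: logging, state
                         * cleanup, callbacks, and so on. */
                        req->fr_expired = 1;
                        /* Yielding here keeps the soft-lockup
                         * detector quiet even when the set is huge. */
                        cond_resched();
                }
        }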
> [ 1396.080527] [<ffffffff884bf4a9>] :ptlrpc:ptlrpc_set_next_timeout+0x19/0x170
> [ 1396.105652] [<ffffffff884980fb>] :ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
> [ 1396.113405] [<ffffffff88499133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290
> [ 1396.120369] [<ffffffff884ac5db>] :ptlrpc:ldlm_extent_compat_queue+0x5ab/0x720
>
> [ 1362.348687] [<ffffffff884c25ed>] :ptlrpc:ptlrpc_expire_one_request+0xad/0x390
> [ 1362.389284] [<ffffffff884960fb>] :ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
> [ 1362.397318] [<ffffffff88497133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290

Looks like something related to the parallel blocking callback code.
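In case the term is unfamiliar: "parallel blocking callback" means the
blocking ASTs for all conflicting locks are queued into a single RPC
set and sent concurrently (that is the ldlm_run_bl_ast_work ->
ldlm_send_and_maybe_create_set path in the traces above). The sketch
below is purely illustrative userspace C with invented names, not the
real ldlm interfaces; it only shows why an unreachable peer makes the
whole batch fail at once, which then has to be walked in one expiry pass.

        #include <stdlib.h>

        /* Invented names for illustration -- not the real ldlm or
         * ptlrpc interfaces. */
        struct blocking_ast { int ba_lock_id; int ba_status; };

        struct rpc_set {
                struct blocking_ast *rs_asts;
                size_t               rs_nr;
        };

        /* One blocking-callback RPC per conflicting lock, all in
         * flight at once rather than sent and waited for one by one. */
        static struct rpc_set *send_blocking_asts(size_t nlocks,
                                                  int peer_up)
        {
                struct rpc_set *set = malloc(sizeof(*set));
                size_t i;

                set->rs_asts = calloc(nlocks, sizeof(*set->rs_asts));
                set->rs_nr = nlocks;
                for (i = 0; i < nlocks; i++) {
                        set->rs_asts[i].ba_lock_id = (int)i;
                        /* A dead peer fails the whole batch at once
                         * (status -5, as in the request_out_callback
                         * errors above), so the timeout pass must
                         * then expire every one of them. */
                        set->rs_asts[i].ba_status = peer_up ? 0 : -5;
                }
                return set;
        }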
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.