More information: The frequency of these errors was dramatically reduced by changing /proc/fs/lustre/osc/fdfs-OST000[0-3]-osc/max_rpcs_in_flight from 8 to 32.
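For reference, the change was applied along these lines (a rough sketch using the lctl set_param/get_param syntax available in 1.8; the fdfs fsname and OST indexes are from our setup, and paths may differ on other versions):

    # Raise the per-OSC RPC concurrency on a client (in-memory only;
    # reverts when the client remounts):
    lctl set_param osc.fdfs-OST*.max_rpcs_in_flight=32

    # Equivalently, write the /proc files directly:
    for f in /proc/fs/lustre/osc/fdfs-OST000[0-3]-osc/max_rpcs_in_flight; do
        echo 32 > "$f"
    done

    # The rpc_stats histogram shows how many RPCs are actually in flight
    # per OSC, which should indicate whether raising the limit further
    # would even be used:
    cat /proc/fs/lustre/osc/fdfs-OST000[0-3]-osc/rpc_stats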
Processor, memory, and disk I/O on the servers are not high; is there a reason not to increase max_rpcs_in_flight from 32 to 48 or 64? Is there a limit on how high I can set this value?

Best regards,
Aaron

On Tue, May 17, 2011 at 8:13 PM, Aaron Everett <[email protected]> wrote:
> Hi all,
>
> We've been running Lustre 1.6.6 for several years and are deploying 1.8.5
> on some new hardware. When under load we've been seeing random kernel panics
> on many of the clients. We are running 2.6.18-194.17.1.el5_lustre.1.8.5 on
> the servers (shared MDT/MGS, and 4 OSTs). We have patchless clients
> running 2.6.18-238.9.1.el5 (all CentOS).
>
> On the MDT, the following is logged in /var/log/messages:
>
> May 17 16:46:44 lustre-mdt-00 kernel: Lustre:
> 5878:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1368993021040034 sent from fdfs-MDT0000 to NID 172.16.14.219@tcp 7s ago
> has timed out (7s prior to deadline).
> May 17 16:46:44 lustre-mdt-00 kernel:
> req@ffff8105f140b800 x1368993021040034/t0
> o104->@NET_0x20000ac100edb_UUID:15/16 lens 296/384 e 0
> to 1 dl 1305665204 ref 1 fl Rpc:N/0/0 rc 0/0
> May 17 16:46:44 lustre-mdt-00 kernel: Lustre:
> 5878:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 39 previous
> similar messages
> May 17 16:46:44 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
> client on nid 172.16.14.219@tcp was evicted due to a lock blocking
> callback to 172.16.14.219@tcp timed out: rc -107
> May 17 16:46:52 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
> client on nid 172.16.14.225@tcp was evicted due to a lock blocking
> callback to 172.16.14.225@tcp timed out: rc -107
> May 17 16:46:52 lustre-mdt-00 kernel: LustreError:
> 6227:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
> req@ffff81181ccb6800 x1368993021041016/t0
> o104->@NET_0x20000ac100ee1_UUID:15/16 lens 296/384 e 0 to 1 dl 0 ref 1 fl
> Rpc:N/0/0 rc 0/0
> May 17 16:46:52 lustre-mdt-00 kernel: LustreError:
> 6227:0:(ldlm_lockd.c:607:ldlm_handle_ast_error()) ### client (nid
> 172.16.14.225@tcp) returned 0 from blocking AST ns: mds-fdfs-MDT0000_UUID
> lock: ffff81169f590a00/0x767f56e4ad136f72 lrc: 4/0,0 mode: CR/CR res:
> 35202584/110090815 bits 0x3 rrc: 25 type: IBT flags: 0x4000020 remote:
> 0x364122c82e3aca01 expref: 229900 pid: 6310 timeout: 4386580591
> May 17 16:46:59 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
> client on nid 172.16.14.230@tcp was evicted due to a lock blocking
> callback to 172.16.14.230@tcp timed out: rc -107
> May 17 16:46:59 lustre-mdt-00 kernel: LustreError: Skipped 6 previous
> similar messages
> May 17 16:47:07 lustre-mdt-00 kernel: Lustre:
> 6688:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1368993021041492 sent from fdfs-MDT0000 to NID 172.16.14.229@tcp 7s ago
> has timed out (7s prior to deadline).
> May 17 16:47:07 lustre-mdt-00 kernel:
> req@ffff81093052b000 x1368993021041492/t0
> o104->@NET_0x20000ac100ee5_UUID:15/16 lens 296/384 e 0
> to 1 dl 1305665227 ref 1 fl Rpc:N/0/0 rc 0/0
> May 17 16:47:07 lustre-mdt-00 kernel: Lustre:
> 6688:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 8 previous
> similar messages
> May 17 16:47:07 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
> client on nid 172.16.14.229@tcp was evicted due to a lock blocking
> callback to 172.16.14.229@tcp timed out: rc -107
> May 17 16:47:07 lustre-mdt-00 kernel: LustreError: Skipped 8 previous
> similar messages
> May 17 16:50:16 lustre-mdt-00 kernel: Lustre: MGS: haven't heard from
> client c8e311a5-f1d6-7197-1021-c5a02c1c5b14 (at 172.16.14.230@tcp) in 228
> seconds. I think it's dead, and I am evicting it.
>
> On the clients, there is a kernel panic, with the following message on the
> screen:
>
> Code: 48 89 08 31 c9 48 89 12 48 89 52 08 ba 01 00 00 00 83 83 10
> RIP [<ffffffff8891ddcd>] :mdc:mdc_exit_request+0x6d/0xb0
> RSP <ffff81028c137858>
> CR2: 0000000000003877
> <0>Kernel panic - not syncing: Fatal exception
>
> We're running the same set of jobs on both the 1.6.6 Lustre filesystem and
> the 1.8.5 Lustre filesystem. Only the 1.8.5 clients crash; the 1.6.6 clients
> that are also using the new servers never exhibit this issue. I'm assuming
> there is a setting on the 1.8.5 clients that needs to be adjusted, but I'm
> searching for help.
>
> Best regards,
> Aaron
>
