Hi all,

We've been running Lustre 1.6.6 for several years and are deploying 1.8.5 on
some new hardware. Under load we've been seeing random kernel panics on
many of the clients. We are running 2.6.18-194.17.1.el5_lustre.1.8.5 on the
servers (shared MDT/MGS, and 4 OSTs). We have patchless clients
running 2.6.18-238.9.1.el5 (all CentOS).
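
For reference, here's how we confirm what each node is running (the /proc
path is the 1.8-era location; lctl get_param version reports the same thing):

  uname -r                      # kernel, e.g. 2.6.18-238.9.1.el5 on the clients
  cat /proc/fs/lustre/version   # Lustre version string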

On the MDT, the following is logged in /var/log/messages:

May 17 16:46:44 lustre-mdt-00 kernel: Lustre:
5878:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1368993021040034 sent from fdfs-MDT0000 to NID 172.16.14.219@tcp 7s ago has
timed out (7s prior to deadline).
May 17 16:46:44 lustre-mdt-00 kernel:
req@ffff8105f140b800 x1368993021040034/t0
o104->@NET_0x20000ac100edb_UUID:15/16 lens 296/384 e 0
to 1 dl 1305665204 ref 1 fl Rpc:N/0/0 rc 0/0
May 17 16:46:44 lustre-mdt-00 kernel: Lustre:
5878:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 39 previous
similar messages
May 17 16:46:44 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
client on nid 172.16.14.219@tcp was evicted due to a lock blocking callback
to 172.16.14.219@tcp timed out: rc -107
May 17 16:46:52 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
client on nid 172.16.14.225@tcp was evicted due to a lock blocking callback
to 172.16.14.225@tcp timed out: rc -107
May 17 16:46:52 lustre-mdt-00 kernel: LustreError:
6227:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
req@ffff81181ccb6800 x1368993021041016/t0
o104->@NET_0x20000ac100ee1_UUID:15/16 lens 296/384 e 0 to 1 dl 0 ref 1 fl
Rpc:N/0/0 rc 0/0
May 17 16:46:52 lustre-mdt-00 kernel: LustreError:
6227:0:(ldlm_lockd.c:607:ldlm_handle_ast_error()) ### client (nid
172.16.14.225@tcp) returned 0 from blocking AST ns: mds-fdfs-MDT0000_UUID
lock: ffff81169f590a00/0x767f56e4ad136f72 lrc: 4/0,0 mode: CR/CR res:
35202584/110090815 bits 0x3 rrc: 25 type: IBT flags: 0x4000020 remote:
0x364122c82e3aca01 expref: 229900 pid: 6310 timeout: 4386580591
May 17 16:46:59 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
client on nid 172.16.14.230@tcp was evicted due to a lock blocking callback
to 172.16.14.230@tcp timed out: rc -107
May 17 16:46:59 lustre-mdt-00 kernel: LustreError: Skipped 6 previous
similar messages
May 17 16:47:07 lustre-mdt-00 kernel: Lustre:
6688:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1368993021041492 sent from fdfs-MDT0000 to NID 172.16.14.229@tcp 7s ago has
timed out (7s prior to deadline).
May 17 16:47:07 lustre-mdt-00 kernel:
req@ffff81093052b000 x1368993021041492/t0
o104->@NET_0x20000ac100ee5_UUID:15/16 lens 296/384 e 0
to 1 dl 1305665227 ref 1 fl Rpc:N/0/0 rc 0/0
May 17 16:47:07 lustre-mdt-00 kernel: Lustre:
6688:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 8 previous
similar messages
May 17 16:47:07 lustre-mdt-00 kernel: LustreError: 138-a: fdfs-MDT0000: A
client on nid 172.16.14.229@tcp was evicted due to a lock blocking callback
to 172.16.14.229@tcp timed out: rc -107
May 17 16:47:07 lustre-mdt-00 kernel: LustreError: Skipped 8 previous
similar messages
May 17 16:50:16 lustre-mdt-00 kernel: Lustre: MGS: haven't heard from client
c8e311a5-f1d6-7197-1021-c5a02c1c5b14 (at 172.16.14.230@tcp) in 228 seconds.
I think it's dead, and I am evicting it.
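
The evictions above are the MDT giving up on clients that never answered a
lock blocking callback. In case it's relevant, these are the timeout knobs
we've been looking at on both servers and clients (1.8-era parameter names;
adaptive timeouts are on by default in 1.8):

  # RPC timeout, LDLM callback timeout, and adaptive-timeout bounds
  lctl get_param timeout ldlm_timeout at_min at_max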

On the clients, there is a kernel panic, with the following message on the
screen:

Code: 48 89 08 31 c9 48 89 12 48 89 52 08 ba 01 00 00 00 83 83 10
RIP   [<ffffffff8891ddcd>]  :mdc:mdc_exit_request+0x6d/0xb0
 RSP  <ffff81028c137858>
CR2:  0000000000003877
 <0>Kernel panic - not syncing: Fatal exception
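
That's all we get on the console before the machine dies. If a full backtrace
would help, we can capture the oops over the network with netconsole,
something like the following (the IP and MAC are placeholders; kdump or a
serial console would do the job too):

  # on a crashing client: send console output to a log host via UDP
  modprobe netconsole netconsole=@/eth0,[email protected]/00:16:3e:00:00:01
  # on the log host: collect it (traditional netcat syntax)
  nc -u -l -p 6666 | tee client-oops.log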

We're running the same set of jobs on both the 1.6.6 Lustre filesystem and
the 1.8.5 Lustre filesystem. Only the 1.8.5 clients crash; the 1.6.6 clients
that are also using the new servers never exhibit this issue. I'm assuming
there is a setting on the 1.8.5 clients that needs to be adjusted, but so far
my searching hasn't turned it up. Any pointers would be appreciated.
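
One thing we're considering as an experiment (purely a guess on our part,
since the crash is in the mdc layer) is turning off statahead on the 1.8.5
clients:

  # inspect the current value on a client
  lctl get_param llite.*.statahead_max
  # 0 disables statahead; easy to revert if it makes no difference
  lctl set_param llite.*.statahead_max=0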

Best regards,
Aaron