Hi,


Today our mds started to behave unstable. /proc/fs/lustre/health_check file reported that mds device is not healthy. All clients connected to ddn_home file system got stuck and MDS server started to refuse client connections and after some time it started to evict clients. Can some one help me get to the bottom of this problem? Below I attached  log snippets from the MDS and one client

Thanks

Wojciech


Lustre-1.6.6
RHEL4
2.6.9-67.0.22.EL_lustre.1.6.6smp
600 clients
24 OST/4 OSS



Feb  6 11:42:34 mds01 kernel: LustreError: 19469:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920454, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-
MDT0000_UUID lock: 0000010238fdb700/0x2dea4e283aa99c1c lrc: 3/0,1 mode: --/EX res: 38667385/3946061039 bits 0x2 rrc: 16 type: IBT flags: 4004030 remote: 0x0 expref: -99 pid 19469
Feb  6 11:42:34 mds01 kernel: LustreError: dumping log to /tmp/lustre-log.1233920554.19471
Feb  6 11:42:34 mds01 kernel: LustreError: 19469:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Feb  6 11:42:34 mds01 kernel: LustreError: 19509:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920454, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-
MDT0000_UUID lock: 000001014281e040/0x2dea4e283aa9ba14 lrc: 3/1,0 mode: --/CR res: 38662560/3951344313 bits 0x1 rrc: 6 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19509
Feb  6 11:42:38 mds01 kernel: LustreError: 19506:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920458, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-
MDT0000_UUID lock: 00000101882b6700/0x2dea4e283aabc0f4 lrc: 3/1,0 mode: --/CR res: 38662560/3951344313 bits 0x1 rrc: 6 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19506
Feb  6 11:42:38 mds01 kernel: LustreError: 19506:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
Feb  6 11:42:39 mds01 kernel: LustreError: 19463:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920459, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-
MDT0000_UUID lock: 000001036a58fdc0/0x2dea4e283aac7ad6 lrc: 3/1,0 mode: --/CR res: 38662560/3951344313 bits 0x1 rrc: 6 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19463
Feb  6 11:42:50 mds01 kernel: LustreError: 19498:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920470, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-
MDT0000_UUID lock: 0000010242c97940/0x2dea4e283ab844cb lrc: 3/1,0 mode: --/CR res: 38667385/3946061039 bits 0x3 rrc: 16 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19498
Feb  6 11:42:50 mds01 kernel: LustreError: 19498:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) Skipped 6 previous similar messages
Feb  6 11:44:17 mds01 kernel: Lustre: 0:0:(watchdog.c:148:lcw_cb()) Watchdog triggered for pid 19469: it was inactive for 200s
Feb  6 11:44:17 mds01 kernel: Lustre: 0:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process 19471
Feb  6 11:44:17 mds01 kernel: ll_mdt_50     S 0000000000000000     0 19471      1         19472 19470 (L-TLB)

...
lustre watchdog log dumps
...
Feb  6 11:44:17 mds01 kernel: Lustre: 19507:0:(ldlm_lib.c:538:target_handle_reconnect()) ddn_home-MDT0000: 3d03a918-dd07-998f-226e-16a7634433bf reconnecting
Feb  6 11:44:17 mds01 kernel: Lustre: 19507:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 1 previous similar message
Feb  6 11:44:17 mds01 kernel: Lustre: 19507:0:(ldlm_lib.c:773:target_handle_connect()) ddn_home-MDT0000: refuse reconnection from 3d03a918-dd07-998f-226e-16a763443...@node-h27@tcp to 0x000001027449c000; still busy with 3 active RPCs
Feb  6 11:44:17 mds01 kernel: Lustre: 19507:0:(ldlm_lib.c:773:target_handle_connect()) Skipped 1 previous similar message
Feb  6 11:44:17 mds01 kernel: LustreError: 19507:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16)  r...@0000010402541c50 x2988280/t0 o38->3d03a918-dd07-998f-226e-16a763443...@net_0x200000a8f081b_uuid:0/0 lens 304/200
 e 0 to 0 dl 1233920754 ref 1 fl Interpret:/0/0 rc -16/0
Feb  6 11:44:17 mds01 kernel: LustreError: 19507:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 1 previous similar message
Feb  6 11:44:17 mds01 kernel: ll_mdt_103    S 0000000000000000     0 19525      1         19526 19524 (L-TLB)
...
lustre watchdog log dumps
...
Feb  6 11:44:35 mds01 kernel: Lustre: 19473:0:(ldlm_lib.c:538:target_handle_reconnect()) ddn_home-MDT0000: d225b081-9db9-b8c4-9e6d-6558af87cdeb reconnecting
Feb  6 11:44:35 mds01 kernel: LustreError: dumping log to /tmp/lustre-log.1233920674.19495
Feb  6 11:44:35 mds01 kernel: Lustre: 19473:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 2 previous similar messages
Feb  6 11:44:35 mds01 kernel: Lustre: 19473:0:(ldlm_lib.c:773:target_handle_connect()) ddn_home-MDT0000: refuse reconnection from d225b081-9db9-b8c4-9e6d-6558af87c...@node-h02@tcp to 0x00000102763b8000; still busy with 5 active RPCs
Feb  6 11:44:35 mds01 kernel: Lustre: 19473:0:(ldlm_lib.c:773:target_handle_connect()) Skipped 2 previous similar messages
Feb  6 11:44:35 mds01 kernel: LustreError: 19473:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16)  r...@000001007a4ee800 x2710478/t0 o38->d225b081-9db9-b8c4-9e6d-6558af87c...@net_0x200000a8f0802_uuid:0/0 lens 304/200
 e 0 to 0 dl 1233920774 ref 1 fl Interpret:/0/0 rc -16/0
Feb  6 11:44:35 mds01 kernel: LustreError: 19473:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 2 previous similar messages
Feb  6 11:45:00 mds01 kernel: LustreError: 19485:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920600, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-
MDT0000_UUID lock: 00000101a3c94280/0x2dea4e283abe817a lrc: 3/1,0 mode: --/CR res: 38656801/3946041493 bits 0x3 rrc: 26 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19485
Feb  6 11:45:14 mds01 kernel: LustreError: 19550:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920614, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-
MDT0000_UUID lock: 00000103d3b90280/0x2dea4e283abed080 lrc: 3/1,0 mode: --/CR res: 24610283/3605486555 bits 0x3 rrc: 23 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19550
Feb  6 11:45:14 mds01 kernel: LustreError: 19550:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) Skipped 10 previous similar messages
Feb  6 11:45:31 mds01 kernel: LustreError: 19520:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233920631, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-
MDT0000_UUID lock: 000001025e8334c0/0x2dea4e283abfa192 lrc: 3/1,0 mode: --/CR res: 24610283/3605486555 bits 0x3 rrc: 24 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19520
Feb  6 11:45:31 mds01 kernel: LustreError: 19520:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) Skipped 1 previous similar message
Feb  6 11:46:40 mds01 kernel: Lustre: 0:0:(watchdog.c:148:lcw_cb()) Watchdog triggered for pid 19485: it was inactive for 200s
Feb  6 11:46:40 mds01 kernel: Lustre: 0:0:(watchdog.c:148:lcw_cb()) Skipped 1 previous similar message
Feb  6 11:46:40 mds01 kernel: Lustre: 0:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process 19485
Feb  6 11:46:40 mds01 kernel: Lustre: 0:0:(linux-debug.c:185:libcfs_debug_dumpstack()) Skipped 7 previous similar messages
Feb  6 11:46:40 mds01 kernel: ll_mdt_63     S 0000000000000000     0 19485      1         19486 19484 (L-TLB)
...
...
Feb  6 11:57:27 mds01 kernel: Lustre: ddn_home-MDT0000: haven't heard from client d225b081-9db9-b8c4-9e6d-6558af87cdeb (at node-...@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Feb  6 11:57:27 mds01 kernel: Lustre: Skipped 4 previous similar messages
Feb  6 11:57:35 mds01 kernel: LustreError: 19467:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1233921355, 100s ago); not entering recovery in server code, just going back to sleep ns: mds-ddn_home-
MDT0000_UUID lock: 0000010341014dc0/0x2dea4e283ad5221d lrc: 3/1,0 mode: --/CR res: 24610283/3605486555 bits 0x3 rrc: 43 type: IBT flags: 4004000 remote: 0x0 expref: -99 pid 19467
Feb  6 11:57:35 mds01 kernel: LustreError: 19467:0:(ldlm_request.c:81:ldlm_expired_completion_wait()) Skipped 7 previous similar messages
Feb  6 11:58:43 mds01 kernel: Lustre: ddn_home-MDT0000: haven't heard from client a50193fa-1aef-44fe-f917-7b3063f7cc30 (at node-...@tcp) in 221 seconds. I think it's dead, and I am evicting it.
Feb  6 11:58:43 mds01 kernel: Lustre: Skipped 1 previous similar message
Feb  6 11:59:15 mds01 kernel: Lustre: 0:0:(watchdog.c:148:lcw_cb()) Watchdog triggered for pid 19467: it was inactive for 200s
Feb  6 11:59:15 mds01 kernel: Lustre: 0:0:(watchdog.c:148:lcw_cb()) Skipped 7 previous similar messages
Feb  6 11:59:15 mds01 kernel: Lustre: 0:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process 19467
Feb  6 11:59:15 mds01 kernel: Lustre: 0:0:(linux-debug.c:185:libcfs_debug_dumpstack()) Skipped 7 previous similar messages
Feb  6 11:59:15 mds01 kernel: Lustre: 19502:0:(ldlm_lib.c:538:target_handle_reconnect()) ddn_home-MDT0000: 3d03a918-dd07-998f-226e-16a7634433bf reconnecting
Feb  6 11:59:15 mds01 kernel: Lustre: 19502:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 6 previous similar messages
Feb  6 11:59:15 mds01 kernel: Lustre: 19502:0:(ldlm_lib.c:773:target_handle_connect()) ddn_home-MDT0000: refuse reconnection from 3d03a918-dd07-998f-226e-16a763443...@node-h27@tcp to 0x000001029c566000; still busy with 2 active RPCs
Feb  6 11:59:15 mds01 kernel: Lustre: 19502:0:(ldlm_lib.c:773:target_handle_connect()) Skipped 6 previous similar messages
Feb  6 11:59:15 mds01 kernel: LustreError: 19502:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16)  r...@0000010216915000 x2989437/t0 o38->3d03a918-dd07-998f-226e-16a763443...@net_0x200000a8f081b_uuid:0/0 lens 304/200
 e 0 to 0 dl 1233921655 ref 1 fl Interpret:/0/0 rc -16/0
Feb  6 11:59:15 mds01 kernel: LustreError: 19502:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 6 previous similar messages
Feb  6 11:59:15 mds01 kernel: ll_mdt_46     S 0000000000000000     0 19467      1         19468 19466 (L-TLB)


Client:
node-h02

Feb  6 11:44:30 node-h02 kernel: Lustre: Request x2710369 sent from ddn_home-MDT0000-mdc-00000100069fc400 to NID md...@tcp 200s ago has timed out (limit 200s).
Feb  6 11:44:30 node-h02 kernel: Lustre: Request x2710376 sent from ddn_home-MDT0000-mdc-00000100069fc400 to NID md...@tcp 200s ago has timed out (limit 200s).
Feb  6 11:44:30 node-h02 kernel: Lustre: ddn_home-MDT0000-mdc-00000100069fc400: Connection to service ddn_home-MDT0000 via nid md...@tcp was lost; in progress operations using this service will wait for recovery to complete.
Feb  6 11:44:30 node-h02 kernel: Lustre: Skipped 2 previous similar messages
Feb  6 11:44:34 node-h02 kernel: LustreError: 11-0: an error occurred while communicating with md...@tcp. The mds_connect operation failed with -16
Feb  6 11:46:10 node-h02 kernel: Lustre: Changing connection for ddn_home-MDT0000-mdc-00000100069fc400 to md...@tcp/md...@tcp
Feb  6 11:49:30 node-h02 kernel: Lustre: Request x2710506 sent from ddn_home-MDT0000-mdc-00000100069fc400 to NID md...@tcp 200s ago has timed out (limit 200s).
Feb  6 11:50:20 node-h02 kernel: Lustre: 15881:0:(import.c:410:import_select_connection()) ddn_home-MDT0000-mdc-00000100069fc400: tried all connections, increasing latency to 9s
Feb  6 11:50:20 node-h02 kernel: Lustre: Changing connection for ddn_home-MDT0000-mdc-00000100069fc400 to md...@tcp/md...@tcp
Feb  6 11:50:20 node-h02 kernel: LustreError: 167-0: This client was evicted by ddn_home-MDT0000; in progress operations using this service will fail.
Feb  6 11:50:20 node-h02 kernel: LustreError: 9041:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -5
Feb  6 11:50:20 node-h02 kernel: LustreError: 9301:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -5
Feb  6 11:50:20 node-h02 kernel: LustreError: 9189:0:(file.c:2925:ll_inode_revalidate_fini()) failure -5 inode 38656801
Feb  6 11:50:20 node-h02 kernel: LustreError: 9301:0:(mdc_locks.c:598:mdc_enqueue()) Skipped 1 previous similar message
Feb  6 11:50:20 node-h02 kernel: LustreError: 9080:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -5
Feb  6 11:50:20 node-h02 kernel: LustreError: 9080:0:(file.c:2925:ll_inode_revalidate_fini()) failure -5 inode 38656801
Feb  6 11:50:20 node-h02 kernel: LustreError: 9080:0:(file.c:2925:ll_inode_revalidate_fini()) Skipped 1 previous similar message
Feb  6 11:50:20 node-h02 kernel: LustreError: 9367:0:(client.c:722:ptlrpc_import_delay_req()) @@@ IMP_INVALID  r...@00000101adf1d200 x2710664/t0 o35->ddn_home-mdt0000_u...@mds01@tcp:23/10 lens 296/688 e 0 to 200 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
Feb  6 11:50:20 node-h02 kernel: LustreError: 9367:0:(file.c:116:ll_close_inode_openhandle()) inode 26872571 mdc close failed: rc = -108
Feb  6 11:50:20 node-h02 kernel: Lustre: ddn_home-MDT0000-mdc-00000100069fc400: Connection restored to service ddn_home-MDT0000 using nid md...@tcp.
Feb  6 11:50:20 node-h02 kernel: LustreError: 9080:0:(mdc_request.c:741:mdc_close()) Unexpected: can't find mdc_open_data, but the close succeeded.  Please tell <http://bugzilla.lustre.org/>.
Feb  6 11:50:20 node-h02 kernel: LustreError: 9187:0:(mdc_request.c:741:mdc_close()) Unexpected: can't find mdc_open_data, but the close succeeded.  Please tell <http://bugzilla.lustre.org/>.
Feb  6 11:50:20 node-h02 kernel: LustreError: 9187:0:(mdc_request.c:741:mdc_close()) Skipped 77 previous similar messages
Feb  6 11:53:40 node-h02 kernel: Lustre: Request x2710802 sent from ddn_home-MDT0000-mdc-00000100069fc400 to NID md...@tcp 200s ago has timed out (limit 200s).
Feb  6 11:53:40 node-h02 kernel: Lustre: ddn_home-MDT0000-mdc-00000100069fc400: Connection to service ddn_home-MDT0000 via nid md...@tcp was lost; in progress operations using this service will wait for recovery to complete.
Feb  6 11:53:40 node-h02 kernel: LustreError: 11-0: an error occurred while communicating with md...@tcp. The mds_connect operation failed with -16
Feb  6 11:54:30 node-h02 kernel: Lustre: Changing connection for ddn_home-MDT0000-mdc-00000100069fc400 to md...@tcp/md...@tcp
Feb  6 11:57:50 node-h02 kernel: Lustre: Request x2710887 sent from ddn_home-MDT0000-mdc-00000100069fc400 to NID md...@tcp 200s ago has timed out (limit 200s).
Feb  6 11:58:40 node-h02 kernel: Lustre: 15881:0:(import.c:410:import_select_connection()) ddn_home-MDT0000-mdc-00000100069fc400: tried all connections, increasing latency to 14s
Feb  6 11:58:40 node-h02 kernel: Lustre: Changing connection for ddn_home-MDT0000-mdc-00000100069fc400 to md...@tcp/md...@tcp
Feb  6 11:58:40 node-h02 kernel: LustreError: 167-0: This client was evicted by ddn_home-MDT0000; in progress operations using this service will fail.
Feb  6 11:58:40 node-h02 kernel: LustreError: 10470:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -5


_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Reply via email to