Hello everyone,

Since last night we have been facing problems on our scratch FS (Lustre v1.8.4). On the corresponding MDS, I found the following in /var/log/messages:
Jan 24 23:12:13 mds kernel: LustreError: 5107:0:(ldlm_lock.c:430:__ldlm_handle2lock()) ASSERTION(handle) failed
Jan 24 23:12:13 mds kernel: LustreError: 5107:0:(ldlm_lock.c:430:__ldlm_handle2lock()) LBUG
Jan 24 23:12:13 mds kernel: Pid: 5107, comm: ll_mdt_106
Jan 24 23:12:13 mds kernel:
Jan 24 23:12:13 mds kernel: Call Trace:
Jan 24 23:12:13 mds kernel: [<ffffffffa05d88ba>] libcfs_debug_dumpstack+0x4a/0x70 [libcfs]
Jan 24 23:12:13 mds kernel: [<ffffffffa05d8e2c>] lbug_with_loc+0x6c/0xd0 [libcfs]
Jan 24 23:12:13 mds kernel: [<ffffffffa05e2550>] tracefile_init+0x0/0x170 [libcfs]
Jan 24 23:12:13 mds kernel: [<ffffffffa06eb24f>] __ldlm_handle2lock+0x35f/0x3b0 [ptlrpc]
Jan 24 23:12:13 mds kernel: [<ffffffffa06ec00f>] ldlm_lock_decref+0x1f/0xb0 [ptlrpc]
Jan 24 23:12:13 mds kernel: [<ffffffffa09ba124>] mds_verify_child+0x4e4/0x8b0 [mds]
Jan 24 23:12:13 mds kernel: [<ffffffffa09bcf39>] mds_get_parent_child_locked+0x4e9/0x980 [mds]
Jan 24 23:12:13 mds kernel: [<ffffffffa09e0886>] mds_open+0x716/0x3680 [mds]
Jan 24 23:12:13 mds kernel: [<ffffffffa09b73c7>] mds_reint_rec+0x157/0x2d0 [mds]
Jan 24 23:12:13 mds kernel: [<ffffffffa09ac207>] mds_reint+0x1d7/0x450 [mds]
Jan 24 23:12:13 mds kernel: [<ffffffffa09b1ae0>] mds_intent_policy+0x6f0/0xd20 [mds]
Jan 24 23:12:13 mds kernel: [<ffffffffa06ec958>] ldlm_lock_enqueue+0x468/0xbf0 [ptlrpc]
Jan 24 23:12:13 mds kernel: [<ffffffffa070fe85>] ldlm_handle_enqueue+0x265/0x1300 [ptlrpc]
Jan 24 23:12:13 mds kernel: [<ffffffffa09b2bcc>] mds_handle+0xabc/0x5150 [mds]
Jan 24 23:12:13 mds kernel: [<ffffffffa073ea32>] ptlrpc_server_handle_request+0x902/0x1330 [ptlrpc]
Jan 24 23:12:13 mds kernel: [<ffffffffa07411a4>] ptlrpc_main+0x6d4/0x1610 [ptlrpc]
Jan 24 23:12:13 mds kernel: [<ffffffff8020cf49>] child_rip+0xa/0x11
Jan 24 23:12:13 mds kernel:
Jan 24 23:12:13 mds kernel: LustreError: dumping log to /tmp/lustre-log.1359065533.5107

By googling I found that this is a bug fixed in version 1.8.6 (http://jira.whamcloud.com/browse/LU-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel). After that follows a long list of Lustre errors like the ones below, and the list keeps growing as we speak:

Jan 24 23:25:38 mds kernel: LustreError: 23676:0:(ldlm_request.c:83:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1359066138, 200s ago); not entering recovery in server code, just going back to sleep ns: mds-scratch-MDT0000_UUID lock: ffff88080a81f800/0x48997a721de79abd lrc: 3/0,1 mode: --/EX res: 381225206/1751756546 bits 0x3 rrc: 63 type: IBT flags: 0x4004030 remote: 0x0 expref: -99 pid: 23676 timeout: 0
Jan 24 23:25:38 mds kernel: LustreError: dumping log to /tmp/lustre-log.1359066338.23676
Jan 24 23:26:06 mds kernel: Lustre: 23677:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000: ba8ef0e3-630b-714c-4a37-b7558adba3d0 reconnecting
Jan 24 23:26:25 mds kernel: LustreError: 23659:0:(ldlm_request.c:83:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1359066185, 200s ago); not entering recovery in server code, just going back to sleep ns: mds-scratch-MDT0000_UUID lock: ffff880c04ded200/0x48997a721e02cf62 lrc: 3/1,0 mode: --/CR res: 381225206/1751756546 bits 0x3 rrc: 64 type: IBT flags: 0x4004000 remote: 0x0 expref: -99 pid: 23659 timeout: 0
Jan 24 23:26:26 mds kernel: LustreError: 5071:0:(ldlm_request.c:83:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1359066186, 200s ago); not entering recovery in server code, just going back to sleep ns: mds-scratch-MDT0000_UUID lock: ffff880b4b213a00/0x48997a721e03a0b3 lrc: 3/1,0 mode: --/CR res: 381225206/1751756546 bits 0x3 rrc: 64 type: IBT flags: 0x4004000 remote: 0x0 expref: -99 pid: 5071 timeout: 0
Jan 24 23:26:29 mds kernel: Lustre: 5033:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000: 58837d54-5d72-cecb-593b-e7d78c9c54d3 reconnecting
Jan 24 23:27:15 mds kernel: LustreError: 5101:0:(ldlm_request.c:83:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1359066235, 200s ago); not entering recovery in server code, just going back to sleep ns: mds-scratch-MDT0000_UUID lock: ffff88090f7c7600/0x48997a721e1e8189 lrc: 3/1,0 mode: --/CR res: 381225206/1751756546 bits 0x3 rrc: 67 type: IBT flags: 0x4004000 remote: 0x0 expref: -99 pid: 5101 timeout: 0
Jan 24 23:27:18 mds kernel: Lustre: 23684:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000: 141ba2f1-5a32-282b-1b9f-fde05c0caa1a reconnecting
Jan 24 23:27:22 mds kernel: Lustre: 5099:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000: e57b39b0-2e26-0e11-3995-e801a3fb4ce3 reconnecting
Jan 24 23:27:24 mds kernel: Lustre: 23672:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000: 72b809ed-de60-1c38-f17e-145a2294c6db reconnecting
Jan 24 23:27:24 mds kernel: Lustre: 23672:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 1 previous similar message
Jan 24 23:27:26 mds kernel: LustreError: 23665:0:(ldlm_request.c:83:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1359066246, 200s ago); not entering recovery in server code, just going back to sleep ns: mds-scratch-MDT0000_UUID lock: ffff8808c4952000/0x48997a721e2588eb lrc: 3/1,0 mode: --/CR res: 381225206/1751756546 bits 0x3 rrc: 67 type: IBT flags: 0x4004000 remote: 0x0 expref: -99 pid: 23665 timeout: 0
Jan 24 23:27:29 mds kernel: Lustre: 5108:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000: c9029114-746d-964e-8712-906a310f598f reconnecting
Jan 24 23:27:29 mds kernel: Lustre: 5108:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 1 previous similar message
Jan 24 23:27:38 mds kernel: Lustre: 5032:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000: d3e7f205-2e08-6614-3bf0-c2344ea5bf09 reconnecting
Jan 24 23:27:38 mds kernel: Lustre: 5032:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 2 previous similar messages
Jan 24 23:27:56 mds kernel: Lustre: 5066:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000: 4e67a94e-7e0d-4b29-6272-587a72de2498 reconnecting
Jan 24 23:27:56 mds kernel: Lustre: 5066:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 10 previous similar messages
Jan 24 23:28:36 mds kernel: Lustre: 23686:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000: f5637638-bf4a-320a-6967-7cffaea36273 reconnecting
Jan 24 23:28:36 mds kernel: Lustre: 23686:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 8 previous similar messages
Jan 24 23:28:36 mds kernel: Lustre: 5039:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-383), not sending early reply
Jan 24 23:28:36 mds kernel: req@ffff88054a4ea400 x1425048289875237/t0 o101->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0 lens 1024/3680 e 4 to 0 dl 1359066521 ref 2 fl Interpret:/0/0 rc 0/0
Jan 24 23:30:54 mds kernel: Lustre: 5023:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to 0xffff880beb978200; still busy with 1 active RPCs
Jan 24 23:30:54 mds kernel: LustreError: 5023:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16) req@ffff880425d99c00 x1425048289890866/t0 o38->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0 lens 368/264 e 0 to 0 dl 1359066754 ref 1 fl Interpret:/0/0 rc -16/0
Jan 24 23:31:01 mds kernel: Lustre: 5124:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to 0xffff880beb978200; still busy with 1 active RPCs
Jan 24 23:31:01 mds kernel: LustreError: 5124:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16) req@ffff88081b5c4000 x1425048289890867/t0 o38->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0 lens 368/264 e 0 to 0 dl 1359066761 ref 1 fl Interpret:/0/0 rc -16/0
Jan 24 23:31:08 mds kernel: Lustre: 5086:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to 0xffff880beb978200; still busy with 1 active RPCs
Jan 24 23:31:08 mds kernel: LustreError: 5086:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16) req@ffff8806213cfc00 x1425048289891170/t0 o38->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0 lens 368/264 e 0 to 0 dl 1359066768 ref 1 fl Interpret:/0/0 rc -16/0
Jan 24 23:31:15 mds kernel: Lustre: 5081:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to 0xffff880beb978200; still busy with 1 active RPCs
Jan 24 23:31:15 mds kernel: LustreError: 5081:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16) req@ffff88011d249000 x1425048289891171/t0 o38->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0 lens 368/264 e 0 to 0 dl 1359066775 ref 1 fl Interpret:/0/0 rc -16/0
Jan 24 23:31:22 mds kernel: Lustre: 5017:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to 0xffff880beb978200; still busy with 1 active RPCs
Jan 24 23:31:22 mds kernel: LustreError: 5017:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16) req@ffff8805d32c3000 x1425048289891172/t0 o38->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0 lens 368/264 e 0 to 0 dl 1359066782 ref 1 fl Interpret:/0/0 rc -16/0
Jan 24 23:31:29 mds kernel: Lustre: 5040:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to 0xffff880beb978200; still busy with 1 active RPCs
Jan 24 23:31:29 mds kernel: LustreError: 5040:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16) req@ffff8803a768a800 x1425048289891173/t0 o38->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0 lens 368/264 e 0 to 0 dl 1359066789 ref 1 fl Interpret:/0/0 rc -16/0
Jan 24 23:31:43 mds kernel: Lustre: 5039:0:(ldlm_lib.c:872:target_handle_connect()) scratch-MDT0000: refuse reconnection from ee75239f-4115-3c1d-a11a-7649df079499@X.X.28.1@o2ib to 0xffff880beb978200; still busy with 1 active RPCs
Jan 24 23:31:43 mds kernel: Lustre: 5039:0:(ldlm_lib.c:872:target_handle_connect()) Skipped 1 previous similar message
Jan 24 23:31:43 mds kernel: LustreError: 5039:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-16) req@ffff88040fe6b000 x1425048289891477/t0 o38->ee75239f-4115-3c1d-a11a-7649df079499@NET_0x500000a031c01_UUID:0/0 lens 368/264 e 0 to 0 dl 1359066803 ref 1 fl Interpret:/0/0 rc -16/0
Jan 24 23:31:43 mds kernel: LustreError: 5039:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 1 previous similar message
Jan 24 23:31:50 mds kernel: Lustre: 23682:0:(ldlm_lib.c:572:target_handle_reconnect()) scratch-MDT0000: ee75239f-4115-3c1d-a11a-7649df079499 reconnecting
Jan 24 23:31:50 mds kernel: Lustre: 23682:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 22 previous similar messages

I was wondering whether failing over the MDT and then rebooting the MDS would solve the problem. If you have any ideas or proposals, I would be more than happy to hear them.

Stelios
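For reference, the failover I have in mind would be roughly the following standard procedure; the device path, mount point, and filesystem name below are placeholders for our actual shared-storage setup, not commands I have run yet:

```shell
# On the failed MDS: try to stop the MDT cleanly. With an LBUG'd
# ll_mdt thread this may well hang, in which case the node would
# have to be power-cycled instead of unmounted cleanly.
umount /mnt/mdt

# On the failover node: mounting the same MDT device (on shared
# storage) starts the MDT there and begins client recovery.
mount -t lustre /dev/mapper/mdt_dev /mnt/mdt

# Watch recovery progress until the status reports COMPLETE
# (path as on our 1.8 servers; fsname "scratch" assumed).
cat /proc/fs/lustre/mds/scratch-MDT0000/recovery_status
```
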
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss