Were are running a patched (mballoc and others) 1.6.3 lustre on x86_64 platform under Centos4.5. Kernel is ...
Linux submit.ufhpc 2.6.18-8.1.14.el5Lustre #1 SMP Fri Oct 12 15:51:56 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux and is the same on the clients and server side. We have both tcp and o2ib NIDS. It seems to work pretty well most of the time but yesterday a large (256-way) job started and it seems that the MDS pretty much just fell over. This occurred at 18:36 (see log below). The whole cluster went into recovery mode as though we had rebooted the MDS. Prior to that we were getting some "target send reply msg" and "no handle for file close" errors. We know that each thread of the 256-way job opens a writes to a file independently but roughly concurrently so it probably got hit pretty hard all at once. Nonetheless, we hit it just as hard in our pre- production testing and it held up very well. So the questions are... 1. Does anyone recognize the errors below as known bugs for which there are patches/fixes we missed? 2. Looking at the tracebacks, I can't help but wonder if there is some ldlm parameter we need to tweak for a cluster of our size. Note that we have 320 o2ib clients and 80 tcp clients (400 nodes/1600 cores). Our MDS is a 4-way Opteron 2212 w/ 8GB RAM. We have 8 OSS's and 24 OST's (3 per OSS). Each OSS is a 4-way Opteron 275 with 4GB RAM and 3 2Gb/s FC HCAs connected to Xyratex 4900FC storage arrays. Any ideas? Suggestions? Thanks, Charlie Taylor UF HPC Center Jan 10 13:34:54 hpcmds kernel: LustreError: 24816:0:(ldlm_lib.c: 1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x183667281/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0 Jan 10 13:34:54 hpcmds kernel: LustreError: 24816:0:(ldlm_lib.c: 1437:target_send_reply_msg()) Skipped 2 previous similar messages Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(mds_open.c: 1474:mds_close()) @@@ no handle for file close ino 30269183: cookie 0x23710b5b3ed594eb [EMAIL PROTECTED] x183716808/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0 Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(mds_open.c: 1474:mds_close()) Skipped 6 previous similar messages Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(ldlm_lib.c: 1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x183716808/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0 Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(ldlm_lib.c: 1437:target_send_reply_msg()) Skipped 6 previous similar messages Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(mds_open.c: 1474:mds_close()) @@@ no handle for file close ino 30298600: cookie 0x23710b5b3c868998 [EMAIL PROTECTED] x183809041/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0 Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(mds_open.c: 1474:mds_close()) Skipped 1 previous similar message Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(ldlm_lib.c: 1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x183809041/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0 Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(ldlm_lib.c: 1437:target_send_reply_msg()) Skipped 1 previous similar message Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(mds_open.c: 1474:mds_close()) @@@ no handle for file close ino 30205356: cookie 0x23710b5b3dfc4c91 [EMAIL PROTECTED] x183817543/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0 Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(mds_open.c: 1474:mds_close()) Skipped 5 previous similar messages Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(ldlm_lib.c: 1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x183817543/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0 Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(ldlm_lib.c: 1437:target_send_reply_msg()) Skipped 5 previous similar messages Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(mds_open.c: 1474:mds_close()) @@@ no handle for file close ino 24185989: cookie 0x23710b5b3ed9874c [EMAIL PROTECTED] x183830524/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0 Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(mds_open.c: 1474:mds_close()) Skipped 1 previous similar message Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(ldlm_lib.c: 1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x183830524/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0 Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(ldlm_lib.c: 1437:target_send_reply_msg()) Skipped 1 previous similar message Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(mds_open.c: 1474:mds_close()) @@@ no handle for file close ino 28959573: cookie 0x23710b5b3eda7dd4 [EMAIL PROTECTED] x183871936/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0 Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(mds_open.c: 1474:mds_close()) Skipped 1 previous similar message Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(ldlm_lib.c: 1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x183871936/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0 Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(ldlm_lib.c: 1437:target_send_reply_msg()) Skipped 1 previous similar message Jan 10 13:51:11 hpcmds kernel: LustreError: 3730:0:(mds_open.c: 1474:mds_close()) @@@ no handle for file close ino 30137171: cookie 0x23710b5b3c790ddb [EMAIL PROTECTED] x183904052/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0 Jan 10 13:51:11 hpcmds kernel: LustreError: 3730:0:(ldlm_lib.c: 1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x183904052/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0 Jan 10 13:53:00 hpcmds kernel: Lustre: ufhpc-MDT0000: haven't heard from client 8514a628-1f97-0759-0932-de3e470cbb4f (at [EMAIL PROTECTED]) in 2252 seconds. I think it's dead, and I am evicting it. Jan 10 13:53:00 hpcmds kernel: Lustre: MGS: haven't heard from client 3e041a15-91e3-5c94-78fa-3b511d86bcd9 (at [EMAIL PROTECTED]) in 2252 seconds. I think it's dead, and I am evicting it. Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(mds_open.c: 1474:mds_close()) @@@ no handle for file close ino 30296257: cookie 0x23710b5b3c18eca5 [EMAIL PROTECTED] x184041701/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0 Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(mds_open.c: 1474:mds_close()) Skipped 2 previous similar messages Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(ldlm_lib.c: 1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x184041701/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0 Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(ldlm_lib.c: 1437:target_send_reply_msg()) Skipped 2 previous similar messages Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(mds_open.c: 1474:mds_close()) @@@ no handle for file close ino 29841036: cookie 0x23710b5b3d470bf2 [EMAIL PROTECTED] x184196858/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0 Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(mds_open.c: 1474:mds_close()) Skipped 4 previous similar messages Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(ldlm_lib.c: 1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x184196858/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0 Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(ldlm_lib.c: 1437:target_send_reply_msg()) Skipped 4 previous similar messages Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(mds_open.c: 1474:mds_close()) @@@ no handle for file close ino 30101082: cookie 0x23710b5b3e8e62f1 [EMAIL PROTECTED] x184423975/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0 Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(mds_open.c: 1474:mds_close()) Skipped 3 previous similar messages Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(ldlm_lib.c: 1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x184423975/t0 o35- >[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0 Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(ldlm_lib.c: 1437:target_send_reply_msg()) Skipped 3 previous similar messages Jan 10 18:36:57 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c: 210:waiting_locks_callback()) ### lock callback timer expired: evicting client e23549c4-cd61- [EMAIL PROTECTED] nid [EMAIL PROTECTED] ns: mds-ufhpc-MDT0000_UUID lock: ffff8100bb893680/0x23710b5b59ab4680 lrc: 1/0,0 mode: CR/CR res: 30066994/507692538 bits 0x3 rrc: 309 type: IBT flags: 4000030 remote: 0x4a51d1af1d561084 expref: 13 pid 31402 Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(handler.c: 1498:mds_handle()) operation 101 on unconnected MDS from [EMAIL PROTECTED] Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(ldlm_lib.c: 1437:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x10882458/t0 o101-><?>@<?>:-1 lens 512/0 ref 0 fl Interpret:/0/0 rc -107/0 Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(ldlm_lib.c: 1437:target_send_reply_msg()) Skipped 12 previous similar messages Jan 10 18:45:17 hpcmds kernel: [<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160 Jan 10 18:45:17 hpcmds kernel: [<ffffffff885f04e0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x6b0 Jan 10 18:45:17 hpcmds kernel: [<ffffffff885ec6a0>] :ptlrpc:ldlm_handle_enqueue+0xda0/0x1420 Jan 10 18:45:17 hpcmds kernel: [<ffffffff888430bd>] :mds:mds_handle +0x478d/0x59af Jan 10 18:45:17 hpcmds kernel: [<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90 Jan 10 18:45:17 hpcmds kernel: [<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160 Jan 10 18:45:17 hpcmds kernel: [<ffffffff885f04e0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x6b0 Jan 10 18:45:17 hpcmds kernel: [<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0 Jan 10 18:45:17 hpcmds kernel: [<ffffffff888430bd>] :mds:mds_handle +0x478d/0x59af Jan 10 18:45:17 hpcmds kernel: [<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90 Jan 10 18:45:17 hpcmds kernel: [<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160 Jan 10 18:45:17 hpcmds kernel: [<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70 Jan 10 18:45:17 hpcmds kernel: [<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0 Jan 10 18:45:17 hpcmds kernel: [<ffffffff80086a74>] default_wake_function+0x0/0xe Jan 10 18:45:17 hpcmds kernel: [<ffffffff8001c26c>] __mod_timer +0xb0/0xbe Jan 10 18:45:17 hpcmds kernel: [<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90 Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be25>] child_rip+0xa/0x11 Jan 10 18:45:17 hpcmds kernel: [<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70 Jan 10 18:45:17 hpcmds kernel: [<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0 Jan 10 18:45:17 hpcmds kernel: [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70 Jan 10 18:45:17 hpcmds kernel: [<ffffffff80086a74>] default_wake_function+0x0/0xe Jan 10 18:45:17 hpcmds kernel: [<ffffffff8001c26c>] __mod_timer +0xb0/0xbe Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be1b>] child_rip+0x0/0x11 Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be25>] child_rip+0xa/0x11 Jan 10 18:45:17 hpcmds kernel: Jan 10 18:45:17 hpcmds kernel: ll_mdt_100 S ffff81022c2315e0 0 31377 1 31378 31376 [<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70 Jan 10 18:45:17 hpcmds kernel: (L-TLB) Jan 10 18:45:17 hpcmds kernel: ffff81022c2315e0 ffff81022c231510 ffff81022c231510 000000000000000a Jan 10 18:45:17 hpcmds kernel: ffff8102206c90c0 ffff81013396f040 000abe5ed9a075d0 [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70 Jan 10 18:45:17 hpcmds kernel: 0000000000009a43 Jan 10 18:45:17 hpcmds kernel: ffff8102206c92a8 0000000000000002 ffffffff885ceb51 ffffffffffffffff Jan 10 18:45:17 hpcmds kernel: Call Trace: Jan 10 18:45:17 hpcmds kernel: [<ffffffff80086a74>] default_wake_function+0x0/0xe Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be1b>] child_rip+0x0/0x11 Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be25>] child_rip+0xa/0x11 Jan 10 18:45:17 hpcmds kernel: Jan 10 18:45:17 hpcmds kernel: ll_mdt_17 S ffff81011b57f6a0 0 26533 1 26534 26532 (L-TLB) Jan 10 18:45:17 hpcmds kernel: ffff81011b57f6a0 [<ffffffff885ceb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0 Jan 10 18:45:17 hpcmds kernel: ffff81011b57f5d0 ffff81011b57f5d0 000000000000000a Jan 10 18:45:17 hpcmds kernel: ffff810118f48040 ffff8101339010c0 000abe5ed9897be8 0000000000008bd4 Jan 10 18:45:17 hpcmds kernel: ffff810118f48228 0000000000000001 ffffffff885ceb51 ffffffffffffffff Jan 10 18:45:17 hpcmds kernel: Call Trace: Jan 10 18:45:17 hpcmds kernel: [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70 Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be1b>] child_rip+0x0/0x11 Jan 10 18:45:17 hpcmds kernel: Jan 10 18:45:17 hpcmds kernel: [<ffffffff885e6650>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250 Jan 10 18:45:17 hpcmds kernel: [<ffffffff885ceb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0 Jan 10 18:45:17 hpcmds kernel: [<ffffffff800611f7>] schedule_timeout +0x8a/0xad Jan 10 18:45:17 hpcmds kernel: [<ffffffff80092c5e>] process_timeout +0x0/0x5 Jan 10 18:45:17 hpcmds kernel: [<ffffffff885e6650>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250 Jan 10 18:45:17 hpcmds kernel: [<ffffffff800611f7>] schedule_timeout +0x8a/0xad Jan 10 18:45:17 hpcmds kernel: [<ffffffff885e7aad>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0 Jan 10 18:45:17 hpcmds kernel: [<ffffffff80092c5e>] process_timeout +0x0/0x5 Jan 10 18:45:17 hpcmds kernel: [<ffffffff885d05e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0 Jan 10 18:45:17 hpcmds kernel: [<ffffffff885e7aad>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0 Jan 10 18:45:17 hpcmds kernel: [<ffffffff80086a74>] default_wake_function+0x0/0xe Jan 10 18:45:17 hpcmds kernel: [<ffffffff885d05e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0 Jan 10 18:45:17 hpcmds kernel: [<ffffffff885cce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90 Jan 10 18:45:17 hpcmds kernel: [<ffffffff80086a74>] default_wake_function+0x0/0xe Jan 10 18:45:17 hpcmds kernel: [<ffffffff885e4460>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0 Jan 10 18:45:17 hpcmds kernel: [<ffffffff885cce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90 Followed by tons more of the above tracebacks. _______________________________________________ Lustre-discuss mailing list [email protected] https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
