Hi,

We have similar problems when running benchIO and metaBench with a large number of clients. I found the following patches that may help resolve the problem. I haven't tried them yet on our Lustre installation, but I am going to do that during the weekend.
https://bugzilla.lustre.org/attachment.cgi?id=13540
https://bugzilla.lustre.org/attachment.cgi?id=13739

cheers

Wojciech Turek

On 11 Jan 2008, at 12:18, Charles Taylor wrote:

>
> We are running a patched (mballoc and others) 1.6.3 Lustre on the x86_64 platform under Centos4.5. Kernel is ...
>
> Linux submit.ufhpc 2.6.18-8.1.14.el5Lustre #1 SMP Fri Oct 12 15:51:56 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
>
> and is the same on the clients and server side. We have both tcp and o2ib NIDs.
>
> It seems to work pretty well most of the time, but yesterday a large (256-way) job started and it seems that the MDS pretty much just fell over. This occurred at 18:36 (see log below). The whole cluster went into recovery mode as though we had rebooted the MDS. Prior to that we were getting some "target send reply msg" and "no handle for file close" errors.
>
> We know that each thread of the 256-way job opens and writes to a file independently but roughly concurrently, so the MDS probably got hit pretty hard all at once. Nonetheless, we hit it just as hard in our pre-production testing and it held up very well. So the questions are...
>
> 1. Does anyone recognize the errors below as known bugs for which there are patches/fixes we missed?
>
> 2. Looking at the tracebacks, I can't help but wonder if there is some ldlm parameter we need to tweak for a cluster of our size.
>
> Note that we have 320 o2ib clients and 80 tcp clients (400 nodes/1600 cores). Our MDS is a 4-way Opteron 2212 w/ 8GB RAM. We have 8 OSSs and 24 OSTs (3 per OSS). Each OSS is a 4-way Opteron 275 with 4GB RAM and 3 2Gb/s FC HCAs connected to Xyratex 4900FC storage arrays.
>
> Any ideas? Suggestions?
>
> Thanks,
>
> Charlie Taylor
> UF HPC Center
>
> Jan 10 13:34:54 hpcmds kernel: LustreError: 24816:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x183667281/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:34:54 hpcmds kernel: LustreError: 24816:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 2 previous similar messages
> Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 30269183: cookie 0x23710b5b3ed594eb [EMAIL PROTECTED] x183716808/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(mds_open.c:1474:mds_close()) Skipped 6 previous similar messages
> Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x183716808/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 6 previous similar messages
> Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 30298600: cookie 0x23710b5b3c868998 [EMAIL PROTECTED] x183809041/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(mds_open.c:1474:mds_close()) Skipped 1 previous similar message
> Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x183809041/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 1 previous similar message
> Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 30205356: cookie 0x23710b5b3dfc4c91 [EMAIL PROTECTED] x183817543/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(mds_open.c:1474:mds_close()) Skipped 5 previous similar messages
> Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x183817543/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 5 previous similar messages
> Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 24185989: cookie 0x23710b5b3ed9874c [EMAIL PROTECTED] x183830524/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(mds_open.c:1474:mds_close()) Skipped 1 previous similar message
> Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x183830524/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 1 previous similar message
> Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 28959573: cookie 0x23710b5b3eda7dd4 [EMAIL PROTECTED] x183871936/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(mds_open.c:1474:mds_close()) Skipped 1 previous similar message
> Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x183871936/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 1 previous similar message
> Jan 10 13:51:11 hpcmds kernel: LustreError: 3730:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 30137171: cookie 0x23710b5b3c790ddb [EMAIL PROTECTED] x183904052/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:51:11 hpcmds kernel: LustreError: 3730:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x183904052/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:53:00 hpcmds kernel: Lustre: ufhpc-MDT0000: haven't heard from client 8514a628-1f97-0759-0932-de3e470cbb4f (at [EMAIL PROTECTED]) in 2252 seconds. I think it's dead, and I am evicting it.
> Jan 10 13:53:00 hpcmds kernel: Lustre: MGS: haven't heard from client 3e041a15-91e3-5c94-78fa-3b511d86bcd9 (at [EMAIL PROTECTED]) in 2252 seconds. I think it's dead, and I am evicting it.
> Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 30296257: cookie 0x23710b5b3c18eca5 [EMAIL PROTECTED] x184041701/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(mds_open.c:1474:mds_close()) Skipped 2 previous similar messages
> Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x184041701/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 2 previous similar messages
> Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 29841036: cookie 0x23710b5b3d470bf2 [EMAIL PROTECTED] x184196858/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(mds_open.c:1474:mds_close()) Skipped 4 previous similar messages
> Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x184196858/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 4 previous similar messages
> Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 30101082: cookie 0x23710b5b3e8e62f1 [EMAIL PROTECTED] x184423975/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(mds_open.c:1474:mds_close()) Skipped 3 previous similar messages
> Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116) [EMAIL PROTECTED] x184423975/t0 o35->[EMAIL PROTECTED]:-1 lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 3 previous similar messages
> Jan 10 18:36:57 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client e23549c4-cd61-[EMAIL PROTECTED] nid [EMAIL PROTECTED] ns: mds-ufhpc-MDT0000_UUID lock: ffff8100bb893680/0x23710b5b59ab4680 lrc: 1/0,0 mode: CR/CR res: 30066994/507692538 bits 0x3 rrc: 309 type: IBT flags: 4000030 remote: 0x4a51d1af1d561084 expref: 13 pid 31402
> Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(handler.c:1498:mds_handle()) operation 101 on unconnected MDS from [EMAIL PROTECTED]
> Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x10882458/t0 o101-><?>@<?>:-1 lens 512/0 ref 0 fl Interpret:/0/0 rc -107/0
> Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 12 previous similar messages
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff885f04e0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x6b0
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff885ec6a0>] :ptlrpc:ldlm_handle_enqueue+0xda0/0x1420
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff888430bd>] :mds:mds_handle+0x478d/0x59af
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff885f04e0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x6b0
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff888430bd>] :mds:mds_handle+0x478d/0x59af
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff80086a74>] default_wake_function+0x0/0xe
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff8001c26c>] __mod_timer+0xb0/0xbe
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be25>] child_rip+0xa/0x11
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff80086a74>] default_wake_function+0x0/0xe
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff8001c26c>] __mod_timer+0xb0/0xbe
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be1b>] child_rip+0x0/0x11
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be25>] child_rip+0xa/0x11
> Jan 10 18:45:17 hpcmds kernel:
> Jan 10 18:45:17 hpcmds kernel: ll_mdt_100 S ffff81022c2315e0 0 31377 1 31378 31376 [<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70
> Jan 10 18:45:17 hpcmds kernel: (L-TLB)
> Jan 10 18:45:17 hpcmds kernel: ffff81022c2315e0 ffff81022c231510 ffff81022c231510 000000000000000a
> Jan 10 18:45:17 hpcmds kernel: ffff8102206c90c0 ffff81013396f040 000abe5ed9a075d0 [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70
> Jan 10 18:45:17 hpcmds kernel: 0000000000009a43
> Jan 10 18:45:17 hpcmds kernel: ffff8102206c92a8 0000000000000002 ffffffff885ceb51 ffffffffffffffff
> Jan 10 18:45:17 hpcmds kernel: Call Trace:
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff80086a74>] default_wake_function+0x0/0xe
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be1b>] child_rip+0x0/0x11
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be25>] child_rip+0xa/0x11
> Jan 10 18:45:17 hpcmds kernel:
> Jan 10 18:45:17 hpcmds kernel: ll_mdt_17 S ffff81011b57f6a0 0 26533 1 26534 26532 (L-TLB)
> Jan 10 18:45:17 hpcmds kernel: ffff81011b57f6a0 [<ffffffff885ceb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
> Jan 10 18:45:17 hpcmds kernel: ffff81011b57f5d0 ffff81011b57f5d0 000000000000000a
> Jan 10 18:45:17 hpcmds kernel: ffff810118f48040 ffff8101339010c0 000abe5ed9897be8 0000000000008bd4
> Jan 10 18:45:17 hpcmds kernel: ffff810118f48228 0000000000000001 ffffffff885ceb51 ffffffffffffffff
> Jan 10 18:45:17 hpcmds kernel: Call Trace:
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff8005be1b>] child_rip+0x0/0x11
> Jan 10 18:45:17 hpcmds kernel:
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff885e6650>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff885ceb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff800611f7>] schedule_timeout+0x8a/0xad
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff80092c5e>] process_timeout+0x0/0x5
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff885e6650>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff800611f7>] schedule_timeout+0x8a/0xad
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff885e7aad>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff80092c5e>] process_timeout+0x0/0x5
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff885d05e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff885e7aad>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff80086a74>] default_wake_function+0x0/0xe
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff885d05e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff885cce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff80086a74>] default_wake_function+0x0/0xe
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff885e4460>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
> Jan 10 18:45:17 hpcmds kernel: [<ffffffff885cce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90
>
> Followed by tons more of the above tracebacks.
>
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
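
For anyone triaging a log like the one quoted above, here is a rough sketch that tallies the "processing error" return codes and labels them with their Linux errno names. The helper name tally_errors, the ERRNO_NAMES table and the sample lines are made up for illustration and are not part of Lustre; the table assumes Linux errno numbering (116 is ESTALE, 107 is ENOTCONN) and covers only the two codes that appear in the log.

```python
import re
from collections import Counter

# Assumption: Linux errno numbering. Only the two codes seen in the
# quoted log are listed; anything else is reported by number.
ERRNO_NAMES = {107: "ENOTCONN", 116: "ESTALE"}

# Matches the "processing error (-NNN)" part of target_send_reply_msg() lines.
ERROR_RE = re.compile(r"processing error \((-\d+)\)")


def tally_errors(lines):
    """Count 'processing error' codes, keyed by symbolic errno name."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            code = -int(match.group(1))  # "-116" -> 116
            counts[ERRNO_NAMES.get(code, f"errno {code}")] += 1
    return counts


# Shortened, illustrative lines in the shape of the quoted log.
sample = [
    "LustreError: 24816:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-116)",
    "LustreError: 24817:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 30269183",
    "LustreError: 31365:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-107)",
]
print(dict(tally_errors(sample)))
```

The counts are a lower bound: the kernel rate-limits these messages, so the "Skipped N previous similar messages" lines stand for further occurrences that this sketch does not add in.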
