Hi,

I forgot to ask, if your mds is failing over what happens on failover  
mds? Does it recover connections to all clients? Can you also  
describe you failover solution (e.g. heartbeat etc.)?
cheers

Wojciech Turek
On 11 Jan 2008, at 12:18, Charles Taylor wrote:

>
> Were are running a patched (mballoc and others) 1.6.3 lustre on
> x86_64 platform under Centos4.5.   Kernel is ...
>
> Linux submit.ufhpc 2.6.18-8.1.14.el5Lustre #1 SMP Fri Oct 12 15:51:56
> EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
>
> and is the same on the clients and server side.   We have both tcp
> and o2ib NIDS.
>
> It seems to work pretty well most of the time but yesterday a large
> (256-way) job started and it seems that the MDS pretty much just fell
> over.    This occurred at 18:36 (see log below).    The whole cluster
> went into recovery mode as though we had rebooted the MDS.    Prior
> to that we were getting some "target send reply msg" and "no handle
> for file close" errors.
>
> We know that each thread of the 256-way job opens a writes to a file
> independently but roughly concurrently so it probably got hit pretty
> hard all at once.   Nonetheless, we hit it just as hard in our pre-
> production testing and it held up very well.   So the questions are...
>
> 1. Does anyone recognize the errors below as known bugs for which
> there are patches/fixes we missed?
>
> 2. Looking at the tracebacks, I can't help but wonder if there is
> some ldlm parameter we need to tweak for a cluster of our size.
>
> Note that we have 320 o2ib clients and 80 tcp clients (400 nodes/1600
> cores).    Our MDS is a 4-way Opteron 2212 w/ 8GB RAM.     We have 8
> OSS's and 24 OST's (3 per OSS).   Each OSS is a 4-way Opteron 275
> with 4GB RAM and 3 2Gb/s FC HCAs connected to Xyratex 4900FC storage
> arrays.
>
> Any ideas?   Suggestions?
>
> Thanks,
>
> Charlie Taylor
> UF HPC Center
>
> Jan 10 13:34:54 hpcmds kernel: LustreError: 24816:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> [EMAIL PROTECTED] x183667281/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:34:54 hpcmds kernel: LustreError: 24816:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 2 previous similar messages
> Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 30269183: cookie
> 0x23710b5b3ed594eb  [EMAIL PROTECTED] x183716808/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(mds_open.c:
> 1474:mds_close()) Skipped 6 previous similar messages
> Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> [EMAIL PROTECTED] x183716808/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:36:54 hpcmds kernel: LustreError: 24817:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 6 previous similar messages
> Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 30298600: cookie
> 0x23710b5b3c868998  [EMAIL PROTECTED] x183809041/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(mds_open.c:
> 1474:mds_close()) Skipped 1 previous similar message
> Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> [EMAIL PROTECTED] x183809041/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:43:43 hpcmds kernel: LustreError: 24824:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 1 previous similar message
> Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 30205356: cookie
> 0x23710b5b3dfc4c91  [EMAIL PROTECTED] x183817543/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(mds_open.c:
> 1474:mds_close()) Skipped 5 previous similar messages
> Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> [EMAIL PROTECTED] x183817543/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:44:02 hpcmds kernel: LustreError: 24786:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 5 previous similar messages
> Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 24185989: cookie
> 0x23710b5b3ed9874c  [EMAIL PROTECTED] x183830524/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(mds_open.c:
> 1474:mds_close()) Skipped 1 previous similar message
> Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> [EMAIL PROTECTED] x183830524/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:44:52 hpcmds kernel: LustreError: 24818:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 1 previous similar message
> Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 28959573: cookie
> 0x23710b5b3eda7dd4  [EMAIL PROTECTED] x183871936/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(mds_open.c:
> 1474:mds_close()) Skipped 1 previous similar message
> Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> [EMAIL PROTECTED] x183871936/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:47:00 hpcmds kernel: LustreError: 24798:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 1 previous similar message
> Jan 10 13:51:11 hpcmds kernel: LustreError: 3730:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 30137171: cookie
> 0x23710b5b3c790ddb  [EMAIL PROTECTED] x183904052/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:51:11 hpcmds kernel: LustreError: 3730:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> [EMAIL PROTECTED] x183904052/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:53:00 hpcmds kernel: Lustre: ufhpc-MDT0000: haven't heard
> from client 8514a628-1f97-0759-0932-de3e470cbb4f (at
> [EMAIL PROTECTED]) in 2252 seconds. I think it's dead, and I am
> evicting it.
> Jan 10 13:53:00 hpcmds kernel: Lustre: MGS: haven't heard from client
> 3e041a15-91e3-5c94-78fa-3b511d86bcd9 (at [EMAIL PROTECTED]) in 2252
> seconds. I think it's dead, and I am evicting it.
> Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 30296257: cookie
> 0x23710b5b3c18eca5  [EMAIL PROTECTED] x184041701/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(mds_open.c:
> 1474:mds_close()) Skipped 2 previous similar messages
> Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> [EMAIL PROTECTED] x184041701/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 13:58:23 hpcmds kernel: LustreError: 5546:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 2 previous similar messages
> Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 29841036: cookie
> 0x23710b5b3d470bf2  [EMAIL PROTECTED] x184196858/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(mds_open.c:
> 1474:mds_close()) Skipped 4 previous similar messages
> Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> [EMAIL PROTECTED] x184196858/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 14:09:20 hpcmds kernel: LustreError: 24825:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 4 previous similar messages
> Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(mds_open.c:
> 1474:mds_close()) @@@ no handle for file close ino 30101082: cookie
> 0x23710b5b3e8e62f1  [EMAIL PROTECTED] x184423975/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc 0/0
> Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(mds_open.c:
> 1474:mds_close()) Skipped 3 previous similar messages
> Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-116)
> [EMAIL PROTECTED] x184423975/t0 o35-
>> [EMAIL PROTECTED]:-1
> lens 296/1736 ref 0 fl Interpret:/0/0 rc -116/0
> Jan 10 14:23:15 hpcmds kernel: LustreError: 5544:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 3 previous similar messages
> Jan 10 18:36:57 hpcmds kernel: LustreError: 0:0:(ldlm_lockd.c:
> 210:waiting_locks_callback()) ### lock callback timer expired:
> evicting client e23549c4-cd61-
> [EMAIL PROTECTED] nid
> [EMAIL PROTECTED]  ns: mds-ufhpc-MDT0000_UUID lock:
> ffff8100bb893680/0x23710b5b59ab4680 lrc: 1/0,0 mode: CR/CR res:
> 30066994/507692538 bits 0x3 rrc: 309 type: IBT flags: 4000030 remote:
> 0x4a51d1af1d561084 expref: 13 pid 31402
> Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(handler.c:
> 1498:mds_handle()) operation 101 on unconnected MDS from
> [EMAIL PROTECTED]
> Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) @@@ processing error (-107)
> [EMAIL PROTECTED] x10882458/t0 o101-><?>@<?>:-1 lens 512/0 ref 0
> fl Interpret:/0/0 rc -107/0
> Jan 10 18:36:57 hpcmds kernel: LustreError: 31365:0:(ldlm_lib.c:
> 1437:target_send_reply_msg()) Skipped 12 previous similar messages
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885f04e0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x6b0
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885ec6a0>] :ptlrpc:ldlm_handle_enqueue+0xda0/0x1420
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff888430bd>] :mds:mds_handle
> +0x478d/0x59af
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885f04e0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x6b0
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff888430bd>] :mds:mds_handle
> +0x478d/0x59af
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff8856bd62>] :obdclass:class_handle2object+0xd2/0x160
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80086a74>]
> default_wake_function+0x0/0xe
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8001c26c>] __mod_timer
> +0xb0/0xbe
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff8860a030>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be25>] child_rip+0xa/ 
> 0x11
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff88607be5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80086a74>]
> default_wake_function+0x0/0xe
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8001c26c>] __mod_timer
> +0xb0/0xbe
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be1b>] child_rip 
> +0x0/0x11
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be25>] child_rip+0xa/ 
> 0x11
> Jan 10 18:45:17 hpcmds kernel:
> Jan 10 18:45:17 hpcmds kernel: ll_mdt_100    S ffff81022c2315e0     0
> 31377      1         31378 31376
> [<ffffffff88611617>] :ptlrpc:ptlrpc_main+0x13c7/0x1d70
> Jan 10 18:45:17 hpcmds kernel:  (L-TLB)
> Jan 10 18:45:17 hpcmds kernel:  ffff81022c2315e0 ffff81022c231510
> ffff81022c231510 000000000000000a
> Jan 10 18:45:17 hpcmds kernel:  ffff8102206c90c0 ffff81013396f040
> 000abe5ed9a075d0 [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70
> Jan 10 18:45:17 hpcmds kernel:  0000000000009a43
> Jan 10 18:45:17 hpcmds kernel:  ffff8102206c92a8 0000000000000002
> ffffffff885ceb51 ffffffffffffffff
> Jan 10 18:45:17 hpcmds kernel: Call Trace:
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80086a74>]
> default_wake_function+0x0/0xe
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be1b>] child_rip 
> +0x0/0x11
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be25>] child_rip+0xa/ 
> 0x11
> Jan 10 18:45:17 hpcmds kernel:
> Jan 10 18:45:17 hpcmds kernel: ll_mdt_17     S ffff81011b57f6a0     0
> 26533      1         26534 26532 (L-TLB)
> Jan 10 18:45:17 hpcmds kernel:  ffff81011b57f6a0
> [<ffffffff885ceb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
> Jan 10 18:45:17 hpcmds kernel:  ffff81011b57f5d0 ffff81011b57f5d0
> 000000000000000a
> Jan 10 18:45:17 hpcmds kernel:  ffff810118f48040 ffff8101339010c0
> 000abe5ed9897be8 0000000000008bd4
> Jan 10 18:45:17 hpcmds kernel:  ffff810118f48228 0000000000000001
> ffffffff885ceb51 ffffffffffffffff
> Jan 10 18:45:17 hpcmds kernel: Call Trace:
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff88610250>] :ptlrpc:ptlrpc_main+0x0/0x1d70
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff8005be1b>] child_rip 
> +0x0/0x11
> Jan 10 18:45:17 hpcmds kernel:
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885e6650>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885ceb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff800611f7>] schedule_timeout
> +0x8a/0xad
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80092c5e>] process_timeout
> +0x0/0x5
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885e6650>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff800611f7>] schedule_timeout
> +0x8a/0xad
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885e7aad>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80092c5e>] process_timeout
> +0x0/0x5
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885d05e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885e7aad>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80086a74>]
> default_wake_function+0x0/0xe
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885d05e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885cce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/ 
> 0x90
> Jan 10 18:45:17 hpcmds kernel:  [<ffffffff80086a74>]
> default_wake_function+0x0/0xe
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885e4460>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
> Jan 10 18:45:17 hpcmds kernel:
> [<ffffffff885cce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/ 
> 0x90
>
>
> Followed by tons more of the above tracebacks.
>
>
>
>
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss

_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss

Reply via email to