Replying to myself to summarize the batch of logs I sent yesterday. I'm trying to understand the following block of logs:
It seems quite strange to me to see such large numbers of similar messages (for example, 21018 refused connections) in such a small amount of time, less than a second. Is there any way to ask Lustre WHY it is refusing that connection? Many thanks. Marc

Aug 22 18:58:13 s_kernel@n4 kernel: Lustre: 23954:0:(ldlm_lib.c:542:target_handle_reconnect()) scratch-OST0034: 1ccbe619-062d-09df-9e3b-3ff3960bfa6a reconnecting
Aug 22 18:58:13 s_kernel@n4 kernel: Lustre: 23954:0:(ldlm_lib.c:542:target_handle_reconnect()) Skipped 21021 previous similar messages
Aug 22 18:58:13 s_kernel@n4 kernel: Lustre: 23954:0:(ldlm_lib.c:799:target_handle_connect()) scratch-OST0034: refuse reconnection from [email protected]@tcp to 0xffff81070c578000; still busy with 1 active RPCs
Aug 22 18:58:13 s_kernel@n4 kernel: Lustre: 23954:0:(ldlm_lib.c:799:target_handle_connect()) Skipped 21018 previous similar messages
Aug 22 18:59:01 s_sys@n4 crond[21701]: (perf) CMD (/opt/perf/bin/ganglia_metrics.pl -jobs -bond0 &>/dev/null)
Aug 22 18:59:17 s_kernel@n4 kernel: Lustre: 24195:0:(ldlm_lib.c:799:target_handle_connect()) scratch-OST0034: refuse reconnection from [email protected]@tcp to 0xffff81070c578000; still busy with 1 active RPCs
Aug 22 18:59:17 s_kernel@n4 kernel: Lustre: 24195:0:(ldlm_lib.c:799:target_handle_connect()) Skipped previous similar messages
Aug 22 18:59:53 s_kernel@n4 kernel: Lustre: 0:0:(watchdog.c:183:lcw_cb()) Watchdog triggered for pid 13380: it was inactive for 200.00s
Aug 22 18:59:53 s_kernel@n4 kernel: Lustre: 0:0:(watchdog.c:183:lcw_cb()) Skipped 4 previous similar messages
Aug 22 18:59:53 s_kernel@n4 kernel: Lustre: 0:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 13380
Aug 22 18:59:53 s_kernel@n4 kernel: Lustre: 0:0:(linux-debug.c:222:libcfs_debug_dumpstack()) Skipped 4 previous similar messages
Aug 22 18:59:53 s_kernel@n4 kernel: ll_ost_io_49  S ffff81000900c980  0 13380  1  13381 13379 (L-TLB)
Aug 22 18:59:53 s_kernel@n4 kernel:  ffff81061fd55a10 0000000000000046 ffff8104ca43f400 0000000000000000
Aug 22 18:59:53 s_kernel@n4 kernel:  0000000000000000 000000000000000a ffff8106201db040 ffff8106400677d0
Aug 22 18:59:53 s_kernel@n4 kernel:  00063866b1dac473 0000000000000b17 ffff8106201db228 0000000100000000
Aug 22 18:59:53 s_kernel@n4 kernel: Call Trace:
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff888c5f27>] :ptlrpc:target_queue_recovery_request+0xb37/0x12b0
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff8008a88e>] default_wake_function+0x0/0xe
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff88b5d11b>] :ost:ost_handle+0x37b/0x5cd0
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff80063118>] thread_return+0x62/0xfe
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff800c5d06>] free_pages_bulk+0x1ee/0x266
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff800c9cb4>] zone_statistics+0x3e/0x6d
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff8014a88b>] __next_cpu+0x19/0x28
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff800893ba>] find_busiest_group+0x20d/0x621
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff888f7325>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff8008a228>] enqueue_task+0x41/0x56
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff888ffa6a>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff8890190d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff88903e9e>] :ptlrpc:ptlrpc_server_handle_request+0xa9e/0x1160
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff80063118>] thread_return+0x62/0xfe
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff80088c83>] __wake_up_common+0x3e/0x68
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff88907388>] :ptlrpc:ptlrpc_main+0x1208/0x13d0
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff8008a88e>] default_wake_function+0x0/0xe
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff800b5eed>] audit_syscall_exit+0x31b/0x336
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff8005e01d>] child_rip+0xa/0x11
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff88906180>] :ptlrpc:ptlrpc_main+0x0/0x13d0
Aug 22 18:59:53 s_kernel@n4 kernel:  [<ffffffff8005e013>] child_rip+0x0/0x11
Aug 22 18:59:53 s_kernel@n4 kernel:
Aug 22 18:59:53 s_kernel@n4 kernel: LustreError: dumping log to /tmp/lustre-log.1314032393.13380

On 25 August 2011 18:37, Joan Marc <[email protected]> wrote:
> Hello,
>
> I'm running into something I cannot understand. I'm quite new to Lustre.
>
> n4 and n5 are two I/O servers, each one taking care of different datasets.
> When one of them goes down, the other is supposed to take over the service
> as backup node. But n4 goes down without carrying any heavy load, and when
> n5 picks up the service its CPU and memory usage shoot to the top and
> performance drops to the bottom.
>
> Can someone tell me which lines to look at, so that from there I can start
> checking this issue?
>
> Here is a 30-minute syslog with Lustre-related messages from n4:
> http://pastebin.com/q1iGwDxw
>
> Here is almost the same 30 minutes from n5:
> http://pastebin.com/4Bg5repa
>
> Many thanks.
>
> Marc
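P.S. Regarding my own question above about asking Lustre WHY it refuses a reconnection: what I plan to try next is raising the debug mask on the OSS and dumping the in-kernel debug buffer with lctl. This is only a sketch from the manual, not something I've verified on this exact version; the output paths are mine and the right mask flags may differ per release:

```shell
# Sketch: turn on RPC and network tracing on the OSS, then capture the
# debug buffer around a "refuse reconnection" event. Paths are examples.
lctl set_param debug=+rpctrace   # add RPC tracing to the current debug mask
lctl set_param debug=+neterror   # also record network error details
lctl clear                       # empty the in-kernel debug buffer

# ...wait until the "refuse reconnection" messages appear again...

lctl dk /tmp/lustre-debug.txt    # dump the debug buffer as readable text

# Binary dumps that Lustre already wrote (like the one in the log above)
# can be converted to text with debug_file:
lctl df /tmp/lustre-log.1314032393.13380 /tmp/lustre-log.13380.txt
```

The dumped trace should show each RPC around the reconnect attempt, which hopefully explains which RPC keeps the export "still busy".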
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
