We have a few 1.8.0 clusters running. We have now seen multiple instances where the clients lose connectivity to the MDT, and the MDS logs indicate some sort of problem on the MDT.
The following is a typical output:

Oct 6 02:56:08 mint1502 kernel: LustreError: 28999:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096), evicting 7523f416-2579-5f49-cd3f-54d2438b8...@net_0x20000ce213b0b_uuid
Oct 6 02:56:08 mint1502 kernel: r...@ffff8100ac9f4000 x1314647461000449/t0 o37->7523f416-2579-5f49-cd3f-54d2438b8...@net_0x20000ce213b0b_uuid:0/0 lens 408/360 e 1 to 0 dl 1254797793 ref 1 fl Interpret:/0/0 rc 0/0
Oct 6 02:56:16 mint1502 kernel: Lustre: Request x1312747398000879 sent from content-OST001d-osc to NID 207.123.49...@tcp 7s ago has timed out (limit 7s).
Oct 6 02:56:16 mint1502 kernel: LustreError: 166-1: content-OST001d-osc: Connection to service content-OST001d via nid 207.123.49...@tcp was lost; in progress operations using this service will fail.
Oct 6 02:56:16 mint1502 kernel: LustreError: Skipped 1 previous similar message
Oct 6 02:56:17 mint1502 kernel: LustreError: 166-1: content-OST001c-osc: Connection to service content-OST001c via nid 207.123.49...@tcp was lost; in progress operations using this service will fail.
Oct 6 02:56:17 mint1502 kernel: LustreError: Skipped 1 previous similar message
Oct 6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A client on nid 207.123.49...@tcp was evicted due to a lock blocking callback to 207.123.49...@tcp timed out: rc -107
Oct 6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A client on nid 207.123.4...@tcp was evicted due to a lock blocking callback to 207.123.4...@tcp timed out: rc -107
Oct 6 02:56:18 mint1502 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [ll_mdt_rdpg_04:28999]
Oct 6 02:56:18 mint1502 kernel: CPU 2:
Oct 6 02:56:18 mint1502 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) sunrpc(U) bonding(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_tcpudp(U) xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) x_tables(U) dm_round_robin(U) dm_rdac(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) i2c_i801(U) e1000e(U) sr_mod(U) i2c_core(U) i5000_edac(U) cdrom(U) pata_acpi(U) shpchp(U) edac_mc(U) serio_raw(U) sg(U) pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) lpfc(U) scsi_transport_fc(U) ahci(U) ata_piix(U) libata(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Oct 6 02:56:18 mint1502 kernel: Pid: 28999, comm: ll_mdt_rdpg_04 Tainted: G 2.6.18-92.1.17.el5_lustre.1.8.0smp #1
Oct 6 02:56:18 mint1502 kernel: RIP: 0010:[<ffffffff8858f470>] [<ffffffff8858f470>] :obdclass:lustre_hash_for_each_empty+0x1f0/0x290
Oct 6 02:56:18 mint1502 kernel: RSP: 0018:ffff8104c02ad850 EFLAGS: 00000206
Oct 6 02:56:18 mint1502 kernel: RAX: ffff810448dfd200 RBX: 000000000000328c RCX: 000000000000ba75
Oct 6 02:56:18 mint1502 kernel: RDX: ffffffffffff5e7d RSI: ffffffff802f0d80 RDI: ffffc200109d78cc
Oct 6 02:56:18 mint1502 kernel: RBP: ffffffff8860c8f2 R08: ffff810001016e60 R09: 0000000000000000
Oct 6 02:56:18 mint1502 kernel: R10: ffff8100b7a28500 R11: 0000000000000150 R12: ffff810448dfd200
Oct 6 02:56:18 mint1502 kernel: R13: 0000000000000286 R14: 0000000000000286 R15: ffff8104c02ad7f0
Oct 6 02:56:18 mint1502 kernel: FS: 00002b49ff256220(0000) GS:ffff81051fc53d40(0000) knlGS:0000000000000000
Oct 6 02:56:18 mint1502 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct 6 02:56:18 mint1502 kernel: CR2: 0000003d82e985d0 CR3: 0000000000201000 CR4: 00000000000006e0
Oct 6 02:56:18 mint1502 kernel:
Oct 6 02:56:18 mint1502 kernel: Call Trace:
Oct 6 02:56:18 mint1502 kernel: [<ffffffff885962b8>] :obdclass:class_disconnect+0x378/0x400
Oct 6 02:56:18 mint1502 kernel: [<ffffffff8860e590>] :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0
Oct 6 02:56:18 mint1502 kernel: [<ffffffff8891b301>] :mds:mds_disconnect+0x121/0xe30
Oct 6 02:56:18 mint1502 kernel: [<ffffffff8003d514>] lock_timer_base+0x1b/0x3c
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8004a824>] try_to_del_timer_sync+0x51/0x5a
Oct 6 02:56:19 mint1502 kernel: [<ffffffff885920e4>] :obdclass:class_fail_export+0x384/0x4c0
Oct 6 02:56:19 mint1502 kernel: [<ffffffff889150d1>] :mds:mds_readpage+0x1571/0x18a0
Oct 6 02:56:19 mint1502 kernel: [<ffffffff886538d5>] :ptlrpc:lustre_msg_set_limit+0x35/0xf0
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8008abb9>] default_wake_function+0x0/0xe
Oct 6 02:56:19 mint1502 kernel: [<ffffffff88918470>] :mds:mds_handle+0x2140/0x4c80
Oct 6 02:56:19 mint1502 kernel: [<ffffffff80143809>] __next_cpu+0x19/0x28
Oct 6 02:56:19 mint1502 kernel: [<ffffffff886539c5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865867d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865ac03>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1160
Oct 6 02:56:19 mint1502 kernel: [<ffffffff80062f4b>] thread_return+0x0/0xdf
Oct 6 02:56:19 mint1502 kernel: [<ffffffff800891f6>] __wake_up_common+0x3e/0x68
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865e138>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8008abb9>] default_wake_function+0x0/0xe
Oct 6 02:56:19 mint1502 kernel: [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865cf20>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Oct 6 02:56:19 mint1502 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

Those go on for a while until it recovers:

Oct 6 03:01:22 mint1502 kernel: LustreError: Skipped 1 previous similar message
Oct 6 03:02:48 mint1502 kernel: Lustre: Request x1312747398018956 sent from content-OST000a-osc to NID 207.123.49...@tcp 26s ago has timed out (limit 26s).
Oct 6 03:02:48 mint1502 kernel: Lustre: Skipped 14 previous similar messages
Oct 6 03:03:12 mint1502 kernel: Lustre: 4713:0:(import.c:508:import_select_connection()) content-OST000a-osc: tried all connections, increasing latency to 26s
Oct 6 03:03:12 mint1502 kernel: Lustre: 4713:0:(import.c:508:import_select_connection()) Skipped 9 previous similar messages
Oct 6 03:03:35 mint1502 kernel: Lustre: MGS: haven't heard from client e5534e18-4e9d-cbf4-7f7f-deda49425fdb (at 207.123.49...@tcp) in 228 seconds. I think it's dead, and I am evicting it.
Oct 6 03:03:35 mint1502 kernel: Lustre: Skipped 3 previous similar messages
Oct 6 03:04:02 mint1502 kernel: Lustre: 4712:0:(import.c:837:ptlrpc_connect_interpret()) [email protected]@tcp changed server handle from 0x25a7c6432e7fffec to 0x25a7c643369c7038
Oct 6 03:04:02 mint1502 kernel: but is still in recovery
Oct 6 03:04:02 mint1502 kernel: LustreError: 167-0: This client was evicted by content-OST000a; in progress operations using this service will fail.
Oct 6 03:04:02 mint1502 kernel: LustreError: Skipped 1 previous similar message
Oct 6 03:04:02 mint1502 kernel: Lustre: 4712:0:(import.c:837:ptlrpc_connect_interpret()) [email protected]@tcp changed server handle from 0x25a7c6432e7ffff3 to 0x25a7c643369c703f
Oct 6 03:04:02 mint1502 kernel: but is still in recovery
Oct 6 03:04:02 mint1502 kernel: Lustre: 19512:0:(quota_master.c:1644:mds_quota_recovery()) Only 58/57 OSTs are active, abort quota recovery
Oct 6 03:04:02 mint1502 kernel: Lustre: 19512:0:(quota_master.c:1644:mds_quota_recovery()) Skipped 1 previous similar message
Oct 6 03:04:02 mint1502 kernel: Lustre: content-OST000a-osc: Connection restored to service content-OST000a using nid 207.123.49...@tcp.
Oct 6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message
Oct 6 03:04:02 mint1502 kernel: Lustre: MDS content-MDT0000: content-OST000a_UUID now active, resetting orphans
Oct 6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message

Anyone seen this before?

-- Andrew
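For what it's worth, a quick way to see how often these episodes recur is to pull the eviction/timeout events out of syslog. This is just a rough grep sketch, not anything Lustre-specific; the sample file and its path are made up here for illustration, and in practice we point it at /var/log/messages on the MDS:

```shell
# Build a small sample from the log excerpt above (hypothetical path,
# purely for demonstration).
cat > /tmp/mds-sample.log <<'EOF'
Oct 6 02:56:08 mint1502 kernel: LustreError: 28999:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096)
Oct 6 02:56:16 mint1502 kernel: Lustre: Request x1312747398000879 sent from content-OST001d-osc to NID 207.123.49...@tcp 7s ago has timed out (limit 7s).
Oct 6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A client on nid 207.123.49...@tcp was evicted due to a lock blocking callback
EOF

# Count lines mentioning evictions, RPC timeouts, or bulk failures.
grep -cE 'evict|timed out|bulk failed' /tmp/mds-sample.log
```

Against the sample above this prints 3; run over a day of real MDS syslog it gives a feel for whether the soft lockup coincides with a burst of such events.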
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
