hi, I have the same problem with Lundgren. I don't understand why it's happen. While, my network is still stable. This problem is repeated. Some time, client can't connection restored to service (OSTs or MDS) . My provisional measure is rebooting the node. Andrew, can you explain more detail and guide how to fix it?
On Wed, Oct 7, 2009 at 12:20 AM, Lundgren, Andrew < [email protected]> wrote: > Oh man, that should have read LOSING! > > > > *From:* [email protected] [mailto: > [email protected]] *On Behalf Of *Lundgren, Andrew > *Sent:* Tuesday, October 06, 2009 11:14 AM > *To:* lustre-discuss > *Subject:* [Lustre-discuss] 1.8.0 Loosing connection to the MDT for > several minutes and then recovering. > > > > We have a few 1.8.0 clusters running. We have seen multiple instances now > where the clients loose connectivity to the MDT. The MDS logs indicate that > there is some sort of problem on the MDT. > > > > The following is a typical output: > > > > Oct 6 02:56:08 mint1502 kernel: LustreError: > 28999:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096), > evicting 7523f416-2579-5f49-cd3f-54d2438b8...@net_0x20000ce213b0b_uuid > > Oct 6 02:56:08 mint1502 kernel: r...@ffff8100ac9f4000x1314647461000449/t0 > o37->7523f416-2579-5f49-cd3f-54d2438b8...@net_0x20000ce213b0b_uuid:0/0 > lens 408/360 e 1 to 0 dl 1254797793 ref 1 fl Interpret:/0/0 rc 0/0 > > Oct 6 02:56:16 mint1502 kernel: Lustre: Request x1312747398000879 sent > from content-OST001d-osc to NID 207.123.49...@tcp 7s ago has timed out > (limit 7s). > > Oct 6 02:56:16 mint1502 kernel: LustreError: 166-1: content-OST001d-osc: > Connection to service content-OST001d via nid 207.123.49...@tcp was lost; > in progress operations using this service will fail. > > Oct 6 02:56:16 mint1502 kernel: LustreError: Skipped 1 previous similar > message > > Oct 6 02:56:17 mint1502 kernel: LustreError: 166-1: content-OST001c-osc: > Connection to service content-OST001c via nid 207.123.49...@tcp was lost; > in progress operations using this service will fail. > > Oct 6 02:56:17 mint1502 kernel: LustreError: Skipped 1 previous similar > message > > Oct 6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A > client on nid 207.123.49...@tcp was evicted due to a lock blocking > callback to 207.123.49...@tcp timed out: rc -107 > > Oct 6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A > client on nid 207.123.4...@tcp was evicted due to a lock blocking callback > to 207.123.4...@tcp timed out: rc -107 > > Oct 6 02:56:18 mint1502 kernel: BUG: soft lockup - CPU#2 stuck for 10s! > [ll_mdt_rdpg_04:28999] > > Oct 6 02:56:18 mint1502 kernel: CPU 2: > > Oct 6 02:56:18 mint1502 kernel: Modules linked in: mds(U) > fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) > lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) > ipv6(U) xfrm_nalgo(U) crypto_api(U) sunrpc(U) bonding(U) > ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_tcpudp(U) xt_state(U) > ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) x_tables(U) > dm_round_robin(U) dm_rdac(U) dm_multipath(U) video(U) sbs(U) backlight(U) > i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) > parport_pc(U) lp(U) parport(U) joydev(U) i2c_i801(U) e1000e(U) sr_mod(U) > i2c_core(U) i5000_edac(U) cdrom(U) pata_acpi(U) shpchp(U) edac_mc(U) > serio_raw(U) sg(U) pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) > dm_mod(U) usb_storage(U) lpfc(U) scsi_transport_fc(U) ahci(U) ata_piix(U) > libata(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) sd_mod(U) > scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U) > > Oct 6 02:56:18 mint1502 kernel: Pid: 28999, comm: ll_mdt_rdpg_04 Tainted: > G 2.6.18-92.1.17.el5_lustre.1.8.0smp #1 > > Oct 6 02:56:18 mint1502 kernel: RIP: 0010:[<ffffffff8858f470>] > [<ffffffff8858f470>] :obdclass:lustre_hash_for_each_empty+0x1f0/0x290 > > Oct 6 02:56:18 mint1502 kernel: RSP: 0018:ffff8104c02ad850 EFLAGS: > 00000206 > > Oct 6 02:56:18 mint1502 kernel: RAX: ffff810448dfd200 RBX: > 000000000000328c RCX: 000000000000ba75 > > Oct 6 02:56:18 mint1502 kernel: RDX: ffffffffffff5e7d RSI: > ffffffff802f0d80 RDI: ffffc200109d78cc > > Oct 6 02:56:18 mint1502 kernel: RBP: ffffffff8860c8f2 R08: > ffff810001016e60 R09: 0000000000000000 > > Oct 6 02:56:18 mint1502 kernel: R10: ffff8100b7a28500 R11: > 0000000000000150 R12: ffff810448dfd200 > > Oct 6 02:56:18 mint1502 kernel: R13: 0000000000000286 R14: > 0000000000000286 R15: ffff8104c02ad7f0 > > Oct 6 02:56:18 mint1502 kernel: FS: 00002b49ff256220(0000) > GS:ffff81051fc53d40(0000) knlGS:0000000000000000 > > Oct 6 02:56:18 mint1502 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: > 000000008005003b > > Oct 6 02:56:18 mint1502 kernel: CR2: 0000003d82e985d0 CR3: > 0000000000201000 CR4: 00000000000006e0 > > Oct 6 02:56:18 mint1502 kernel: > > Oct 6 02:56:18 mint1502 kernel: Call Trace: > > Oct 6 02:56:18 mint1502 kernel: [<ffffffff885962b8>] > :obdclass:class_disconnect+0x378/0x400 > > Oct 6 02:56:18 mint1502 kernel: [<ffffffff8860e590>] > :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0 > > Oct 6 02:56:18 mint1502 kernel: [<ffffffff8891b301>] > :mds:mds_disconnect+0x121/0xe30 > > Oct 6 02:56:18 mint1502 kernel: [<ffffffff8003d514>] > lock_timer_base+0x1b/0x3c > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8004a824>] > try_to_del_timer_sync+0x51/0x5a > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff885920e4>] > :obdclass:class_fail_export+0x384/0x4c0 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff889150d1>] > :mds:mds_readpage+0x1571/0x18a0 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff886538d5>] > :ptlrpc:lustre_msg_set_limit+0x35/0xf0 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8008abb9>] > default_wake_function+0x0/0xe > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff88918470>] > :mds:mds_handle+0x2140/0x4c80 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff80143809>] __next_cpu+0x19/0x28 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff886539c5>] > :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865867d>] > :ptlrpc:ptlrpc_check_req+0x1d/0x110 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865ac03>] > :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1160 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff80062f4b>] > thread_return+0x0/0xdf > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff800891f6>] > __wake_up_common+0x3e/0x68 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865e138>] > :ptlrpc:ptlrpc_main+0x1218/0x13e0 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8008abb9>] > default_wake_function+0x0/0xe > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff800b4382>] > audit_syscall_exit+0x31b/0x336 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865cf20>] > :ptlrpc:ptlrpc_main+0x0/0x13e0 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11 > > Oct 6 02:56:19 mint1502 kernel: > > > > Those go on for a while until it recovers: > > > > Oct 6 03:01:22 mint1502 kernel: LustreError: Skipped 1 previous similar > message > > Oct 6 03:02:48 mint1502 kernel: Lustre: Request x1312747398018956 sent > from content-OST000a-osc to NID 207.123.49...@tcp 26s ago has timed out > (limit 26s). > > Oct 6 03:02:48 mint1502 kernel: Lustre: Skipped 14 previous similar > messages > > Oct 6 03:03:12 mint1502 kernel: Lustre: > 4713:0:(import.c:508:import_select_connection()) content-OST000a-osc: tried > all connections, increasing latency to 26s > > Oct 6 03:03:12 mint1502 kernel: Lustre: > 4713:0:(import.c:508:import_select_connection()) Skipped 9 previous similar > messages > > Oct 6 03:03:35 mint1502 kernel: Lustre: MGS: haven't heard from client > e5534e18-4e9d-cbf4-7f7f-deda49425fdb (at 207.123.49...@tcp) in 228 > seconds. I think it's dead, and I am evicting it. > > Oct 6 03:03:35 mint1502 kernel: Lustre: Skipped 3 previous similar > messages > > Oct 6 03:04:02 mint1502 kernel: Lustre: > 4712:0:(import.c:837:ptlrpc_connect_interpret()) > [email protected]@tcp changed server handle from > 0x25a7c6432e7fffec to 0x25a7c643369c7038 > > Oct 6 03:04:02 mint1502 kernel: but is still in recovery > > Oct 6 03:04:02 mint1502 kernel: LustreError: 167-0: This client was > evicted by content-OST000a; in progress operations using this service will > fail. > > Oct 6 03:04:02 mint1502 kernel: LustreError: Skipped 1 previous similar > message > > Oct 6 03:04:02 mint1502 kernel: Lustre: > 4712:0:(import.c:837:ptlrpc_connect_interpret()) > [email protected]@tcp changed server handle from > 0x25a7c6432e7ffff3 to 0x25a7c643369c703f > > Oct 6 03:04:02 mint1502 kernel: but is still in recovery > > Oct 6 03:04:02 mint1502 kernel: Lustre: > 19512:0:(quota_master.c:1644:mds_quota_recovery()) Only 58/57 OSTs are > active, abort quota recovery > > Oct 6 03:04:02 mint1502 kernel: Lustre: > 19512:0:(quota_master.c:1644:mds_quota_recovery()) Skipped 1 previous > similar message > > Oct 6 03:04:02 mint1502 kernel: Lustre: content-OST000a-osc: Connection > restored to service content-OST000a using nid 207.123.49...@tcp. > > Oct 6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message > > Oct 6 03:04:02 mint1502 kernel: Lustre: MDS content-MDT0000: > content-OST000a_UUID now active, resetting orphans > > Oct 6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message > > > > Anyone seen this before? > > > > -- > > Andrew > > > > _______________________________________________ > Lustre-discuss mailing list > [email protected] > http://lists.lustre.org/mailman/listinfo/lustre-discuss > >
_______________________________________________ Lustre-discuss mailing list [email protected] http://lists.lustre.org/mailman/listinfo/lustre-discuss
