Lundgren, Andrew wrote: > I haven’t done anything yet. The machines seem to reconnect without > intervention. But the problem occurs again later. > > > > *From:* Đào Thị Thảo [mailto:[email protected]] > *Sent:* Tuesday, October 06, 2009 9:43 PM > *To:* Lundgren, Andrew > *Cc:* lustre-discuss > *Subject:* Re: [Lustre-discuss] 1.8.0 Loosing connection to the MDT for > several minutes and then recovering. > > > > hi, > I have the same problem with Lundgren. > I don't understand why it's happen. While, my network is still stable. > This problem is repeated. Some time, client can't connection restored to > service (OSTs or MDS) . My provisional measure is rebooting the node. > Andrew, can you explain more detail and guide how to fix it? > > On Wed, Oct 7, 2009 at 12:20 AM, Lundgren, Andrew > <[email protected] <mailto:[email protected]>> wrote: > > Oh man, that should have read LOSING! > > > > *From:* [email protected] > <mailto:[email protected]> > [mailto:[email protected] > <mailto:[email protected]>] *On Behalf Of > *Lundgren, Andrew > *Sent:* Tuesday, October 06, 2009 11:14 AM > *To:* lustre-discuss > *Subject:* [Lustre-discuss] 1.8.0 Loosing connection to the MDT for > several minutes and then recovering. > > > > We have a few 1.8.0 clusters running. We have seen multiple instances > now where the clients lose connectivity to the MDT. The MDS logs > indicate that there is some sort of problem on the MDT. > > > > The following is a typical output: > > > > Oct 6 02:56:08 mint1502 kernel: LustreError: > 28999:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096), > evicting 7523f416-2579-5f49-cd3f-54d2438b8...@net_0x20000ce213b0b_uuid > > Oct 6 02:56:08 mint1502 kernel: r...@ffff8100ac9f4000 > x1314647461000449/t0 > o37->7523f416-2579-5f49-cd3f-54d2438b8...@net_0x20000ce213b0b_uuid:0/0 > lens 408/360 e 1 to 0 dl 1254797793 ref 1 fl Interpret:/0/0 rc 0/0 > > Oct 6 02:56:16 mint1502 kernel: Lustre: Request x1312747398000879 sent > from content-OST001d-osc to NID 207.123.49...@tcp 7s ago has timed out > (limit 7s). > > Oct 6 02:56:16 mint1502 kernel: LustreError: 166-1: > content-OST001d-osc: Connection to service content-OST001d via nid > 207.123.49...@tcp was lost; in progress operations using this service > will fail. > > Oct 6 02:56:16 mint1502 kernel: LustreError: Skipped 1 previous similar > message > > Oct 6 02:56:17 mint1502 kernel: LustreError: 166-1: > content-OST001c-osc: Connection to service content-OST001c via nid > 207.123.49...@tcp was lost; in progress operations using this service > will fail. > > Oct 6 02:56:17 mint1502 kernel: LustreError: Skipped 1 previous similar > message > > Oct 6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A > client on nid 207.123.49...@tcp was evicted due to a lock blocking > callback to 207.123.49...@tcp timed out: rc -107 > > Oct 6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT0000: A > client on nid 207.123.4...@tcp was evicted due to a lock blocking > callback to 207.123.4...@tcp timed out: rc -107 > > Oct 6 02:56:18 mint1502 kernel: BUG: soft lockup - CPU#2 stuck for 10s! > [ll_mdt_rdpg_04:28999] > > Oct 6 02:56:18 mint1502 kernel: CPU 2: > > Oct 6 02:56:18 mint1502 kernel: Modules linked in: mds(U) > fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) > mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) > lvfs(U) libcfs(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) sunrpc(U) > bonding(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_tcpudp(U) > xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) > x_tables(U) dm_round_robin(U) dm_rdac(U) dm_multipath(U) video(U) sbs(U) > backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) > acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) > i2c_i801(U) e1000e(U) sr_mod(U) i2c_core(U) i5000_edac(U) cdrom(U) > pata_acpi(U) shpchp(U) edac_mc(U) serio_raw(U) sg(U) pcspkr(U) > dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) lpfc(U) > scsi_transport_fc(U) ahci(U) ata_piix(U) libata(U) mptsas(U) mptscsih(U) > mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) > uhci_hcd(U) ohci_hcd(U) ehci_hcd(U) > > Oct 6 02:56:18 mint1502 kernel: Pid: 28999, comm: ll_mdt_rdpg_04 > Tainted: G 2.6.18-92.1.17.el5_lustre.1.8.0smp #1 > > Oct 6 02:56:18 mint1502 kernel: RIP: 0010:[<ffffffff8858f470>] > [<ffffffff8858f470>] :obdclass:lustre_hash_for_each_empty+0x1f0/0x290 > > Oct 6 02:56:18 mint1502 kernel: RSP: 0018:ffff8104c02ad850 EFLAGS: > 00000206 > > Oct 6 02:56:18 mint1502 kernel: RAX: ffff810448dfd200 RBX: > 000000000000328c RCX: 000000000000ba75 > > Oct 6 02:56:18 mint1502 kernel: RDX: ffffffffffff5e7d RSI: > ffffffff802f0d80 RDI: ffffc200109d78cc > > Oct 6 02:56:18 mint1502 kernel: RBP: ffffffff8860c8f2 R08: > ffff810001016e60 R09: 0000000000000000 > > Oct 6 02:56:18 mint1502 kernel: R10: ffff8100b7a28500 R11: > 0000000000000150 R12: ffff810448dfd200 > > Oct 6 02:56:18 mint1502 kernel: R13: 0000000000000286 R14: > 0000000000000286 R15: ffff8104c02ad7f0 > > Oct 6 02:56:18 mint1502 kernel: FS: 00002b49ff256220(0000) > GS:ffff81051fc53d40(0000) knlGS:0000000000000000 > > Oct 6 02:56:18 mint1502 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: > 000000008005003b > > Oct 6 02:56:18 mint1502 kernel: CR2: 0000003d82e985d0 CR3: > 0000000000201000 CR4: 00000000000006e0 > > Oct 6 02:56:18 mint1502 kernel: > > Oct 6 02:56:18 mint1502 kernel: Call Trace: > > Oct 6 02:56:18 mint1502 kernel: [<ffffffff885962b8>] > :obdclass:class_disconnect+0x378/0x400 > > Oct 6 02:56:18 mint1502 kernel: [<ffffffff8860e590>] > :ptlrpc:ldlm_cancel_locks_for_export_cb+0x0/0xc0 > > Oct 6 02:56:18 mint1502 kernel: [<ffffffff8891b301>] > :mds:mds_disconnect+0x121/0xe30 > > Oct 6 02:56:18 mint1502 kernel: [<ffffffff8003d514>] > lock_timer_base+0x1b/0x3c > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8004a824>] > try_to_del_timer_sync+0x51/0x5a > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff885920e4>] > :obdclass:class_fail_export+0x384/0x4c0 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff889150d1>] > :mds:mds_readpage+0x1571/0x18a0 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff886538d5>] > :ptlrpc:lustre_msg_set_limit+0x35/0xf0 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8008abb9>] > default_wake_function+0x0/0xe > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff88918470>] > :mds:mds_handle+0x2140/0x4c80 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff80143809>] __next_cpu+0x19/0x28 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff886539c5>] > :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865867d>] > :ptlrpc:ptlrpc_check_req+0x1d/0x110 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865ac03>] > :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1160 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff80062f4b>] > thread_return+0x0/0xdf > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff800891f6>] > __wake_up_common+0x3e/0x68 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865e138>] > :ptlrpc:ptlrpc_main+0x1218/0x13e0 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8008abb9>] > default_wake_function+0x0/0xe > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff800b4382>] > audit_syscall_exit+0x31b/0x336 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8865cf20>] > :ptlrpc:ptlrpc_main+0x0/0x13e0 > > Oct 6 02:56:19 mint1502 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11 > > Oct 6 02:56:19 mint1502 kernel: > > > > Those go on for a while until it recovers: > > > > Oct 6 03:01:22 mint1502 kernel: LustreError: Skipped 1 previous similar > message > > Oct 6 03:02:48 mint1502 kernel: Lustre: Request x1312747398018956 sent > from content-OST000a-osc to NID 207.123.49...@tcp 26s ago has timed out > (limit 26s). > > Oct 6 03:02:48 mint1502 kernel: Lustre: Skipped 14 previous similar > messages > > Oct 6 03:03:12 mint1502 kernel: Lustre: > 4713:0:(import.c:508:import_select_connection()) content-OST000a-osc: > tried all connections, increasing latency to 26s > > Oct 6 03:03:12 mint1502 kernel: Lustre: > 4713:0:(import.c:508:import_select_connection()) Skipped 9 previous > similar messages > > Oct 6 03:03:35 mint1502 kernel: Lustre: MGS: haven't heard from client > e5534e18-4e9d-cbf4-7f7f-deda49425fdb (at 207.123.49...@tcp) in 228 > seconds. I think it's dead, and I am evicting it. > > Oct 6 03:03:35 mint1502 kernel: Lustre: Skipped 3 previous similar messages > > Oct 6 03:04:02 mint1502 kernel: Lustre: > 4712:0:(import.c:837:ptlrpc_connect_interpret()) > [email protected] > <mailto:[email protected]>@tcp changed server handle > from 0x25a7c6432e7fffec to 0x25a7c643369c7038 > > Oct 6 03:04:02 mint1502 kernel: but is still in recovery > > Oct 6 03:04:02 mint1502 kernel: LustreError: 167-0: This client was > evicted by content-OST000a; in progress operations using this service > will fail. > > Oct 6 03:04:02 mint1502 kernel: LustreError: Skipped 1 previous similar > message > > Oct 6 03:04:02 mint1502 kernel: Lustre: > 4712:0:(import.c:837:ptlrpc_connect_interpret()) > [email protected] > <mailto:[email protected]>@tcp changed server handle > from 0x25a7c6432e7ffff3 to 0x25a7c643369c703f > > Oct 6 03:04:02 mint1502 kernel: but is still in recovery > > Oct 6 03:04:02 mint1502 kernel: Lustre: > 19512:0:(quota_master.c:1644:mds_quota_recovery()) Only 58/57 OSTs are > active, abort quota recovery > > Oct 6 03:04:02 mint1502 kernel: Lustre: > 19512:0:(quota_master.c:1644:mds_quota_recovery()) Skipped 1 previous > similar message > > Oct 6 03:04:02 mint1502 kernel: Lustre: content-OST000a-osc: Connection > restored to service content-OST000a using nid 207.123.49...@tcp. > > Oct 6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message > > Oct 6 03:04:02 mint1502 kernel: Lustre: MDS content-MDT0000: > content-OST000a_UUID now active, resetting orphans > > Oct 6 03:04:02 mint1502 kernel: Lustre: Skipped 1 previous similar message > >
I've just seen it with a 1.8.1 server (and 1.6 clients): Oct 4 18:21:30 sn01 kernel: CPU 5: Oct 4 18:21:30 sn01 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) ob dclass(U) lnet(U) lvfs(U) libcfs(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) dm_mirror(U) dm_log(U) dm_multipath(U) scsi_dh(U) dm_mod(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) s g(U) shpchp(U) netxen_nic(U) ide_cd(U) serio_raw(U) i5000_edac(U) pcspkr(U) edac_mc(U) bnx2(U) cdrom(U) ata_piix(U) libata(U) mptsas(U) mptscsih(U) scsi_transport_sas(U) mp tbase(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U) Oct 4 18:21:30 sn01 kernel: Pid: 831, comm: ll_mdt_39 Tainted: G 2.6.18-128.1.14.el5_lustre.1.8.1 #1 Oct 4 18:21:30 sn01 kernel: RIP: 0010:[<ffffffff88504ba0>] [<ffffffff88504ba0>] :obdclass:lustre_hash_for_each_empty+0x220/0x2b0 Oct 4 18:21:30 sn01 kernel: RSP: 0018:ffff810123af7a30 EFLAGS: 00000202 Oct 4 18:21:30 sn01 kernel: RAX: 000000000000ffff RBX: 0000000000002662 RCX: 00000000000083dc Oct 4 18:21:30 sn01 kernel: RDX: 00000000000002dc RSI: ffffffff802f9e80 RDI: ffffc2001225361c Oct 4 18:21:30 sn01 kernel: RBP: ffffffff885838e2 R08: ffff81000102fee0 R09: 0000000000000000 Oct 4 18:21:30 sn01 kernel: R10: ffff8100aa9d5cc0 R11: 0000000000000150 R12: ffff8100acf36800 Oct 4 18:21:30 sn01 kernel: R13: 0000000000000286 R14: 0000000000000286 R15: ffff810123af79d0 Oct 4 18:21:30 sn01 kernel: FS: 00002ab6a6887230(0000) GS:ffff810106751740(0000) knlGS:0000000000000000 Oct 4 18:21:30 sn01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Oct 4 18:21:30 sn01 kernel: CR2: 00002ab47fe89000 CR3: 0000000000201000 CR4: 00000000000006e0 Oct 4 18:21:30 sn01 kernel: Oct 4 18:21:30 sn01 kernel: Call Trace: Oct 4 18:21:30 sn01 kernel: [<ffffffff8850ba88>] :obdclass:class_disconnect+0x398/0x420 Oct 4 18:21:30 sn01 kernel: [<ffffffff888a65b1>] :mds:mds_disconnect+0x121/0xe40 Oct 4 18:21:30 sn01 kernel: [<ffffffff885c5ea4>] :ptlrpc:lustre_msg_add_version+0x34/0x110 Oct 4 18:21:30 sn01 kernel: [<ffffffff885cb834>] :ptlrpc:lustre_msg_set_timeout+0x34/0x110 Oct 4 18:21:30 sn01 kernel: [<ffffffff885c8a79>] :ptlrpc:lustre_pack_reply+0x29/0xb0 Oct 4 18:21:30 sn01 kernel: [<ffffffff88591381>] :ptlrpc:target_handle_disconnect+0x461/0x560 Oct 4 18:21:30 sn01 kernel: [<ffffffff885c4f35>] :ptlrpc:lustre_msg_get_opc+0x35/0xf0 Oct 4 18:21:30 sn01 kernel: [<ffffffff888a2324>] :mds:mds_handle+0xd74/0x4cb0 Oct 4 18:21:30 sn01 kernel: [<ffffffff80148d4f>] __next_cpu+0x19/0x28 Oct 4 18:21:30 sn01 kernel: [<ffffffff80088f32>] find_busiest_group+0x20d/0x621 Oct 4 18:21:30 sn01 kernel: [<ffffffff885caa15>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0 Oct 4 18:21:30 sn01 kernel: [<ffffffff80089d89>] enqueue_task+0x41/0x56 Oct 4 18:21:30 sn01 kernel: [<ffffffff885cf72d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110 Oct 4 18:21:30 sn01 kernel: [<ffffffff885d1e67>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160 Oct 4 18:21:30 sn01 kernel: [<ffffffff8003dc3f>] lock_timer_base+0x1b/0x3c Oct 4 18:21:30 sn01 kernel: [<ffffffff80088819>] __wake_up_common+0x3e/0x68 Oct 4 18:21:30 sn01 kernel: [<ffffffff885d5908>] :ptlrpc:ptlrpc_main+0x1218/0x13e0 Oct 4 18:21:30 sn01 kernel: [<ffffffff8008a3ef>] default_wake_function+0x0/0xe Oct 4 18:21:30 sn01 kernel: [<ffffffff800b48dd>] audit_syscall_exit+0x327/0x342 Oct 4 18:21:30 sn01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11 Oct 4 18:21:30 sn01 kernel: [<ffffffff885d46f0>] :ptlrpc:ptlrpc_main+0x0/0x13e0 Oct 4 18:21:30 sn01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11 Oct 4 18:21:30 sn01 kernel: Oct 4 18:21:40 sn01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_mdt_39:831] Oct 4 18:21:40 sn01 kernel: CPU 5: Oct 4 18:21:40 sn01 kernel: Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) ob dclass(U) lnet(U) lvfs(U) libcfs(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) dm_mirror(U) dm_log(U) dm_multipath(U) scsi_dh(U) dm_mod(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) s g(U) shpchp(U) netxen_nic(U) ide_cd(U) serio_raw(U) i5000_edac(U) pcspkr(U) edac_mc(U) bnx2(U) cdrom(U) ata_piix(U) libata(U) mptsas(U) mptscsih(U) scsi_transport_sas(U) mp tbase(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U) Oct 4 18:21:40 sn01 kernel: Pid: 831, comm: ll_mdt_39 Tainted: G 2.6.18-128.1.14.el5_lustre.1.8.1 #1 Oct 4 18:21:40 sn01 kernel: RIP: 0010:[<ffffffff88504b98>] [<ffffffff88504b98>] :obdclass:lustre_hash_for_each_empty+0x218/0x2b0 Oct 4 18:21:40 sn01 kernel: RSP: 0018:ffff810123af7a30 EFLAGS: 00000206 Oct 4 18:21:40 sn01 kernel: RAX: 000000000000ffff RBX: 0000000000000543 RCX: 000000000000bf15 Oct 4 18:21:40 sn01 kernel: RDX: 000000000000036e RSI: ffffffff802f9e80 RDI: ffffc2001223243c Oct 4 18:21:40 sn01 kernel: RBP: ffffffff885838e2 R08: ffff81000102fee0 R09: 0000000000000000 Oct 4 18:21:40 sn01 kernel: R10: ffff8100a59c1e00 R11: 0000000000000150 R12: ffff8100a5f6f800 Oct 4 18:21:40 sn01 kernel: R13: 0000000000000286 R14: 0000000000000286 R15: ffff810123af79d0 Oct 4 18:21:40 sn01 kernel: FS: 00002ab6a6887230(0000) GS:ffff810106751740(0000) knlGS:0000000000000000 Oct 4 18:21:40 sn01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Oct 4 18:21:40 sn01 kernel: CR2: 00002ab47fe89000 CR3: 0000000000201000 CR4: 00000000000006e0 Oct 4 18:21:40 sn01 kernel: Oct 4 18:21:40 sn01 kernel: [<ffffffff888a65b1>] :mds:mds_disconnect+0x121/0xe40 Oct 4 18:21:40 sn01 kernel: [<ffffffff885c5ea4>] :ptlrpc:lustre_msg_add_version+0x34/0x110 Oct 4 18:21:40 sn01 kernel: [<ffffffff885cb834>] :ptlrpc:lustre_msg_set_timeout+0x34/0x110 Oct 4 18:21:40 sn01 kernel: [<ffffffff885c8a79>] :ptlrpc:lustre_pack_reply+0x29/0xb0 Oct 4 18:21:40 sn01 kernel: [<ffffffff88591381>] :ptlrpc:target_handle_disconnect+0x461/0x560 Oct 4 18:21:40 sn01 kernel: [<ffffffff885c4f35>] :ptlrpc:lustre_msg_get_opc+0x35/0xf0 Oct 4 18:21:40 sn01 kernel: [<ffffffff888a2324>] :mds:mds_handle+0xd74/0x4cb0 Oct 4 18:21:40 sn01 kernel: [<ffffffff80148d4f>] __next_cpu+0x19/0x28 Oct 4 18:21:40 sn01 kernel: [<ffffffff80088f32>] find_busiest_group+0x20d/0x621 Oct 4 18:21:40 sn01 kernel: [<ffffffff885caa15>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0 Oct 4 18:21:40 sn01 kernel: [<ffffffff80089d89>] enqueue_task+0x41/0x56 Oct 4 18:21:40 sn01 kernel: [<ffffffff885cf72d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110 Oct 4 18:21:40 sn01 kernel: [<ffffffff885d1e67>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160 Oct 4 18:21:40 sn01 kernel: [<ffffffff8003dc3f>] lock_timer_base+0x1b/0x3c Oct 4 18:21:40 sn01 kernel: [<ffffffff80088819>] __wake_up_common+0x3e/0x68 Oct 4 18:21:40 sn01 kernel: [<ffffffff885d5908>] :ptlrpc:ptlrpc_main+0x1218/0x13e0 Oct 4 18:21:40 sn01 kernel: [<ffffffff8008a3ef>] default_wake_function+0x0/0xe Oct 4 18:21:40 sn01 kernel: [<ffffffff800b48dd>] audit_syscall_exit+0x327/0x342 Oct 4 18:21:40 sn01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11 Oct 4 18:21:40 sn01 kernel: [<ffffffff885d46f0>] :ptlrpc:ptlrpc_main+0x0/0x13e0 Oct 4 18:21:40 sn01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11 Oct 4 18:21:40 sn01 kernel: s ago has timed out (limit 7s). 0/0 rc 0/0 will wait for recovery to complete. Oct 4 18:21:48 sn01 kernel: Lustre: Skipped 1 previous similar message Oct 4 18:21:48 sn01 kernel: Lustre: lustre_0-OST001c-osc: Connection restored to service lustre_0-OST001c using nid 10.1.4....@tcp. Oct 4 18:21:50 sn01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_mdt_39:831] Oct 4 18:21:50 sn01 kernel: CPU 5: tbase(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U) Oct 4 18:21:50 sn01 kernel: Pid: 831, comm: ll_mdt_39 Tainted: G 2.6.18-128.1.14.el5_lustre.1.8.1 #1 Oct 4 18:21:50 sn01 kernel: RIP: 0010:[<ffffffff88504ba0>] [<ffffffff88504ba0>] :obdclass:lustre_hash_for_each_empty+0x220/0x2b0 Oct 4 18:21:50 sn01 kernel: RSP: 0018:ffff810123af7a30 EFLAGS: 00000202 Which repeats several times, before Oct 4 18:24:31 sn01 kernel: Lustre: 831:0:(service.c:1369:ptlrpc_server_handle_request()) @@@ Request x44330119 took longer than estimated (100+90s); client may timeout. r...@ffff810188454c00 x44330119/t0 o39->e64e59dd-a66e-9210-ed51-5bcdebd74...@net_0x200000a010713_uuid:0/0 lens 128/128 e 0 to 0 dl 1254676981 ref 1 fl Complete:/0/0 rc 0/0 Oct 4 18:24:31 sn01 kernel: LustreError: 4417:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 r...@ffff81009432b800 x1310439138925892/t0 o400->lustre_0-OST001 [email protected]@tcp:28/4 lens 192/384 e 0 to 1 dl 1254676908 ref 1 fl Complete:XN/0/0 rc -11/0 Oct 4 18:24:31 sn01 kernel: LustreError: 4417:0:(events.c:66:request_out_callback()) Skipped 6 previous similar messages Oct 4 18:24:31 sn01 kernel: Lustre: 864:0:(ldlm_lib.c:541:target_handle_reconnect()) lustre_0-MDT0000: dbb6a3f0-9aca-15cb-c931-831d85f03526 reconnecting Oct 4 18:24:31 sn01 kernel: Lustre: 885:0:(ldlm_lib.c:766:target_handle_connect()) lustre_0-MDT0000: exp ffff8101888d4200 already connecting Oct 4 18:24:31 sn01 kernel: LustreError: 885:0:(ldlm_lib.c:1850:target_send_reply_msg()) @@@ processing error (-114) r...@ffff8101ad1b6c00 x2087169/t0 o38->dbb6a3f0-9aca-1 5cb-c931-831d85f03...@net_0x200000a01070d_uuid:0/0 lens 304/200 e 0 to 0 dl 1254677171 ref 1 fl Interpret:/0/0 rc -114/0 Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4....@tcp portal 4 match 1310439138926075 offset 0 length 264: 2 Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4....@tcp portal 4 match 1310439138926074 offset 0 length 264: 2 Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4....@tcp portal 4 match 1310439138926073 offset 0 length 264: 2 Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4....@tcp portal 4 match 1310439138926133 offset 0 length 264: 2 Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4....@tcp portal 4 match 1310439138926135 offset 0 length 264: 2 Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4....@tcp portal 4 match 1310439138926136 offset 0 length 264: 2 Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4....@tcp portal 4 match 1310439138926134 offset 0 length 264: 2 Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4....@tcp portal 4 match 1310439138926184 offset 0 length 264: 2 Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4....@tcp portal 4 match 1310439138926186 offset 0 length 264: 2 Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4....@tcp portal 4 match 1310439138926185 offset 0 length 264: 2 Oct 4 18:24:31 sn01 kernel: Lustre: 4417:0:(lib-move.c:1782:lnet_parse_put()) Dropping PUT from 12345-10.1.4....@tcp portal 4 match 1310439138926187 offset 0 length 264: 2 Oct 4 18:24:31 sn01 kernel: Lustre: lustre_0-OST001c-osc: Connection restored to service lustre_0-OST001c using nid 10.1.4....@tcp. Chris _______________________________________________ Lustre-discuss mailing list [email protected] http://lists.lustre.org/mailman/listinfo/lustre-discuss
