Hi all, first of all I'm sorry that I cannot write a better bug report: I'm far away from the host and right now there is no remote access. My colleague sent this in by email:
Nov 9 19:05:55 node1 kernel: [<ffffffff88591357>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
Nov 9 19:05:55 node1 kernel: [<ffffffff8003d382>] lock_timer_base+0x1b/0x3c
Nov 9 19:05:55 node1 kernel: [<ffffffff8008881d>] __wake_up_common+0x3e/0x68
Nov 9 19:05:55 node1 kernel: [<ffffffff88594e08>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Nov 9 19:05:55 node1 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Nov 9 19:05:55 node1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov 9 19:05:55 node1 kernel: [<ffffffff88593bf0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Nov 9 19:05:55 node1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Nov 9 19:05:55 node1 kernel:
Nov 9 19:05:55 node1 kernel: Lustre: 4381:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.3....@tcp
Nov 9 19:06:05 node1 last message repeated 409322 times
Nov 9 19:06:05 node1 kernel: BUG: soft lockup - CPU#1 stuck for 10s! [ll_ost_82:4381]
Nov 9 19:06:05 node1 kernel: CPU 1:
Nov 9 19:06:05 node1 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U) mptbase(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) lockd(U) sunrpc(U) cpufreq_ondemand(U) acpi_cpufreq(U) freq_table(U) dm_mirror(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) igb(U) shpchp(U) pcspkr(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) cciss(U) ata_piix(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Nov 9 19:06:05 node1 kernel: Pid: 4381, comm: ll_ost_82 Tainted: G 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Nov 9 19:06:05 node1 kernel: RIP: 0010:[<ffffffff80064ae0>] [<ffffffff80064ae0>] _spin_lock+0x3/0xa
Nov 9 19:06:05 node1 kernel: RSP: 0018:ffff8102217b7758 EFLAGS: 00000246
Nov 9 19:06:05 node1 kernel: RAX: 0000000000000008 RBX: ffff81022c80b400 RCX: 0000000000000000
Nov 9 19:06:05 node1 kernel: RDX: ffff81023df660a0 RSI: 0000000000000000 RDI: ffff81023df66250
Nov 9 19:06:05 node1 kernel: RBP: ffff81022c80b400 R08: ffff81022c80b530 R09: 0000000000000000
Nov 9 19:06:05 node1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000013
Nov 9 19:06:05 node1 kernel: R13: ffffffff8857327c R14: 0000000500000000 R15: 0000000000000007
Nov 9 19:06:05 node1 kernel: FS: 00002b4cfcfab230(0000) GS:ffff810107ed96c0(0000) knlGS:0000000000000000
Nov 9 19:06:05 node1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 9 19:06:05 node1 kernel: CR2: 00002abf9b119158 CR3: 0000000000201000 CR4: 00000000000006e0
Nov 9 19:06:05 node1 kernel:
Nov 9 19:06:05 node1 kernel: Call Trace:
Nov 9 19:06:05 node1 kernel: [<ffffffff8857cafc>] :ptlrpc:ptlrpc_queue_wait+0x103c/0x1690
Nov 9 19:06:05 node1 kernel: [<ffffffff8858a515>] :ptlrpc:lustre_msg_set_opc+0x45/0x120
Nov 9 19:06:05 node1 kernel: [<ffffffff88574085>] :ptlrpc:ptlrpc_at_set_req_timeout+0x85/0xd0
Nov 9 19:06:05 node1 kernel: [<ffffffff885748a9>] :ptlrpc:ptlrpc_prep_req_pool+0x619/0x6b0
Nov 9 19:06:05 node1 kernel: [<ffffffff8008a3f3>] default_wake_function+0x0/0xe
Nov 9 19:06:05 node1 kernel: [<ffffffff88564196>] :ptlrpc:ldlm_server_glimpse_ast+0x266/0x3b0
Nov 9 19:06:05 node1 kernel: [<ffffffff88570f03>] :ptlrpc:interval_iterate_reverse+0x73/0x240
Nov 9 19:06:05 node1 kernel: [<ffffffff88558f20>] :ptlrpc:ldlm_process_extent_lock+0x0/0xad0

History: the cluster has been up for approximately 10 days. It has only one MDS and two OSS machines. On the second day node1 locked up with no usable messages on the screen, and messages in the log similar to the ones above. After I restarted it, the cluster ran for a week without any bigger error messages. On Sunday we removed the servers, so we had to restart them.
This morning node2 locked up, and after a few hours node1 also started to give up. The systems are based on CentOS 5.4 with the official packages from Sun. Is there any known bug about this?

Thank you,
tamas
