On Tue, May 20, 2008 at 10:24:25PM +0200, Jakob Goldbach wrote: >> Hm, so you actually have a circular loop? >Yes - I've asked for help on the OpenVZ list as well - Pavel Emelyanov >provided me with a debug patch. This patch has now confirmed the loop in >__d_lookup.
we're also seeing __d_lookup soft lockups. 4 are attached. one was a cached_lookup but the first line of that fn is a __d_lookup which is where I suspect is where the real soft lockup occurred. after a while the node is toast and has to be rebooted. kernel is 2.6.23.17 with patchless lustre 1.6.4.3, modified with patches from bz 14250 (attachment 14109) and 13378 (attachment 12276). using o2ib, x86_64, rhel5.1 cheers, robin ps. I'm glad to see some Lustre support for modern linux kernels and would like to see a lot more! the VM's of distro kernel sometimes behave erratically enough that we can't use them on Lustre client nodes in production.
May 6 14:36:53 x6 kernel: BUG: soft lockup - CPU#2 stuck for 11s! [rsync:14200] May 6 14:36:53 x6 kernel: CPU 2: May 6 14:36:53 x6 kernel: Modules linked in: ext3 jbd loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod battery ac sg sd_mod i2c_i801 i2c_core ahci ata_piix libata ib_mthca scsi_mod i5000_edac ib_mad edac_core ib_core ehci_hcd shpchp uhci_hcd rng_core button nfs nfs_acl lockd sunrpc e1000 May 6 14:36:53 x6 kernel: Pid: 14200, comm: rsync Not tainted 2.6.23.17 #1 May 6 14:36:53 x6 kernel: RIP: 0010:[<ffffffff80282cef>] [<ffffffff80282cef>] __d_lookup+0xed/0x110 May 6 14:36:53 x6 kernel: RSP: 0018:ffff81011b7ebbe8 EFLAGS: 00000286 May 6 14:36:53 x6 kernel: RAX: ffff81011d71aea0 RBX: ffff81011d71aea0 RCX: 0000000000000014 May 6 14:36:53 x6 kernel: RDX: 00000000000cb941 RSI: ffff81011b7ebcb8 RDI: ffff810120522cb8 May 6 14:36:53 x6 kernel: RBP: ffff8101d2707408 R08: 0000000000000007 R09: 0000000000000007 May 6 14:36:53 x6 kernel: R10: 00007fffffffb750 R11: ffffffff802c7d3e R12: ffffffff8853dc60 May 6 14:36:53 x6 kernel: R13: ffff8100992a5470 R14: ffff81024adb3af8 R15: ffff810219defe80 May 6 14:36:53 x6 kernel: FS: 00002aaaab6296e0(0000) GS:ffff81025fc6d840(0000) knlGS:0000000000000000 May 6 14:36:53 x6 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b May 6 14:36:53 x6 kernel: CR2: 000000000079f018 CR3: 000000018c1d0000 CR4: 00000000000006e0 May 6 14:36:53 x6 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 May 6 14:36:54 x6 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 May 6 14:36:54 x6 kernel: May 6 14:36:54 x6 kernel: Call Trace: May 6 14:36:54 x6 kernel: [<ffffffff80282cd5>] __d_lookup+0xd3/0x110 May 6 14:36:54 x6 kernel: [<ffffffff80279b1e>] do_lookup+0x2a/0x1ae May 6 14:36:54 x6 kernel: [<ffffffff8027bc4d>] __link_path_walk+0x924/0xde9 May 6 14:36:54 x6 kernel: [<ffffffff8027c16a>] link_path_walk+0x58/0xe0 May 6 14:36:54 x6 kernel: [<ffffffff8027c536>] do_path_lookup+0x1ab/0x1cf May 6 14:36:54 x6 kernel: [<ffffffff8027afe4>] getname+0x14c/0x191 May 6 14:36:54 x6 kernel: [<ffffffff8027cd67>] __user_walk_fd+0x37/0x53 May 6 14:36:54 x6 kernel: [<ffffffff80275a7b>] vfs_lstat_fd+0x18/0x47 May 6 14:36:54 x6 kernel: [<ffffffff80275c6d>] sys_newlstat+0x19/0x31 May 6 14:36:54 x6 kernel: [<ffffffff8020b3ae>] system_call+0x7e/0x83 May 8 09:32:36 x10 kernel: BUG: soft lockup - CPU#1 stuck for 11s! [bonnie++.mpi:31720] May 8 09:32:36 x10 kernel: CPU 1: May 8 09:32:36 x10 kernel: Modules linked in: loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod battery ac sg sd_mod i5000_edac edac_core ehci_hcd ahci rng_core ata_piix libata i2c_i801 scsi_mod i2c_core ib_mthca ib_mad shpchp uhci_hcd ib_core button nfs nfs_acl lockd sunrpc e1000 May 8 09:32:36 x10 kernel: Pid: 31720, comm: bonnie++.mpi Not tainted 2.6.23.17 #1 May 8 09:32:36 x10 kernel: RIP: 0010:[<ffffffff80282cef>] [<ffffffff80282cef>] __d_lookup+0xed/0x110 May 8 09:32:36 x10 kernel: RSP: 0018:ffff81014248fd78 EFLAGS: 00000286 May 8 09:32:36 x10 kernel: RAX: ffff8101a79cc500 RBX: ffff8101a79cc500 RCX: 0000000000000014 May 8 09:32:36 x10 kernel: RDX: 00000000000ac16f RSI: ffff81014248feb8 RDI: ffff8101ff617898 May 8 09:32:36 x10 kernel: RBP: ffff8101ff617898 R08: 0000000000000000 R09: ffff81025fd3d5c0 May 8 09:32:37 x10 kernel: R10: 0000000000000001 R11: ffffffff802c7d3e R12: ffffffff8027c1e0 May 8 09:32:37 x10 kernel: R13: ffff81014248fea8 R14: 0000000000000000 R15: ffff81025fd3d5c0 May 8 09:32:37 x10 kernel: FS: 00002aaaac95a990(0000) GS:ffff81025fc6de40(0000) knlGS:0000000000000000 May 8 09:32:37 x10 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b May 8 09:32:37 x10 kernel: CR2: 00002aaaab1abe20 CR3: 00000001b5525000 CR4: 00000000000006e0 May 8 09:32:37 x10 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 May 8 09:32:37 x10 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 May 8 09:32:37 x10 kernel: May 8 09:32:37 x10 kernel: Call Trace: May 8 09:32:37 x10 kernel: [<ffffffff80279cba>] cached_lookup+0x18/0x86 May 8 09:32:37 x10 kernel: [<ffffffff80279d87>] lookup_hash+0x5f/0xb5 May 8 09:32:37 x10 kernel: [<ffffffff8027d146>] open_namei+0xf4/0x631 May 8 09:32:37 x10 kernel: [<ffffffff802738b1>] file_move+0x1d/0x51 May 8 09:32:37 x10 kernel: [<ffffffff8027148a>] __dentry_open+0x101/0x1ad May 8 09:32:37 x10 kernel: [<ffffffff802715dc>] do_filp_open+0x1c/0x38 May 8 09:32:37 x10 kernel: [<ffffffff80271371>] get_unused_fd_flags+0x115/0x126 May 8 09:32:37 x10 kernel: [<ffffffff8027163e>] do_sys_open+0x46/0xca May 8 09:32:37 x10 kernel: [<ffffffff8020b3ae>] system_call+0x7e/0x83 May 16 06:10:27 x3 kernel: BUG: soft lockup - CPU#1 stuck for 11s! [python:2684] May 16 06:10:27 x3 kernel: CPU 1: May 16 06:10:27 x3 kernel: Modules linked in: loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod battery ac sg sd_mod i5000_edac rng_core ata_piix ahci edac_core ib_mthca uhci_hcd i2c_i801 libata i2c_core ib_mad scsi_mod ib_core button shpchp ehci_hcd nfs nfs_acl lockd sunrpc e1000 May 16 06:10:28 x3 kernel: Pid: 2684, comm: python Not tainted 2.6.23.17 #1 May 20 12:27:55 x14 kernel: BUG: soft lockup - CPU#0 stuck for 11s! [ps:18064] May 20 12:27:55 x14 kernel: CPU 0: May 20 12:27:55 x14 kernel: Modules linked in: loop osc mgc lustre lov lquota mdc ko2iblnd ptlrpc obdclass lnet lvfs libcfs rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_srp ib_ipoib ib_cm ib_sa ib_uverbs ib_umad binfmt_misc dm_mirror dm_multipath dm_mod battery ac sg sd_mod i5000_edac edac_core rng_core ata_piix ehci_hcd ahci libata ib_mthca shpchp uhci_hcd scsi_mod ib_mad i2c_i801 ib_core i2c_core button nfs nfs_acl lockd sunrpc e1000 May 20 12:27:55 x14 kernel: Pid: 18064, comm: ps Not tainted 2.6.23.17 #1 May 20 12:27:55 x14 kernel: RIP: 0010:[<ffffffff80282cef>] [<ffffffff80282cef>] __d_lookup+0xed/0x110 May 20 12:27:55 x14 kernel: RSP: 0018:ffff81023bea7bb8 EFLAGS: 00000286 May 20 12:27:55 x14 kernel: RAX: ffff81018ce277a0 RBX: ffff81018ce277a0 RCX: 0000000000000014 May 20 12:27:55 x14 kernel: RDX: 000000000008af04 RSI: ffff81023bea7c88 RDI: ffff810202099dd8 May 20 12:27:55 x14 kernel: RBP: ffff8101c18c49c0 R08: 0000000000000007 R09: 0000000000000007 May 20 12:27:55 x14 kernel: R10: ffff81025fef7a18 R11: ffffffff802c7d3e R12: ffffffff88061600 May 20 12:27:55 x14 kernel: R13: ffffffff88061640 R14: ffff8101c18c49c0 R15: ffffffff8045a22d May 20 12:27:55 x14 kernel: FS: 00002aaaab43cf30(0000) GS:ffffffff804d4000(0000) knlGS:0000000000000000 May 20 12:27:55 x14 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b May 20 12:27:55 x14 kernel: CR2: 00002aaaab17a548 CR3: 000000012dad3000 CR4: 00000000000006e0 May 20 12:27:56 x14 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 May 20 12:27:56 x14 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 May 20 12:27:56 x14 kernel: May 20 12:27:56 x14 kernel: Call Trace: May 20 12:27:56 x14 kernel: [<ffffffff80282cd5>] __d_lookup+0xd3/0x110 May 20 12:27:56 x14 kernel: [<ffffffff80279b1e>] do_lookup+0x2a/0x1ae May 20 12:27:56 x14 kernel: [<ffffffff8027bc4d>] __link_path_walk+0x924/0xde9 May 20 12:27:56 x14 kernel: [<ffffffff8027c16a>] link_path_walk+0x58/0xe0 May 20 12:27:56 x14 kernel: [<ffffffff80271371>] get_unused_fd_flags+0x115/0x126 May 20 12:27:56 x14 kernel: [<ffffffff802d742e>] bitmap_scnprintf+0x8d/0xac May 20 12:27:56 x14 kernel: [<ffffffff8027c536>] do_path_lookup+0x1ab/0x1cf May 20 12:27:56 x14 kernel: [<ffffffff8027cf26>] __path_lookup_intent_open+0x56/0x96 May 20 12:27:56 x14 kernel: [<ffffffff8027d0ca>] open_namei+0x78/0x631 May 20 12:27:56 x14 kernel: [<ffffffff80257561>] __alloc_pages+0x6c/0x2c1 May 20 12:27:56 x14 kernel: [<ffffffff802715dc>] do_filp_open+0x1c/0x38 May 20 12:27:56 x14 kernel: [<ffffffff80271371>] get_unused_fd_flags+0x115/0x126 May 20 12:27:56 x14 kernel: [<ffffffff8027163e>] do_sys_open+0x46/0xca May 20 12:27:56 x14 kernel: [<ffffffff8020b3ae>] system_call+0x7e/0x83 May 20 12:27:56 x14 kernel:
_______________________________________________ Lustre-discuss mailing list [email protected] http://lists.lustre.org/mailman/listinfo/lustre-discuss
