On Sat, Jul 07, 2007 at 04:24:16PM -0700, Roland Dreier wrote: > But I don't understand how we could have reached mthca_alloc_icm() > without getting through mthca_QUERY_FW and printing the FW version > first... are you sure you're getting all the trace messages? How are > you collecting them? Can you make sure that your console level is set > so that you see messages printed with KERN_DEBUG?
You are right, the console did not receive debug messages so I changed mthca_dbg to spam with KERN_ERR priority instead. (This time, it looks like corruption gets triggered at another place and the driver complains to not receive IRQ). Here is the result: # insmod ib_mthca.ko debug_level=1 ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006) ib_mthca: Initializing 0000:08:00.0 PCI: Enabling device 0000:08:00.0 (0000 -> 0002) ib_mthca 0000:08:00.0: FW version 000100020000, max commands 16 ib_mthca 0000:08:00.0: Catastrophic error buffer at 0xb9382a50, size 0x10 ib_mthca 0000:08:00.0: FW supports commands through doorbells ib_mthca 0000:08:00.0: Mapped doorbell page for posting FW commands ib_mthca 0000:08:00.0: FW size 5136 KB ib_mthca 0000:08:00.0: Clear int § b93f00d8, EQ arm § b9361748, EQ set CI § b9372000 ib_mthca 0000:08:00.0: No HCA-attached memory (running in MemFree mode) ib_mthca 0000:08:00.0: Mapped 1284 chunks/5136 KB for FW. ib_mthca 0000:08:00.0: Base MM extensions: no ib_mthca 0000:08:00.0: Max ICM size 523264 MB ib_mthca 0000:08:00.0: Max QPs: 16777216, reserved QPs: 1024, entry size: 256 ib_mthca 0000:08:00.0: Max SRQs: 1024, reserved SRQs: 64, entry size: 32 ib_mthca 0000:08:00.0: Max CQs: 16777216, reserved CQs: 128, entry size: 64 ib_mthca 0000:08:00.0: Max EQs: 64, reserved EQs: 1, entry size: 64 ib_mthca 0000:08:00.0: reserved MPTs: 16, reserved MTTs: 2 ib_mthca 0000:08:00.0: Max PDs: 8388608, reserved PDs: 4, reserved UARs: 1 ib_mthca 0000:08:00.0: Max QP/MCG: 8388608, reserved MGMs: 0 ib_mthca 0000:08:00.0: Max CQEs: 131072, max WQEs: 16384, max SRQ WQEs: 16384 ib_mthca 0000:08:00.0: Flags: 00370347 ib_mthca 0000:08:00.0: profile 0--13/11 § 0x 0 (size 0x20000000) ib_mthca 0000:08:00.0: profile 1--10/20 § 0x 20000000 (size 0x 4000000) ib_mthca 0000:08:00.0: profile 2-- 0/16 § 0x 24000000 (size 0x 1000000) ib_mthca 0000:08:00.0: profile 3-- 7/18 § 0x 25000000 (size 0x 800000) ib_mthca 0000:08:00.0: profile 4-- 9/17 § 0x 25800000 (size 0x 800000) ib_mthca 0000:08:00.0: profile 5-- 3/16 § 0x 26000000 (size 0x 400000) ib_mthca 0000:08:00.0: profile 6-- 4/16 § 0x 26400000 (size 0x 400000) ib_mthca 0000:08:00.0: profile 7-- 8/13 § 0x 26800000 (size 0x 80000) ib_mthca 0000:08:00.0: profile 8--11/11 § 0x 26880000 (size 0x 10000) ib_mthca 0000:08:00.0: profile 9-- 2/10 § 0x 26890000 (size 0x 8000) ib_mthca 0000:08:00.0: profile10-- 1/ 0 § 0x 26898000 (size 0x 1000) ib_mthca 0000:08:00.0: profile11-- 5/ 0 § 0x 26899000 (size 0x 1000) ib_mthca 0000:08:00.0: profile12-- 6/ 5 § 0x 2689a000 (size 0x 1000) ib_mthca 0000:08:00.0: profile13--12/ 0 § 0x 2689b000 (size 0x 1000) ib_mthca 0000:08:00.0: HCA context memory: reserving 631408 KB ib_mthca 0000:08:00.0: 631408 KB of HCA context requires 1244 KB aux memory. ib_mthca 0000:08:00.0: Mapped 311 chunks/1244 KB for ICM aux. ib_mthca 0000:08:00.0: Mapped page at 24d8b000 to 2689a000 for ICM. ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 20000000 for ICM. ib_mthca 0000:08:00.0: Mapped 1 chunks/256 KB at 25800000 for ICM. ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 24000000 for ICM. ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26400000 for ICM. ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26000000 for ICM. ib_mthca 0000:08:00.0: Mapped 8 chunks/32 KB at 26890000 for ICM. ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26800000 for ICM. ib_mthca 0000:08:00.0: Mapped 64 chunks/256 KB at 26840000 for ICM. Unable to handle kernel paging request at 0000001100000019 RIP: <ffffffff803654eb> datagram_poll+0xcc/0xd6 PGD 0 Oops: 0002 1 SMP CPU 0 Modules linked in: ib_mthca nfs lockd nfs_acl sunrpc ib_ipoib ib_cm ib_sa ib_mad ib_core memtrack ipv6 e1000 dm_mod parport_pc lp parport xfs ata_piix ahci piix mptsas mptscsih mptbase scsi_transport_sas raid0 sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core Pid: 2170, comm: ntpd Not tainted 2.6.18-xen31-smp #6 RIP: e030:<ffffffff803654eb> <ffffffff803654eb> datagram_poll+0xcc/0xd6 RSP: e02b:ffff880095e87a88 EFLAGS: 00010246 RAX: 0000001100000011 RBX: ffff8800971e2ac8 RCX: 000000000000000b RDX: 0000000000000000 RSI: 0000000000000049 RDI: 0000000000000002 RBP: ffff88009c8ad390 R08: ffff880095e86000 R09: ffff880095e87760 R10: ffffffff803a492f R11: ffffffff803a492f R12: 0000000000000005 R13: 0000000000000020 R14: ffff880095e87ef8 R15: 0000000000000008 FS: 00002aaaab383ee0(0000) GS:ffffffff804aa000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 Process ntpd (pid: 2170, threadinfo ffff880095e86000, task ffff88009c0977e0) Stack: ffff8800973547b0 ffffffff803a4942 ffffffff803a492f 0000000000000300 ffff88009c8ad390 0000000000000005 0000000000000020 ffffffff8028e1d3 ffff880095e87f40 0000000000000000 ffff880095e87e10 ffff880095e87e18 Call Trace: <ffffffff803a4942> udp_poll+0x13/0xf3 <ffffffff803a492f> udp_poll+0x0/0xf3 <ffffffff8028e1d3> do_select+0x2aa/0x464 <ffffffff8028de4c> __pollwait+0x0/0xdd <ffffffff8022536b> default_wake_function+0x0/0xe <ffffffff8022536b> default_wake_function+0x0/0xe <ffffffff8022536b> default_wake_function+0x0/0xe <ffffffff8022536b> default_wake_function+0x0/0xe <ffffffff8035ff8f> sock_common_recvmsg+0x2d/0x43 <ffffffff8035f7da> sock_recvmsg+0x101/0x120 <ffffffff80277073> poison_obj+0x24/0x2d <ffffffff80277275> cache_free_debugcheck+0x1f9/0x209 <ffffffff803a492f> udp_poll+0x0/0xf3 <ffffffff80277cea> kmem_cache_free+0xd0/0x140 <ffffffff8028e600> sys_select+0x273/0x3e5 <ffffffff8020fd0c> init_fpu+0x62/0x7f <ffffffff8020ab90> math_state_restore+0x21/0x4a <ffffffff8020a117> error_exit+0x0/0x71 <ffffffff80208f99> sys_rt_sigreturn+0x251/0x301 <ffffffff802099da> system_call+0x86/0x8b <ffffffff80209954> system_call+0x0/0x8b Code: f0 0f ba 68 08 00 5b 89 f0 c3 41 57 41 89 f7 41 56 41 55 41 RIP <ffffffff803654eb> datagram_poll+0xcc/0xd6 RSP <ffff880095e87a88> CR2: 0000001100000019 <1>Unable to handle kernel paging request at 0000000d0000001d RIP: <ffffffff881484e3> :xfs:xfs_file_close+0x1c/0x28 PGD 0 Oops: 0000 2 SMP CPU 0 Modules linked in: ib_mthca nfs lockd nfs_acl sunrpc ib_ipoib ib_cm ib_sa ib_mad ib_core memtrack ipv6 e1000 dm_mod parport_pc lp parport xfs ata_piix ahci piix mptsas mptscsih mptbase scsi_transport_sas raid0 sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core Pid: 2170, comm: ntpd Not tainted 2.6.18-xen31-smp #6 RIP: e030:<ffffffff881484e3> <ffffffff881484e3> :xfs:xfs_file_close+0x1c/0x28 RSP: e02b:ffff880095e87828 EFLAGS: 00010246 RAX: ffff8800971e4078 RBX: ffff88009cbe0bd0 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000d0000000d RBP: ffff880000b1e998 R08: ffff880000c39280 R09: 0000000000000298 R10: ffff880097eb1860 R11: 0000000000000298 R12: ffff880000b1e9a8 R13: 0000000000000009 R14: 0000000000000000 R15: 0000000000000001 FS: 00002aaaab383ee0(0000) GS:ffffffff804aa000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 Process ntpd (pid: 2170, threadinfo ffff880095e86000, task ffff88009c0977e0) Stack: ffffffff881484c7 ffffffff8027aa6a ffff880000b1e998 0000000000000001 ffff880000b1e9a8 ffffffff8022c9f1 000009c000000a00 ffff880000b1e998 ffff88009c0977e0 0000000000000001 0000000000000009 ffff880095e879d8 Call Trace: <ffffffff881484c7> :xfs:xfs_file_close+0x0/0x28 <ffffffff8027aa6a> filp_close+0x36/0x64 <ffffffff8022c9f1> put_files_struct+0x6c/0xbf <ffffffff8022dcab> do_exit+0x2ae/0x929 <ffffffff8020622a> hypercall_page+0x22a/0x1000 <ffffffff803c6783> do_page_fault+0x119e/0x1253 <ffffffff8020de8b> monotonic_clock+0x3e/0x86 <ffffffff803c1abd> thread_return+0x0/0x13d <ffffffff8020a117> error_exit+0x0/0x71 <ffffffff803a492f> udp_poll+0x0/0xf3 <ffffffff803a492f> udp_poll+0x0/0xf3 <ffffffff803654eb> datagram_poll+0xcc/0xd6 <ffffffff80365440> datagram_poll+0x21/0xd6 <ffffffff803a4942> udp_poll+0x13/0xf3 <ffffffff803a492f> udp_poll+0x0/0xf3 <ffffffff8028e1d3> do_select+0x2aa/0x464 <ffffffff8028de4c> __pollwait+0x0/0xdd <ffffffff8022536b> default_wake_function+0x0/0xe <ffffffff8022536b> default_wake_function+0x0/0xe <ffffffff8022536b> default_wake_function+0x0/0xe <ffffffff8022536b> default_wake_function+0x0/0xe <ffffffff8035ff8f> sock_common_recvmsg+0x2d/0x43 <ffffffff8035f7da> sock_recvmsg+0x101/0x120 <ffffffff80277073> poison_obj+0x24/0x2d <ffffffff80277275> cache_free_debugcheck+0x1f9/0x209 <ffffffff803a492f> udp_poll+0x0/0xf3 <ffffffff80277cea> kmem_cache_free+0xd0/0x140 <ffffffff8028e600> sys_select+0x273/0x3e5 <ffffffff8020fd0c> init_fpu+0x62/0x7f <ffffffff8020ab90> math_state_restore+0x21/0x4a <ffffffff8020a117> error_exit+0x0/0x71 <ffffffff80208f99> sys_rt_sigreturn+0x251/0x301 <ffffffff802099da> system_call+0x86/0x8b <ffffffff80209954> system_call+0x0/0x8b Code: 48 8b 47 10 ff 50 10 41 5b f7 d8 c3 31 c0 48 83 ff 28 51 74 RIP <ffffffff881484e3> :xfs:xfs_file_close+0x1c/0x28 RSP <ffff880095e87828> CR2: 0000000d0000001d <1>Fixing recursive fault but reboot is needed! syslog-ng1981: segfault at ffffffff80808080 rip 000055555555bc28 rsp 00007fffffffd868 error 6 Slab corruption: start=ffff880096c743b8, len=256 Redzone: 0x1600000016/0x1700000017. Last user: <0000001800000018>(0x1800000018) 000: 17 00 00 00 17 00 00 00 18 00 00 00 18 00 00 00 010: 19 00 00 00 19 00 00 00 1a 00 00 00 1a 00 00 00 020: 1b 00 00 00 1b 00 00 00 1c 00 00 00 1c 00 00 00 030: 1d 00 00 00 1d 00 00 00 1e 00 00 00 1e 00 00 00 040: 1f 00 00 00 1f 00 00 00 00 00 00 00 00 00 00 00 050: 01 00 00 00 01 00 00 00 02 00 00 00 02 00 00 00 Prev obj: start=0000000396c7420b, len=256 Unable to handle kernel paging request at 0000000396c7430b RIP: <ffffffff80277313> print_objinfo+0x22/0xde PGD 0 Oops: 0000 3 SMP CPU 0 Modules linked in: ib_mthca nfs lockd nfs_acl sunrpc ib_ipoib ib_cm ib_sa ib_mad ib_core memtrack ipv6 e1000 dm_mod parport_pc lp parport xfs ata_piix ahci piix mptsas mptscsih mptbase scsi_transport_sas raid0 sata_nv libata amd74xx sd_mod scsi_mod ide_disk ide_core Pid: 1981, comm: syslog-ng Not tainted 2.6.18-xen31-smp #6 RIP: e030:<ffffffff80277313> <ffffffff80277313> print_objinfo+0x22/0xde RSP: e02b:ffff8800935cdb48 EFLAGS: 00010206 RAX: 0000000396c7430b RBX: 000000000089dac1 RCX: ffffffffff57c000 RDX: 0000000000000002 RSI: 0000000396c74203 RDI: ffff8800015f20c0 RBP: ffff8800015f20c0 R08: ffff880000cbc788 R09: 000000000000d64e R10: ffff8800935cda98 R11: ffffffff802fd0b5 R12: 0000000396c74203 R13: 0000000000000002 R14: ffff880096c743b0 R15: ffff880096c74000 FS: 00002aaaab0186e0(0000) GS:ffffffff804aa000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 Process syslog-ng (pid: 1981, threadinfo ffff8800935cc000, task ffff88009c0e0860) Stack: 000000000089dac1 ffff8800015f20c0 0000000396c74203 0000000000000100 ffff880096c743b0 ffffffff80277521 ffff8800015f20c0 0000000000000000 ffff8800015f20c0 ffff880096c743b0 ffffffff802ab3b1 00000000000000d0 Call Trace: <ffffffff80277521> check_poison_obj+0x152/0x1ae <ffffffff802ab3b1> elf_core_dump+0xe2/0xc2d <ffffffff802ab3b1> elf_core_dump+0xe2/0xc2d <ffffffff80278269> cache_alloc_debugcheck_after+0x34/0x1b0 <ffffffff802784d7> kmem_cache_alloc+0xf2/0x102 <ffffffff802ab3b1> elf_core_dump+0xe2/0xc2d <ffffffff8027a67c> do_truncate+0x60/0x69 <ffffffff802858f4> do_coredump+0x5a0/0x601 <ffffffff80277cea> kmem_cache_free+0xd0/0x140 <ffffffff80236db4> __dequeue_signal+0x18b/0x19a <ffffffff80238129> get_signal_to_deliver+0x4ee/0x549 <ffffffff80209115> do_signal+0x55/0x6d8 <ffffffff803c67db> do_page_fault+0x11f6/0x1253 <ffffffff88126d95> :xfs:xfs_iunlock+0x4f/0x7a <ffffffff8813eb19> :xfs:xfs_fsync+0x157/0x1a9 <ffffffff80256c9f> __filemap_fdatawrite_range+0x51/0x5b <ffffffff8020a055> retint_signal+0x5d/0xb8 Code: 48 8b 18 48 89 ef e8 11 fd ff ff 48 8b 30 48 c7 c7 da c3 3e RIP <ffffffff80277313> print_objinfo+0x22/0xde RSP <ffff8800935cdb48> CR2: 0000000396c7430b <3>ib_mthca 0000:08:00.0: Memory key throughput optimization activated. ib_mthca 0000:08:00.0: Allocated EQ 1 with 131072 entries ib_mthca 0000:08:00.0: Allocated EQ 2 with 512 entries ib_mthca 0000:08:00.0: Setting mask 00000000001f47fe for eqn 2 ib_mthca 0000:08:00.0: NOP command failed to generate interrupt (IRQ 16), aborting. ib_mthca 0000:08:00.0: BIOS or ACPI interrupt routing problem? ib_mthca 0000:08:00.0: Clearing mask 00000000001f47fe for eqn 2 ib_mthca 0000:08:00.0: HW2SW_EQ failed (-11) ib_mthca 0000:08:00.0: HW2SW_EQ returned status 0xff ib_mthca 0000:08:00.0: HW2SW_MPT failed (-11) ib_mthca 0000:08:00.0: HW2SW_EQ failed (-11) ib_mthca 0000:08:00.0: HW2SW_EQ returned status 0xff ib_mthca 0000:08:00.0: HW2SW_MPT failed (-11) ib_mthca 0000:08:00.0: HW2SW_MPT failed (-11) ib_mthca 0000:08:00.0: Unmapping 64 pages at 26800000 from ICM. ib_mthca 0000:08:00.0: Unmapping 64 pages at 26840000 from ICM. ib_mthca 0000:08:00.0: Unmapping 64 pages at 26890000 from ICM. ib_mthca 0000:08:00.0: Unmapping 64 pages at 26000000 from ICM. ib_mthca 0000:08:00.0: Unmapping 64 pages at 26400000 from ICM. ib_mthca 0000:08:00.0: Unmapping 64 pages at 24000000 from ICM. ib_mthca 0000:08:00.0: Unmapping 64 pages at 25800000 from ICM. -- Lukáš Hejtmánek _______________________________________________ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
