We are still seeing kernel panics with linux-3.2, this time triggered from mthca_cq_event(). I'm unsure whether this is related to yesterday's cq_completion patch. In any case, I'm CCing Sean.

The kernel logs sometimes show something like

ib_mthca 0000:01:00.0: CQ access violation on CQN 2c0089

and at the same time either our FhGFS daemons, which use ibverbs, crash with a segmentation fault, or the entire kernel panics as shown below. My next step is to debug the FhGFS crashes to see whether they come from the IB libraries or from a real bug in the daemon.

Below is the kernel panic. The kernel already includes the patch to initialize qp->usecnt.

[53904.589342] ib_mthca 0000:01:00.0: CQ access violation on CQN 00008b
[53964.464518] ib_mthca 0000:01:00.0: CQ access violation on CQN d2009f
[53964.468302] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
[53964.468302] IP: [<ffffffffa03a71a8>] ib_uverbs_async_handler+0x28/0x150 [ib_uverbs]
[53964.468302] PGD 1f8d18067 PUD 1f3904067 PMD 0
[53964.468302] Oops: 0000 [#1] SMP
[53964.468302] CPU 1
[53964.468302] Modules linked in: nfsd ext4 mbcache jbd2 crc16 mlx4_ib mlx4_core ib_umad rdma_ucm rdma_cm iw_cm ib_addr ib_uverbs ib_ipoib ib_cm ib_sa sg ipv6 sd_mod crc_t10dif loop arcmsr md_mod pcspkr 8250_pnp ib_mthca ib_mad ib_core fuse af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc btrfs lzo_decompress lzo_compress zlib_deflate crc32c libcrc32c crypto_hash crypto_algapi ata_generic pata_acpi pata_amd e1000 sata_nv libata scsi_mod unix [last unloaded: scsi_wait_scan]
[53964.468302]
[53964.468302] Pid: 10644, comm: fhgfs-storage-u Not tainted 3.2.0+ #10 Supermicro H8DCE/H8DCE
[53964.468302] RIP: 0010:[<ffffffffa03a71a8>]  [<ffffffffa03a71a8>] ib_uverbs_async_handler+0x28/0x150 [ib_uverbs]
[53964.468302] RSP: 0018:ffff8801ffc039b0  EFLAGS: 00010082
[53964.468302] RAX: ffff8801f948e300 RBX: 0000000000000000 RCX: ffff8801f948e370
[53964.468302] RDX: 0000000000000000 RSI: ffff8801f948ee40 RDI: 0000000000000000
[53964.468302] RBP: ffff8801ffc039f0 R08: ffff8801f948e384 R09: ffffffff8142c5e0
[53964.468302] R10: 0000000000000006 R11: 000000000000000d R12: 0000000000d2009f
[53964.468302] R13: ffff8800bf5aba20 R14: 0000000000000000 R15: ffff8801f3a82400
[53964.468302] FS:  00007ffff4ca7700(0000) GS:ffff8801ffc00000(0000) knlGS:0000000000000000
[53964.468302] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[53964.468302] CR2: 0000000000000058 CR3: 00000001f96d4000 CR4: 00000000000006e0
[53964.468302] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[53964.468302] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[53964.468302] Process fhgfs-storage-u (pid: 10644, threadinfo ffff880000090000, task ffff8800c8139650)
[53964.468302] Stack:
[53964.468302]  ffff8801ffc03a00 ffff8801f948e384 ffffffffa0318208 ffff8800bf5ab000
[53964.468302]  0000000000d2009f ffff8800bf5aba20 0000000000000000 ffff8801f3a82400
[53964.468302]  ffff8801ffc03a00 ffffffffa03a737b ffff8801ffc03a60 ffffffffa0306f77
[53964.468302] Call Trace:
[53964.468302]  <IRQ>
[53964.468302]  [<ffffffffa03a737b>] ib_uverbs_cq_event_handler+0x2b/0x30 [ib_uverbs]
[53964.468302]  [<ffffffffa0306f77>] mthca_cq_event+0x87/0x110 [ib_mthca]
[53964.468302]  [<ffffffffa03062a4>] mthca_eq_int+0x2d4/0x410 [ib_mthca]
[53964.468302]  [<ffffffffa0306544>] mthca_arbel_msi_x_interrupt+0x24/0x60 [ib_mthca]
[53964.468302]  [<ffffffff810b54fd>] handle_irq_event_percpu+0x5d/0x210
[53964.468302]  [<ffffffff810b56f0>] handle_irq_event+0x40/0x70
[53964.468302]  [<ffffffff810b8d0d>] handle_edge_irq+0x6d/0x120
[53964.468302]  [<ffffffff810166a2>] handle_irq+0x22/0x30
[53964.468302]  [<ffffffff81390aad>] do_IRQ+0x5d/0xe0
[53964.468302]  [<ffffffff81385eb3>] common_interrupt+0x73/0x73
[53964.468302]  [<ffffffff812e3f9b>] ? __alloc_skb+0x4b/0x170
[53964.468302]  [<ffffffff8113e0fb>] ? kmem_cache_alloc_node+0x3b/0x130
[53964.468302]  [<ffffffff8131af61>] ? ip_rcv+0x201/0x2e0
[53964.468302]  [<ffffffff812e3f9b>] __alloc_skb+0x4b/0x170
[53964.468302]  [<ffffffff812e457d>] dev_alloc_skb+0x1d/0x40
[53964.468302]  [<ffffffffa0395fca>] ipoib_alloc_rx_skb+0x4a/0x380 [ib_ipoib]


ib_uverbs_async_handler+0x28 translates to:

Reading symbols from /home/schubert/src/linux/linux-stable/debian/tmp/lib/modules/3.2.0+/kernel/drivers/infiniband/core/ib_uverbs.ko...done.
(gdb) l *(ib_uverbs_async_handler+0x28)
0x11a8 is in ib_uverbs_async_handler (drivers/infiniband/core/uverbs_main.c:440).
435                                         u32 *counter)
436     {
437             struct ib_uverbs_event *entry;
438             unsigned long flags;
439
440             spin_lock_irqsave(&file->async_file->lock, flags);
441             if (file->async_file->is_closed) {
442                     spin_unlock_irqrestore(&file->async_file->lock, flags);
443                     return;
444             }
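
One way to read the fault address, as a rough sketch only: line 440 dereferences both file and file->async_file, CR2 is 0x58, and RDI (the first-argument register on x86-64, assuming it has not been reused by +0x28) is 0. That would be consistent with a NULL base pointer plus a member offset of 0x58, either file being NULL or file->async_file being NULL. The struct below is a hypothetical stand-in, not the real struct ib_uverbs_file layout; its padding is chosen only so that the member lands at 0x58.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>   /* uintptr_t */

/* Hypothetical stand-in, NOT the real struct ib_uverbs_file layout.
 * The padding is an assumption, chosen only so that async_file
 * lands at offset 0x58 to match the CR2 value in the oops. */
struct demo_file {
    unsigned char other_members[0x58];
    void *async_file;
};

static_assert(offsetof(struct demo_file, async_file) == 0x58,
              "demo layout must place async_file at 0x58");

/* Address the CPU would touch when loading base->async_file. */
uintptr_t member_fault_address(uintptr_t base)
{
    return base + offsetof(struct demo_file, async_file);
}
```

With base == 0, member_fault_address() returns 0x58, the value seen in CR2. Which pointer is actually NULL cannot be decided from this sketch; the real offsets could be checked against the module's debug info, for example with gdb's `print &((struct ib_uverbs_file *)0)->async_file`.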


Any ideas?


Thanks,
Bernd