On 5/20/2013 3:58 PM, Jack Wang wrote:
I haven't reproduced the original bug we saw in our production environment:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffffa0206c30>] ipoib_cm_tx_reap+0xe0/0x5a0 [ib_ipoib]
 ...
RIP: 0010:[<ffffffffa0206c30>] [<ffffffffa0206c30>] ipoib_cm_tx_reap+0xe0/0x5a0 [ib_ipoib]
RSP: 0018:ffff8825fdcbddb0  EFLAGS: 00010086
RAX: 0000000000000246 RBX: ffff8807b59c29c0 RCX: 0000000000000000
RDX: 4400000006000002 RSI: 0000000000000246 RDI: ffff8810026527c0
RBP: ffff881002652000 R08: 0000000000015360 R09: dead000000200200
R10: dead000000100100 R11: 0000000000000001 R12: 0000000000000001
R13: 0000000000000000 R14: ffff8810026523a0 R15: ffff8810026527c0
FS: 00007f4c9a325700(0000) GS:ffff880807c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000002605e3a000 CR4: 00000000000407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:3 (pid: 61374, threadinfo ffff8825fdcbc000, task ffff8807fd0eafb0)
Stack:
 ffff8820043303c0 ffff880807d52700 ffff8807fd0eafb0 ffff8825fdcbdde0
 ffff8810026533b8 ffffffffa0039868 ffff8825fdcbdde0 ffff8805fd549a00
 ffffffff81b9d480 ffff8807fd2f4000 ffffffffa0206b50 0000000000000000
Call Trace:
 [<ffffffffa0039868>] ? process_req+0xe8/0x1a0 [ib_addr]
 [<ffffffffa0206b50>] ? ipoib_cm_tx_handler+0x2d0/0x2d0 [ib_ipoib]
 [<ffffffff81052d64>] ? process_one_work+0x114/0x470
 [<ffffffff81055033>] ? worker_thread+0x163/0x3e0
 [<ffffffff81054ed0>] ? manage_workers+0x200/0x200
 [<ffffffff81054ed0>] ? manage_workers+0x200/0x200
 [<ffffffff8105963e>] ? kthread+0x9e/0xb0
 [<ffffffff8167e9e4>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff810595a0>] ? kthread_freezable_should_stop+0x60/0x60
 [<ffffffff8167e9e0>] ? gs_change+0x13/0x13
 ...
 [<ffffffffa01fec30>] ipoib_cm_tx_reap+0xe0/0x5a0 [ib_ipoib]
 RSP <ffff881d275f1db0>
---[ end trace 38ff082cbc03dd75 ]---
Kernel panic - not syncing: Fatal exception in interrupt

Only a variant of that crash has been reproduced:

WARNING: at lib/list_debug.c:49 __list_del_entry+0x63/0xd0()
Hardware name: System Product Name
list_del corruption, ffff88020dbd3080->next is LIST_POISON1 (dead000000100100)
Modules linked in: ...
Pid: 16248, comm: iperf Tainted: G        W  3.4.23-pserver+ #76
Call Trace:
 <IRQ>  [<ffffffff8103c21f>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff8103c316>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff81428563>] ? do_raw_spin_lock+0xd3/0x140
 [<ffffffff81428883>] __list_del_entry+0x63/0xd0
 [<ffffffff81428901>] list_del+0x11/0x40
 [<ffffffffa02f64c5>] ipoib_cm_handle_tx_wc+0x225/0x380 [ib_ipoib]
 [<ffffffffa02eea44>] ipoib_poll+0x164/0x190 [ib_ipoib]
 [<ffffffff815d91fd>] net_rx_action+0x13d/0x320
 [<ffffffff81044f29>] ? __do_softirq+0x89/0x380
 [<ffffffff81044f98>] __do_softirq+0xf8/0x380
 [<ffffffff8174632c>] call_softirq+0x1c/0x30
 <EOI>  [<ffffffff81004305>] do_softirq+0x95/0xd0
 [<ffffffff815daacc>] ? dev_queue_xmit+0x29c/0xbf0
 [<ffffffff8104461b>] local_bh_enable+0xeb/0xf0
 [<ffffffff815daacc>] dev_queue_xmit+0x29c/0xbf0
 [<ffffffff815da830>] ? ptype_seq_start+0xb0/0xb0
 [<ffffffff815e0d87>] neigh_connected_output+0xc7/0x110
 [<ffffffff8109f36d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff81617386>] ip_finish_output2+0x1c6/0x460
 [<ffffffff8161723a>] ? ip_finish_output2+0x7a/0x460
 [<ffffffff81619033>] ip_finish_output+0xc3/0x230
 [<ffffffff81619510>] ip_output+0xa0/0x110
 [<ffffffff8161764d>] ip_local_out+0x2d/0x90
 [<ffffffff816176cb>] ip_send_skb+0x1b/0x60
 [<ffffffff8163f27b>] udp_send_skb+0x10b/0x380
 [<ffffffff815c3a70>] ? sock_def_wakeup+0x1b0/0x1b0
 [<ffffffff81616e90>] ? ip_append_page+0x530/0x530
 [<ffffffff81641462>] udp_sendmsg+0x3b2/0xb50
 [<ffffffff8173c530>] ? retint_restore_args+0x13/0x13
 [<ffffffff8164d9a0>] ? inet_create+0x5b0/0x5b0
 [<ffffffff815c2310>] ? sock_update_classid+0x150/0x2b0
 [<ffffffff8164dacb>] inet_sendmsg+0x12b/0x240
 [<ffffffff8164d9a0>] ? inet_create+0x5b0/0x5b0
 [<ffffffff815c2272>] ? sock_update_classid+0xb2/0x2b0
 [<ffffffff815c2310>] ? sock_update_classid+0x150/0x2b0
 [<ffffffff815bda40>] sock_aio_write+0x190/0x1b0
 [<ffffffff8142214e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff8116e82a>] do_sync_write+0xea/0x130
 [<ffffffff8109bfdd>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff811713d3>] ? fget_light+0x43/0x490
 [<ffffffff813b14f3>] ? security_file_permission+0x23/0x90
 [<ffffffff8116ee82>] vfs_write+0x172/0x190
 [<ffffffff8116ef91>] sys_write+0x51/0x90
 [<ffffffff81744de9>] system_call_fastpath+0x16/0x1b
---[ end trace 66110390802a41db ]---

After applying
commit fa16ebed31f336e41970f3f0ea9e8279f6be2d27
Author: Shlomo Pongratz <[email protected]>
Date:   Mon Aug 13 14:39:49 2012 +0000

    IB/ipoib: Add missing locking when CM object is deleted

the above warning is gone, but we still see the warning from the beginning of this thread.



2013/5/20 Or Gerlitz <[email protected]>

    On 20/05/2013 15:46, Jinpu Wang wrote:

        A quick test shows the list corruption warning is gone after I
        convert all list_del(&neigh->list) to
        list_del_init(&neigh->list).


    Yes, but this wasn't your original problem, or was it?
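
For readers following the list_del() to list_del_init() conversion discussed above, here is a minimal userspace sketch of why the warning disappears. It is not the ipoib code: the list helpers are simplified stand-ins for the kernel's, the poison values are placeholder sentinels rather than the real LIST_POISON1/LIST_POISON2 constants, and list_del_checked() is a hypothetical name for a list_del() that reports the condition CONFIG_DEBUG_LIST would warn about. The assumption is that the same entry gets deleted twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Placeholder sentinels; the kernel uses fixed poison addresses instead. */
static struct list_head poison1, poison2;
#define LIST_POISON1 (&poison1)
#define LIST_POISON2 (&poison2)

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/*
 * Hypothetical helper: behaves like list_del(), but returns false where a
 * debug kernel would print "list_del corruption ... is LIST_POISON1".
 */
static bool list_del_checked(struct list_head *entry)
{
	if (entry->next == LIST_POISON1 || entry->prev == LIST_POISON2)
		return false;	/* entry was already deleted once */
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
	return true;
}

/* Unlink the entry and leave it as a valid empty list pointing at itself. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* Deleting twice with list_del(): the second call hits the poison check. */
static bool double_list_del_is_detected(void)
{
	struct list_head head, node;

	INIT_LIST_HEAD(&head);
	list_add(&node, &head);
	bool first = list_del_checked(&node);
	bool second = list_del_checked(&node);
	return first && !second;
}

/* Deleting twice with list_del_init(): the second call only touches node. */
static bool double_list_del_init_is_harmless(void)
{
	struct list_head head, node;

	INIT_LIST_HEAD(&head);
	list_add(&node, &head);
	list_del_init(&node);
	list_del_init(&node);
	return head.next == &head && head.prev == &head &&
	       node.next == &node && node.prev == &node;
}
```

Calling both helper functions from a small main() should show the first pattern being flagged and the second passing silently. Note that list_del_init() only makes a repeated delete harmless; if two paths really do delete the same entry, the underlying race may still be there, which is consistent with the question of whether this was the original problem.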


    --
    To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
    the body of a message to [email protected]
    More majordomo info at http://vger.kernel.org/majordomo-info.html



Hi Jack,

I don't understand the current status, that is, what you see now after applying the patches. If you no longer get the original bug, why did you give its trace? Or is it a new trace? It is not clear from your mail.
Please add only the trace of the current issue.

Best regards,

S.P.

