Re: list corruption in IPOIB

Shlomo Pongratz Mon, 20 May 2013 06:38:48 -0700

On 5/20/2013 3:58 PM, Jack Wang wrote:

I haven't reproduced the original bug we saw in our production environment:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffffa0206c30>] ipoib_cm_tx_reap+0xe0/0x5a0 [ib_ipoib]
 ...

RIP: 0010:[<ffffffffa0206c30>] [<ffffffffa0206c30>] ipoib_cm_tx_reap+0xe0/0x5a0 [ib_ipoib]

RSP: 0018:ffff8825fdcbddb0  EFLAGS: 00010086
RAX: 0000000000000246 RBX: ffff8807b59c29c0 RCX: 0000000000000000
RDX: 4400000006000002 RSI: 0000000000000246 RDI: ffff8810026527c0
RBP: ffff881002652000 R08: 0000000000015360 R09: dead000000200200
R10: dead000000100100 R11: 0000000000000001 R12: 0000000000000001
R13: 0000000000000000 R14: ffff8810026523a0 R15: ffff8810026527c0

FS: 00007f4c9a325700(0000) GS:ffff880807c00000(0000) knlGS:0000000000000000

CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000002605e3a000 CR4: 00000000000407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Process kworker/u:3 (pid: 61374, threadinfo ffff8825fdcbc000, task ffff8807fd0eafb0)

Stack:
 ffff8820043303c0 ffff880807d52700 ffff8807fd0eafb0 ffff8825fdcbdde0
 ffff8810026533b8 ffffffffa0039868 ffff8825fdcbdde0 ffff8805fd549a00
 ffffffff81b9d480 ffff8807fd2f4000 ffffffffa0206b50 0000000000000000

Call Trace:
 [<ffffffffa0039868>] ? process_req+0xe8/0x1a0 [ib_addr]
 [<ffffffffa0206b50>] ? ipoib_cm_tx_handler+0x2d0/0x2d0 [ib_ipoib]
 [<ffffffff81052d64>] ? process_one_work+0x114/0x470
 [<ffffffff81055033>] ? worker_thread+0x163/0x3e0
 [<ffffffff81054ed0>] ? manage_workers+0x200/0x200
 [<ffffffff81054ed0>] ? manage_workers+0x200/0x200
 [<ffffffff8105963e>] ? kthread+0x9e/0xb0
 [<ffffffff8167e9e4>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff810595a0>] ? kthread_freezable_should_stop+0x60/0x60
 [<ffffffff8167e9e0>] ? gs_change+0x13/0x13
 ...
 [<ffffffffa01fec30>] ipoib_cm_tx_reap+0xe0/0x5a0 [ib_ipoib]
 RSP <ffff881d275f1db0>
---[ end trace 38ff082cbc03dd75 ]---
Kernel panic - not syncing: Fatal exception in interrupt



Only a variant of the crash has been reproduced:

WARNING: at lib/list_debug.c:49 __list_del_entry+0x63/0xd0()
Hardware name: System Product Name

list_del corruption, ffff88020dbd3080->next is LIST_POISON1(dead000000100100)

Modules linked in: ...
Pid: 16248, comm: iperf Tainted: G        W  3.4.23-pserver+ #76
Call Trace:
 <IRQ>  [<ffffffff8103c21f>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff8103c316>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff81428563>] ? do_raw_spin_lock+0xd3/0x140
 [<ffffffff81428883>] __list_del_entry+0x63/0xd0
 [<ffffffff81428901>] list_del+0x11/0x40
 [<ffffffffa02f64c5>] ipoib_cm_handle_tx_wc+0x225/0x380 [ib_ipoib]
 [<ffffffffa02eea44>] ipoib_poll+0x164/0x190 [ib_ipoib]
 [<ffffffff815d91fd>] net_rx_action+0x13d/0x320
 [<ffffffff81044f29>] ? __do_softirq+0x89/0x380
 [<ffffffff81044f98>] __do_softirq+0xf8/0x380
 [<ffffffff8174632c>] call_softirq+0x1c/0x30
 <EOI>  [<ffffffff81004305>] do_softirq+0x95/0xd0
 [<ffffffff815daacc>] ? dev_queue_xmit+0x29c/0xbf0
 [<ffffffff8104461b>] local_bh_enable+0xeb/0xf0
 [<ffffffff815daacc>] dev_queue_xmit+0x29c/0xbf0
 [<ffffffff815da830>] ? ptype_seq_start+0xb0/0xb0
 [<ffffffff815e0d87>] neigh_connected_output+0xc7/0x110
 [<ffffffff8109f36d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff81617386>] ip_finish_output2+0x1c6/0x460
 [<ffffffff8161723a>] ? ip_finish_output2+0x7a/0x460
 [<ffffffff81619033>] ip_finish_output+0xc3/0x230
 [<ffffffff81619510>] ip_output+0xa0/0x110
 [<ffffffff8161764d>] ip_local_out+0x2d/0x90
 [<ffffffff816176cb>] ip_send_skb+0x1b/0x60
 [<ffffffff8163f27b>] udp_send_skb+0x10b/0x380
 [<ffffffff815c3a70>] ? sock_def_wakeup+0x1b0/0x1b0
 [<ffffffff81616e90>] ? ip_append_page+0x530/0x530
 [<ffffffff81641462>] udp_sendmsg+0x3b2/0xb50
 [<ffffffff8173c530>] ? retint_restore_args+0x13/0x13
 [<ffffffff8164d9a0>] ? inet_create+0x5b0/0x5b0
 [<ffffffff815c2310>] ? sock_update_classid+0x150/0x2b0
 [<ffffffff8164dacb>] inet_sendmsg+0x12b/0x240
 [<ffffffff8164d9a0>] ? inet_create+0x5b0/0x5b0
 [<ffffffff815c2272>] ? sock_update_classid+0xb2/0x2b0
 [<ffffffff815c2310>] ? sock_update_classid+0x150/0x2b0
 [<ffffffff815bda40>] sock_aio_write+0x190/0x1b0
 [<ffffffff8142214e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff8116e82a>] do_sync_write+0xea/0x130
 [<ffffffff8109bfdd>] ? trace_hardirqs_off+0xd/0x10
 [<ffffffff811713d3>] ? fget_light+0x43/0x490
 [<ffffffff813b14f3>] ? security_file_permission+0x23/0x90
 [<ffffffff8116ee82>] vfs_write+0x172/0x190
 [<ffffffff8116ef91>] sys_write+0x51/0x90
 [<ffffffff81744de9>] system_call_fastpath+0x16/0x1b
---[ end trace 66110390802a41db ]---
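For readers unfamiliar with this warning, here is a minimal userspace sketch of what the check in lib/list_debug.c reports. It is a simplified model, not the kernel source: the helper names checked_list_del() and count_rejected_deletes() are made up for illustration, and a 64-bit build is assumed. The poison constants are the ones printed in the warning above (they also show up in R09 and R10 of the first oops). A second delete of an already-deleted entry is caught because its next pointer still holds LIST_POISON1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Poison values as printed in the warning (64-bit assumed). */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0xdead000000100100ULL)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0xdead000000200200ULL)

static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert item right after head. */
static void list_add(struct list_head *item, struct list_head *head)
{
    item->next = head->next;
    item->prev = head;
    head->next->prev = item;
    head->next = item;
}

/*
 * Hypothetical helper modelling a debug-checked delete: refuse to
 * unlink an entry whose pointers are already poisoned, otherwise
 * unlink it and poison it.
 */
static bool checked_list_del(struct list_head *entry)
{
    if (entry->next == LIST_POISON1 || entry->prev == LIST_POISON2)
        return false; /* this is the situation the warning reports */

    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
    return true;
}

/* Delete the same entry twice, as two unsynchronised paths might. */
static int count_rejected_deletes(void)
{
    struct list_head head, entry;
    int rejected = 0;

    init_list_head(&head);
    list_add(&entry, &head);

    if (!checked_list_del(&entry))
        rejected++;
    assert(entry.next == LIST_POISON1); /* first delete poisoned it */

    if (!checked_list_del(&entry))
        rejected++;

    return rejected;
}
```

Calling count_rejected_deletes() from a small main() should report exactly one rejected delete, the second one. The sketch only shows how the double delete is detected, not which two code paths in ipoib perform it.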

After applying

commit fa16ebed31f336e41970f3f0ea9e8279f6be2d27
Author: Shlomo Pongratz <[email protected]>
Date:   Mon Aug 13 14:39:49 2012 +0000

    IB/ipoib: Add missing locking when CM object is deleted

the above warning is gone, but we still see the warning from the beginning of this thread.

2013/5/20 Or Gerlitz <[email protected]>


    On 20/05/2013 15:46, Jinpu Wang wrote:

        A quick test shows the list corruption warning is gone after I
        convert all list_del(&neigh->list) to list_del_list(&neigh->list).
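A possible reading of why such a conversion silences the warning, assuming the replacement meant here is the kernel's list_del_init(): that helper unlinks the entry and re-initialises it to point at itself, so a second deletion only touches the entry and never sees poisoned pointers. The following is a userspace sketch with stand-in definitions, not the ipoib code; double_del_init_is_harmless() is a made-up demo function.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert item right after head. */
static void list_add(struct list_head *item, struct list_head *head)
{
    item->next = head->next;
    item->prev = head;
    head->next->prev = item;
    head->next = item;
}

static bool list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Unlink the entry, then make it an empty list pointing at itself. */
static void list_del_init(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    init_list_head(entry);
}

/* Delete entry a twice and check that the rest of the list is intact. */
static bool double_del_init_is_harmless(void)
{
    struct list_head head, a, b;

    init_list_head(&head);
    list_add(&a, &head);
    list_add(&b, &head); /* list is now head -> b -> a */

    list_del_init(&a);
    assert(list_empty(&a)); /* a points at itself */
    list_del_init(&a);      /* second delete only rewrites a's own pointers */

    return head.next == &b && head.prev == &b &&
           b.next == &head && b.prev == &head && list_empty(&a);
}
```

Note that this only makes the repeated deletion harmless. It does not explain why the entry is deleted twice in the first place.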


    yes, but this wasn't your original problem or was it?


    --
    To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
    the body of a message to [email protected]
    More majordomo info at http://vger.kernel.org/majordomo-info.html


Hi Jack,

I don't understand the current status, that is, what you see now after applying the patches. If you no longer get the original bug, why did you include its trace? Or is it a new trace? It is not clear from your mail.

Please add only the trace of the current issue.

Best regards,

S.P.


