Title: Oops on module teardown (was Re: recursion depth exceeded in ipoib_workqueue)

I tested your recursion patch on SVN 3487, and it works. However, during testing I hit the kernel Oops shown below while unloading the driver. It looks like a race condition (note that this is in the send-timeout flow).

From the disassembly of ib_ipoib.ko (no line-debug info, unfortunately), the failure is at address 5360:
    534c:       48 89 95 b0 00 00 00    mov    %rdx,0xb0(%rbp)
    5353:       f0 ff 0d 00 00 00 00    lock decl 0(%rip)        # 535a <ipoib_mcast_join_complete+0x1fa>
    535a:       0f 88 d9 03 00 00       js     5739 <.text.lock.ipoib_multicast+0x50>
    5360:       41 8b 45 10             mov    0x10(%r13),%eax
    5364:       a8 20                   test   $0x20,%al

I traced this to ipoib_multicast.c:434 (in ipoib_mcast_join_complete):
        if (test_bit(IPOIB_MCAST_RUN, &priv->flags))

The failure is in dereferencing "priv->flags": that is the load from 0x10(%r13) at address 5360. In the register dump below, R13 is 0x380 and CR2 (the faulting address) is 0x390, i.e. R13 + 0x10.

"priv" here is "netdev_priv(dev)", which implies that the value returned by "netdev_priv(dev)" is no longer valid. That bogus pointer is what gets dereferenced.

Environment:
Host 1 Port 1 connected back-to-back to Host 2 Port 1.

Host 1: while date; do /etc/init.d/openibd start ; /etc/init.d/openibd stop ; done
Host 2: runs opensm.

Jack
================================================================================================================

Sep 20 12:05:30 swlab163 kernel: Unable to handle kernel NULL pointer dereference at 0000000000000390 RIP:
Sep 20 12:05:30 swlab163 kernel: <ffffffff8807a360>{:ib_ipoib:ipoib_mcast_join_complete+512}
Sep 20 12:05:30 swlab163 kernel: PGD 777d2067 PUD 773ca067 PMD 0
Sep 20 12:05:30 swlab163 kernel: Oops: 0000 [1] SMP
Sep 20 12:05:30 swlab163 kernel: CPU 0
Sep 20 12:05:30 swlab163 kernel: Modules linked in: ib_ipoib ib_sa ib_uverbs ib_umad ib_mthca ib_mad ib_core video1394 ohci1394 raw1394 ieee1394
Sep 20 12:05:30 swlab163 kernel: Pid: 11302, comm: ib_mad2 Not tainted 2.6.13
Sep 20 12:05:30 swlab163 kernel: RIP: 0010:[<ffffffff8807a360>] <ffffffff8807a360>{:ib_ipoib:ipoib_mcast_join_complete+512}
Sep 20 12:05:30 swlab163 kernel: RSP: 0018:ffff810055bc1d38  EFLAGS: 00010247
Sep 20 12:05:30 swlab163 kernel: RAX: 0000000000000000 RBX: ffffffff8807e000 RCX: ffffffff88070e10
Sep 20 12:05:30 swlab163 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8807e000
Sep 20 12:05:30 swlab163 kernel: RBP: ffff810053b10880 R08: ffff810055bc0000 R09: 0000000000000000
Sep 20 12:05:30 swlab163 kernel: R10: 00000000ffffffff R11: ffffffff8055f320 R12: 00000000ffffff92
Sep 20 12:05:30 swlab163 kernel: R13: 0000000000000380 R14: ffff81007e409a78 R15: ffffffff88042bd0
Sep 20 12:05:30 swlab163 kernel: FS:  00002aaaab15db00(0000) GS:ffffffff805d4800(0000) knlGS:0000000056729bb0
Sep 20 12:05:30 swlab163 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Sep 20 12:05:30 swlab163 kernel: CR2: 0000000000000390 CR3: 00000000777d3000 CR4: 00000000000006e0
Sep 20 12:05:30 swlab163 kernel: Process ib_mad2 (pid: 11302, threadinfo ffff810055bc0000, task ffff810054734830)
Sep 20 12:05:30 swlab163 kernel: Stack: ffff81007a8324c0 ffff810054734830 ffffffff805dffb0 ffffffff803f8855
Sep 20 12:05:30 swlab163 kernel:        ffff810055bc1e58 0000000000000296 ffff810054982f90 00000000ffffff92
Sep 20 12:05:30 swlab163 kernel:        ffff81007e409a10 ffffffff88070e5c
Sep 20 12:05:30 swlab163 kernel: Call Trace:<ffffffff803f8855>{thread_return+0} <ffffffff88070e5c>{:ib_sa:ib_sa_mcmember_rec_callback+76}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff8807060c>{:ib_sa:send_handler+156} <ffffffff88042d4e>{:ib_mad:timeout_sends+382}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff80132ca3>{__wake_up+67} <ffffffff80147e7e>{worker_thread+478}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff80132210>{default_wake_function+0} <ffffffff8012f793>{__wake_up_common+67}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff80132210>{default_wake_function+0} <ffffffff8014c3d0>{keventd_create_kthread+0}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff80147ca0>{worker_thread+0} <ffffffff8014c3d0>{keventd_create_kthread+0}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff8014c529>{kthread+217} <ffffffff8010e50e>{child_rip+8}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff8014c3d0>{keventd_create_kthread+0} <ffffffff8014c450>{kthread+0}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff8010e506>{child_rip+0}
Sep 20 12:05:30 swlab163 kernel:
Sep 20 12:05:30 swlab163 kernel: Code: 41 8b 45 10 a8 20 74 3e 41 83 fc 92 75 15 48 8b 3d cb 46 00
Sep 20 12:05:30 swlab163 kernel: RIP <ffffffff8807a360>{:ib_ipoib:ipoib_mcast_join_complete+512} RSP <ffff810055bc1d38>
Sep 20 12:05:30 swlab163 kernel: CR2: 0000000000000390

_______________________________________________
openib-general mailing list
http://openib.org/mailman/listinfo/openib-general