Hi Roland.

After commit b13912bbb4a2 "IPoIB: Call skb_dst_drop() once skb is enqueued for sending".
Using connected mode and running multi-threaded iperf for long time,
that is iperf -c <IP> -P 16 -t 3600 results with the following crash.

Red Hat Enterprise Linux Server release 6.3 (Santiago)
Kernel 3.7.0 on an x86_64

r-hercules04 login: [ 693.383404] BUG: unable to handle kernel NULL pointer dereference at 00000000000000e8
[  693.424982] IP: [<ffffffff815138e8>] sock_wfree+0x18/0x70
[  693.452221] PGD 82bc6d067 PUD 82c441067 PMD 0
[  693.475288] Oops: 0000 [#1] SMP
[ 693.491589] Modules linked in: netconsole nfsv3 nfs_acl nfs ip6table_filter ip6_tables fscache ebtable_nat lockd ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM iptable_mangle sunrpc 8021q garp bridge stp llc rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_uverbs ib_umad iw_nes libcrc32c iw_cxgb3 cxgb3 mdio mlx4_ib ib_sa ib_mthca ib_mad ib_core binfmt_misc vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode serio_raw pcspkr hpwdt hpilo sb_edac edac_core lpc_ich mfd_core mlx4_en mlx4_core tg3 ptp pps_core ioatdma dca acpi_power_meter hpsa
[  693.790490] CPU 0
[ 693.799561] Pid: 2248, comm: avahi-daemon Tainted: G W 3.7.0 #3 HP ProLiant DL380p Gen8 [ 693.845805] RIP: 0010:[<ffffffff815138e8>] [<ffffffff815138e8>] sock_wfree+0x18/0x70
[  693.885685] RSP: 0000:ffff88042fa03d10  EFLAGS: 00010282
[ 693.913293] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88082bee8e00 [ 693.950574] RDX: 0000000000001100 RSI: 00000004105240e4 RDI: ffff88042bcacef8 [ 693.987599] RBP: ffff88042fa03d20 R08: 0000000000000000 R09: 0000000000000003 [ 694.023758] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88042bcacef8 [ 694.062076] R13: ffff88041eae4800 R14: ffffc90018147330 R15: ffff88041eae5158 [ 694.100211] FS: 00007fa892ffe700(0000) GS:ffff88042fa00000(0000) knlGS:0000000000000000
[  694.142370] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 694.172468] CR2: 00000000000000e8 CR3: 000000082b5e8000 CR4: 00000000000407f0 [ 694.208474] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 694.245605] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 694.282249] Process avahi-daemon (pid: 2248, threadinfo ffff88082bf56000, task ffff88082ce78000)
[  694.328163] Stack:
[ 694.338109] ffff88042fa03e00 ffff88042bcacef8 ffff88042fa03d40 ffffffff815775fd [ 694.376164] ffff88042bcacef8 ffff88042bcacef8 ffff88042fa03d60 ffffffff81518000 [ 694.413834] ffff88042fa03d70 ffff88042bcacef8 ffff88042fa03d80 ffffffff81517e66
[  694.451789] Call Trace:
[  694.465091]  <IRQ>
[  694.475075]  [<ffffffff815775fd>] tcp_wfree+0x3d/0xd0
[  694.501753]  [<ffffffff81518000>] skb_release_head_state+0x70/0x120
[  694.534482]  [<ffffffff81517e66>] __kfree_skb+0x16/0xa0
[  694.561060]  [<ffffffff8151855c>] consume_skb+0x2c/0x80
[  694.588186]  [<ffffffff8152271d>] dev_kfree_skb_any+0x3d/0x50
[ 694.617646] [<ffffffffa02ef458>] ipoib_cm_handle_tx_wc+0xa8/0x340 [ib_ipoib]
[  694.654583]  [<ffffffffa007b3c3>] ? mlx4_eq_int+0xc3/0x9e0 [mlx4_core]
[  694.688519]  [<ffffffffa02e72f4>] ipoib_poll+0x174/0x190 [ib_ipoib]
[  694.721526]  [<ffffffff813b5731>] ? add_interrupt_randomness+0x41/0x190
[  694.755056]  [<ffffffff81527ae3>] net_rx_action+0x103/0x280
[  694.783849]  [<ffffffff81068bf6>] __do_softirq+0xc6/0x240
[  694.812603]  [<ffffffff810eff7d>] ? handle_irq_event+0x4d/0x70
[  694.842805]  [<ffffffff8163109c>] call_softirq+0x1c/0x30
[  694.870480]  [<ffffffff810164a5>] do_softirq+0x65/0xa0
[  694.896157]  [<ffffffff810689dd>] irq_exit+0xcd/0xf0
[  694.922004]  [<ffffffff81631926>] do_IRQ+0x66/0xe0
[  694.947348]  [<ffffffff816279ad>] common_interrupt+0x6d/0x6d
[  694.977302]  <EOI>
[  694.986910]  [<ffffffff8162fd59>] ? system_call_fastpath+0x16/0x1b
[ 695.019750] Code: 00 00 e8 0c 87 ff ff eb ca 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 8b 5f 18 8b 97 f0 00 00 00 <48> 8b 8b e8 00 00 00 89 d0 80 e5 02 74 22 f0 29 83 04 01 00 00
[  695.118690] RIP  [<ffffffff815138e8>] sock_wfree+0x18/0x70
[  695.147669]  RSP <ffff88042fa03d10>
[  695.165503] CR2: 00000000000000e8
[  695.182436] ---[ end trace 71c706b1d9cf8572 ]---
[  695.207779] Kernel panic - not syncing: Fatal exception in interrupt
[  695.242603] ------------[ cut here ]------------
[ 695.268868] WARNING: at arch/x86/kernel/smp.c:123 native_smp_send_reschedule+0x5c/0x60()
[  695.312657] Hardware name: ProLiant DL380p Gen8
[ 695.336881] Modules linked in: netconsole nfsv3 nfs_acl nfs ip6table_filter ip6_tables fscache ebtable_nat lockd ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_CHECKSUM iptable_mangle sunrpc 8021q garp bridge stp llc rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_uverbs ib_umad iw_nes libcrc32c iw_cxgb3 cxgb3 mdio mlx4_ib ib_sa ib_mthca ib_mad ib_core binfmt_misc vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode serio_raw pcspkr hpwdt hpilo sb_edac edac_core lpc_ich mfd_core mlx4_en mlx4_core tg3 ptp pps_core ioatdma dca acpi_power_meter hpsa
[  695.640007] Pid: 2248, comm: avahi-daemon Tainted: G      D W 3.7.0 #3
[  695.677587] Call Trace:
[  695.690504]  <IRQ>  [<ffffffff8106016f>] warn_slowpath_common+0x7f/0xc0
[  695.725938]  [<ffffffff810601ca>] warn_slowpath_null+0x1a/0x20
[  695.755393]  [<ffffffff81038c9c>] native_smp_send_reschedule+0x5c/0x60
[  695.789711]  [<ffffffff8109ba0a>] trigger_load_balance+0x1aa/0x230
[  695.821367]  [<ffffffff8109346f>] scheduler_tick+0x10f/0x140
[  695.850842]  [<ffffffff8107030e>] update_process_times+0x6e/0x90
[  695.881763]  [<ffffffff810b9ca9>] tick_sched_handle+0x39/0x80
[  695.912138]  [<ffffffff810b9ed4>] tick_sched_timer+0x54/0x90
[  695.941189]  [<ffffffff810882f1>] __run_hrtimer+0x71/0x1d0
[  695.968818]  [<ffffffff810b9e80>] ? tick_nohz_handler+0xc0/0xc0
[  696.001198]  [<ffffffff810886d6>] hrtimer_interrupt+0xf6/0x240
[  696.031610]  [<ffffffff81631a09>] smp_apic_timer_interrupt+0x69/0x99
[  696.064166]  [<ffffffff8163095d>] apic_timer_interrupt+0x6d/0x80
[  696.095812]  [<ffffffff81624582>] ? panic+0x19c/0x1e4
[  696.122152]  [<ffffffff816244df>] ? panic+0xf9/0x1e4
[  696.148977]  [<ffffffff816288ca>] oops_end+0xea/0xf0
[  696.174674]  [<ffffffff81048b2d>] no_context+0x11d/0x2d0
[  696.201968]  [<ffffffff81048e1d>] __bad_area_nosemaphore+0x13d/0x220
[  696.234976]  [<ffffffff81048f13>] bad_area_nosemaphore+0x13/0x20
[  696.266854]  [<ffffffff8162b60a>] __do_page_fault+0x37a/0x4a0
[  696.296904]  [<ffffffff8155abee>] ? ip_local_deliver_finish+0xde/0x290
[  696.331229]  [<ffffffff8155ade7>] ? ip_local_deliver+0x47/0x80
[  696.361302]  [<ffffffff8155a489>] ? ip_rcv_finish+0x119/0x360
[  696.391923]  [<ffffffff8155aa31>] ? ip_rcv+0x221/0x300
[  696.418131]  [<ffffffff8162b73e>] do_page_fault+0xe/0x10
[  696.446177]  [<ffffffff81627c98>] page_fault+0x28/0x30
[  696.471956]  [<ffffffff815138e8>] ? sock_wfree+0x18/0x70
[  696.498716]  [<ffffffff815775fd>] tcp_wfree+0x3d/0xd0
[  696.527393]  [<ffffffff81518000>] skb_release_head_state+0x70/0x120
[  696.558843]  [<ffffffff81517e66>] __kfree_skb+0x16/0xa0
[  696.586331]  [<ffffffff8151855c>] consume_skb+0x2c/0x80
[  696.613061]  [<ffffffff8152271d>] dev_kfree_skb_any+0x3d/0x50
[ 696.643584] [<ffffffffa02ef458>] ipoib_cm_handle_tx_wc+0xa8/0x340 [ib_ipoib]
[  696.680581]  [<ffffffffa007b3c3>] ? mlx4_eq_int+0xc3/0x9e0 [mlx4_core]
[  696.715184]  [<ffffffffa02e72f4>] ipoib_poll+0x174/0x190 [ib_ipoib]
[  696.747030]  [<ffffffff813b5731>] ? add_interrupt_randomness+0x41/0x190
[  696.782367]  [<ffffffff81527ae3>] net_rx_action+0x103/0x280
[  696.810831]  [<ffffffff81068bf6>] __do_softirq+0xc6/0x240
[  696.839378]  [<ffffffff810eff7d>] ? handle_irq_event+0x4d/0x70
[  696.868711]  [<ffffffff8163109c>] call_softirq+0x1c/0x30
[  696.896905]  [<ffffffff810164a5>] do_softirq+0x65/0xa0
[  696.922869]  [<ffffffff810689dd>] irq_exit+0xcd/0xf0
[  696.948477]  [<ffffffff81631926>] do_IRQ+0x66/0xe0
[  696.973477]  [<ffffffff816279ad>] common_interrupt+0x6d/0x6d
[  697.002119]  <EOI>  [<ffffffff8162fd59>] ? system_call_fastpath+0x16/0x1b
[  697.038451] ---[ end trace 71c706b1d9cf8573 ]---

Disassembling vmlinux shows that the NULL dereferencing address "sock_wfree+0x18" is
    where "skb->sk" is dereferenced.

    /*
     * Write buffer destructor automatically called from kfree_skb.
     */
    void sock_wfree(struct sk_buff *skb)
    {
    ffffffff815138d0:       55                      push   %rbp
    ffffffff815138d1:       48 89 e5                mov    %rsp,%rbp
    ffffffff815138d4:       53                      push   %rbx
    ffffffff815138d5:       48 83 ec 08             sub    $0x8,%rsp
ffffffff815138d9: e8 22 c0 11 00 callq ffffffff8162f900 <mcount>
            struct sock *sk = skb->sk;
    ffffffff815138de:       48 8b 5f 18             mov 0x18(%rdi),%rbx
            unsigned int len = skb->truesize;
    ffffffff815138e2:       8b 97 f0 00 00 00       mov 0xf0(%rdi),%edx
    ffffffff815138e8:       48 8b 8b e8 00 00 00    mov 0xe8(%rbx),%rcx

Crash analysis.

After the mentioned patch the driver is calling skb_orphan and skb_dst_drop after calling post_send in ipoib_cm.c:: ipoib_cm_send "also in ipoib_ib.c:: ipoib_send"

The problem with this is as it is written in a comment in both routines is, quote "it's entirely possible that the completion handler will run before we execute
anything after the post_send()."

The rest of the analysis of the crash (below) is as follows.
The call the skbuff.h::skb_orphan is delayed, meanwhile the completion starts on another CPU on NAPI context, and in the routine "skbuff.c::skb_release_head_state" the skb->destructor is still pointing to "tcp_output.c::tcp_wfree". At this point the "skbuff.h::skb_orpahn" executes the destructor too and nullified the "skb->destructor" and the "skb->sk" fields.
We can have two destructors running at the same time!
And note that the destructor called from TX complete can dereference the nullified "skb->sk", if it is called before the orphan nullified "skb->destructor" but dereference "skb->sk"
right after it was nullified.
This scenario is consistent with the fact that the "skb->destructor" namely "tcp_output.c::tcp_wfree"
was called and when it called "sock.c::sock_wfree" the skb->sk" was NULL.

I don't know why this issue didn't manifested itself with UD. Maybe UD transmit completes faster.

Proposed solution (patch will follow)

I think the solution is to always perform the skb_orphan and skb_dst_drop before the transmission. In case error occurs than it will be no different than the regular case where dev_free_skb_any
in the completion path which is assumed to be after these two routines.

Shlomo Pongratz.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to