On 07/11/2012 06:18 PM, akepner wrote:
> Using the 3.7.21 version of the ixgbe driver we can reliably
> produce a crash with this signature:
>
> BUG: unable to handle kernel NULL pointer dereference at 000000000000006c
> IP: [<ffffffffa005afef>] ixgbe_poll+0x9df/0x1710 [ixgbe]
> PGD 814c7b067 PUD 8074dd067 PMD 0
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/devices/virtual/bypass/8-9/ping_watchdog
> CPU 2
> Pid: 18925, comm: sport Tainted: P ---------------- 2.6.32-perf
> #1 To Be Filled By O.E.M.
> RIP: 0010:[<ffffffffa005afef>] [<ffffffffa005afef>] ixgbe_poll+0x9df/0x1710
> [ixgbe]
> RSP: 0018:ffff88080750b8b0 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: ffff88040f816f00 RCX: 0000000000000000
> RDX: 0000000000000020 RSI: ffffc9000429c000 RDI: ffff88040f891d80
> RBP: ffff88080750b970 R08: 0000000000000100 R09: 0000000000000000
> R10: 0000000000000100 R11: ffff88080750bfd8 R12: 0000000000000000
> R13: ffffc900041221b8 R14: ffff8804077580b0 R15: 000000000000000b
> FS: 00007f61ccda9700(0000) GS:ffff880028280000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000000000000006c CR3: 0000000814436000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process sport (pid: 18925, threadinfo ffff88080750a000, task ffff880814beeb60)
> Stack:
> 0000000000000000 ffff8804148b0540 000000000000000e ffff88040703d1c0
> <0> ffff8804148b0598 ffff88080750b918 ffffffff815671ac 00000001000359c2
> <0> ffff880410744700 0000000000000040 ffff88040f891d80 0000004011087c9c
> Call Trace:
> [<ffffffff815671ac>] ? ip_finish_output+0x13c/0x310
> [<ffffffff8152b468>] net_rx_action+0xb8/0x400
> [<ffffffff81517a84>] ? sock_def_readable+0x44/0x80
> [<ffffffff81066a91>] __do_softirq+0xc1/0x1d0
> [<ffffffff8100c1ec>] call_softirq+0x1c/0x30
> [<ffffffff8100de25>] do_softirq+0x65/0xa0
> [<ffffffff8106699a>] local_bh_enable+0x9a/0xb0
> [<ffffffff815176fc>] lock_sock_nested+0xac/0xc0
> [<ffffffff81641f0b>] ? _spin_unlock_bh+0x1b/0x20
> [<ffffffff81517627>] ? release_sock+0xd7/0x100
> [<ffffffff81571838>] tcp_recvmsg+0x38/0xe80
> [<ffffffff812d4c19>] ? cpumask_next_and+0x29/0x50
> [<ffffffff8104b6f4>] ? find_busiest_group+0x244/0xb10
> [<ffffffff810544d2>] ? default_wake_function+0x12/0x20
> [<ffffffff81516cf9>] sock_common_recvmsg+0x39/0x50
> [<ffffffff81516829>] sock_aio_read+0x159/0x160
> [<ffffffff8104dbd3>] ? perf_event_task_sched_out+0x33/0x80
> [<ffffffff810097ac>] ? __switch_to+0x1ac/0x320
> [<ffffffff815166d0>] ? sock_aio_read+0x0/0x160
> [<ffffffff811533bb>] do_sync_readv_writev+0xfb/0x140
> [<ffffffff810853b0>] ? autoremove_wake_function+0x0/0x40
> [<ffffffff811543df>] do_readv_writev+0xcf/0x1f0
> [<ffffffff8156dc0d>] ? do_tcp_getsockopt+0x3d/0x5f0
> [<ffffffff81012879>] ? read_tsc+0x9/0x20
> [<ffffffff8108fc13>] ? ktime_get+0x63/0xe0
> [<ffffffff810650c2>] ? ns_to_timeval+0x12/0x40
> [<ffffffff810896af>] ? hrtimer_get_remaining+0x3f/0x50
> [<ffffffff811546d3>] vfs_readv+0x43/0x60
> [<ffffffff811547d1>] sys_readv+0x51/0x80
> [<ffffffff8100b132>] system_call_fastpath+0x16/0x1b
> Code: c1 e5 03 4c 03 6b 20 4d 8b 65 00 49 c7 45 00 00 00 00 00 0f ae e8 48 8b
> 53 28 31 c0 f6 c2 10 74 0a 41 f7 06 00 00 1e 00 0f 95 c0 <41> 8b 74 24 6c 49
> 8b 8c 24 b0 01 00 00 85 f6 0f 18 09 0f 85 c0
> RIP [<ffffffffa005afef>] ixgbe_poll+0x9df/0x1710 [ixgbe]
> RSP <ffff88080750b8b0>
> CR2: 000000000000006c
> ---[ end trace 9db4623b9591cd54 ]---
>
> addr2line says this is happening on line 2028 below - so a NULL skb
> pointer is being passed to skb_is_nonlinear():
>
> 1990 static bool ixgbe_clean_rx_irq_ps(struct ixgbe_q_vector *q_vector,
> 1991                                   struct ixgbe_ring *rx_ring,
> 1992                                   int budget)
> 1993 {
> .....
> 2021         rmb();
> 2022
> 2023         pkt_is_rsc = ixgbe_get_rsc_state(rx_ring, rx_desc);
> 2024
> 2025         prefetch(skb->data);
> 2026
> 2027         /* pull the header of the skb in if no data is already
> present */
> 2028         if (!skb_is_nonlinear(skb)) {
> 2029                 __skb_put(skb, ixgbe_get_hlen(rx_ring,
> rx_desc));
>
> Anyone have a guess as to the cause? Or have you seen similar?
>
> One good clue that we've found is that the problem disappears if we
> turn off irq balancing.
>

It seems like it might be some sort of memory corruption. In order to get into that state the DD bit has to be set for the descriptor and the skb would have to be NULL.
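
For illustration, the condition described above (the descriptor reports done, but the driver's per-slot skb pointer is NULL) can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions: the struct names, the RX_STAT_DD value and classify_rx_slot() are hypothetical stand-ins, not the ixgbe driver's actual definitions. It shows the invariant a receive loop relies on and where a defensive check could sit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "descriptor done" bit for this model only. */
#define RX_STAT_DD 0x01u

/* Minimal stand-in for a socket buffer; not the kernel's struct sk_buff. */
struct model_skb {
    unsigned int len;
    unsigned int data_len;   /* non-zero would mean paged (nonlinear) data */
};

/* Written by the (modeled) hardware on completion. */
struct model_rx_desc {
    uint32_t status;
};

/* Owned by the (modeled) driver: one buffer slot per descriptor. */
struct model_rx_buffer {
    struct model_skb *skb;
};

enum rx_slot_state {
    RX_SLOT_NOT_READY,   /* DD clear: hardware has not written this slot */
    RX_SLOT_OK,          /* DD set and an skb is attached */
    RX_SLOT_CORRUPT      /* DD set but no skb: the state seen in the oops */
};

/* Classify one ring slot before anything dereferences its skb. */
static enum rx_slot_state classify_rx_slot(const struct model_rx_desc *desc,
                                           const struct model_rx_buffer *buf)
{
    assert(desc != NULL && buf != NULL);

    if (!(desc->status & RX_STAT_DD))
        return RX_SLOT_NOT_READY;
    if (buf->skb == NULL)
        return RX_SLOT_CORRUPT;
    return RX_SLOT_OK;
}
```

In the reported oops the slot would fall into the RX_SLOT_CORRUPT case: R12 is zero and CR2 is 0x6c in the register dump, which is consistent with a field read at a small offset through a NULL skb pointer.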
What part are you running this on? Is it an 82599? If so, you might be running into a FIFO corruption issue; however, it was my understanding that the driver wouldn't allow you to enable the packet split mode on that part.

If you are using an 82599 you may want to use a newer driver, since we rewrote the receive path to allow us to use paged frames without using the hardware packet split mode.

Thanks,

Alex

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel