Hey Todd,

I am trying 3.14.29 now...

By the way, one thing I did with 3.14.28 before the crash is I ifconfig
down'ed one of the 10G interfaces a day or two earlier. Not sure if
related, but pointing that out just in case useful.

Thanks,
Chris

On Mon, 19 Jan 2015, Fujinaka, Todd wrote:
> Usually this isn't an issue in the driver but in the kernel. Have you
> tried the latest stable or the latest in 3.14 (which is 3.14.29?)
>
> Todd Fujinaka
> Software Application Engineer
> Networking Division (ND)
> Intel Corporation
> todd.fujin...@intel.com
> (503) 712-4565
>
> -----Original Message-----
> From: Chris Caputo [mailto:ccap...@alt.net]
> Sent: Saturday, January 17, 2015 11:34 PM
> To: e1000-devel@lists.sourceforge.net
> Subject: [E1000-devel] kernel 3.14.28 BUG_ON in skb_segment() called by
> ixgbe_poll() and napi
>
> Hi. I am running linux kernel 3.14.28 with related hardware as follows:
>
>   2x Intel Xeon E5420
>   SuperMicro X7DBE+ Rev 2.01
>   Intel 5000P (Blackford) Chipset
>   HotLava Systems Tambora 64G6 Part #6ST2830A2, PCI-e 2.0 (5GT/s), x8, 6-port,
>   Intel 82599ES based, SFP+
>   32GB RAM
>
> Got:
>
> [375129.789047] BUG: unable to handle kernel NULL pointer dereference at 0000000
> [375129.790004] [<ffffffff813a16f5>] napi_gro_flush+0x65/0x80
> [375129.790004] [<ffffffff813a1729>] napi_complete+0x19/0x30
> [375129.790004] [<ffffffff812f9fbe>] ixgbe_poll+0x4ee/0x940
> [375129.790004] [<ffffffff813a183b>] net_rx_action+0xfb/0x1a0
> [375129.790004] [<ffffffff8104ec3c>] __do_softirq+0xdc/0x1f0
> [375129.790004] [<ffffffff8104ef5d>] irq_exit+0x9d/0xb0
> [375129.790004] [<ffffffff81003e33>] do_IRQ+0x53/0xf0
> [375129.790004] [<ffffffff814fddaa>] common_interrupt+0x6a/0x6a
> [375129.790004] <EOI>
> [375129.790004] [<ffffffff81074ac8>] ? sched_clock_cpu+0x88/0xb0
> [375129.790004] [<ffffffff8100a526>] ? default_idle+0x6/0x10
> [375129.790004] [<ffffffff8100ac96>] arch_cpu_idle+0x16/0x20
> [375129.790004] [<ffffffff810863c1>] cpu_startup_entry+0x91/0x180
> [375129.790004] [<ffffffff8102c13f>] start_secondary+0x19f/0x1f0
> [375129.790004] Code: 4c 24 60 eb 21 0f 1f 80 00 00 00 00 41 83 c5 01 49 83 c4 10 48 83 c1 10 41 39 c3 0f 86 7b 01 00 00 41 89 c7 89 c2 45 39 e9 7f 37 <41> 8b 46 6c 41 39 46 68 0f 85 6d 03 00 00 45 8b a6 c4 00 00 00
> [375129.790004] RIP [<ffffffff8139567f>] skb_segment+0x5df/0x980
> [375129.790004] RSP <ffff88082fcc3828>
> [375129.790004] CR2: 000000000000006c
> [375129.790004] ---[ end trace ce413143217a96ad ]---
> [375129.790004] Kernel panic - not syncing: Fatal exception in interrupt
> [375129.790004] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
> [375129.790004] Rebooting in 10 seconds..
>
> And then just after rebooting:
>
> [ 53.268587] BUG: unable to handle kernel NULL pointer dereference at 00000000
> [ 53.269532] [<ffffffff813a1729>] napi_complete+0x19/0x30
> [ 53.269532] [<ffffffff812f9fbe>] ixgbe_poll+0x4ee/0x940
> [ 53.269532] [<ffffffff812032c4>] ? timerqueue_del+0x24/0x70
> [ 53.269532] [<ffffffff81203230>] ? timerqueue_add+0x60/0xb0
> [ 53.269532] [<ffffffff813a183b>] net_rx_action+0xfb/0x1a0
> [ 53.269532] [<ffffffff8104ec3c>] __do_softirq+0xdc/0x1f0
> [ 53.269532] [<ffffffff8104ef5d>] irq_exit+0x9d/0xb0
> [ 53.269532] [<ffffffff81003e33>] do_IRQ+0x53/0xf0
> [ 53.269532] [<ffffffff814fddaa>] common_interrupt+0x6a/0x6a
> [ 53.269532] <EOI>
> [ 53.269532] [<ffffffff8100a526>] ? default_idle+0x6/0x10
> [ 53.269532] [<ffffffff8100ac96>] arch_cpu_idle+0x16/0x20
> [ 53.269532] [<ffffffff810863c1>] cpu_startup_entry+0x91/0x180
> [ 53.269532] [<ffffffff8102c13f>] start_secondary+0x19f/0x1f0
> [ 53.269532] Code: 4c 24 60 eb 21 0f 1f 80 00 00 00 00 41 83 c5 01 49 83 c4 10 48 83 c1 10 41 39 c3 0f 86 7b 01 00 00 41 89 c7 89 c2 45 39 e9 7f 37 <41> 8b 46 6c 41 39 46 68 0f 85 6d 03 00 00 45 8b a6 c4 00 00 00
> [ 53.269532] RIP [<ffffffff8139567f>] skb_segment+0x5df/0x980
> [ 53.269532] RSP <ffff88082fd43840>
> [ 53.269532] CR2: 000000000000006c
> [ 53.269532] ---[ end trace 1c1a68627fa9d6de ]---
> [ 53.269532] Kernel panic - not syncing: Fatal exception in interrupt
> [ 53.269532] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
> [ 53.269532] Rebooting in 10 seconds..
>
> Rebooted again and the system stayed up, but I don't know if it will happen
> again.
>
> The code which triggered the BUG is in skb_segment() in net/core/skbuff.c
> (line 3001 of kernel 3.14.28):
>
>         while (pos < offset + len) {
>                 if (i >= nfrags) {
> >>>>                  BUG_ON(skb_headlen(list_skb));
>
>                         i = 0;
>
> Since the call stack includes ixgbe_poll() each time, I wonder if this might
> be an issue with the ixgbe driver or something others have seen?
>
> Suggestions most welcome.
>
> Thanks,
> Chris
>
> ------------------------------------------------------------------------------
> New Year. New Location. New Benefits. New Data Center in Ashburn, VA.
> GigeNET is offering a free month of service with a new server in Ashburn.
> Choose from 2 high performing configs, both with 100TB of bandwidth.
> Higher redundancy.Lower latency.Increased capacity.Completely compliant.
> http://p.sf.net/sfu/gigenet
> _______________________________________________
> E1000-devel mailing list
> E1000-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/e1000-devel
> To learn more about Intel® Ethernet, visit
> http://communities.intel.com/community/wired
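
In case it helps anyone compare the two traces quoted above, here is a rough Python sketch that pulls the RIP symbol, the CR2 value, and the reliable call-trace symbols out of an oops pasted with one log entry per line. It is not part of the kernel or the ixgbe driver: the function name summarize_oops and both regular expressions are assumptions made up for illustration, and they only cover the line shapes that appear in this report.

```python
import re

# Matches entries such as "[<ffffffff812f9fbe>] ixgbe_poll+0x4ee/0x940".
# Group 1 is set when the entry carries the "?" (unreliable frame) marker.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)
CR2_RE = re.compile(r"CR2:\s*([0-9a-f]+)")


def summarize_oops(text):
    """Return the RIP symbol, CR2 value, and reliable call-trace symbols."""
    summary = {"rip": None, "cr2": None, "frames": []}
    for line in text.splitlines():
        cr2 = CR2_RE.search(line)
        if cr2:
            summary["cr2"] = int(cr2.group(1), 16)
            continue
        frame = FRAME_RE.search(line)
        if not frame:
            continue
        unreliable, symbol = frame.group(1), frame.group(2)
        if "RIP" in line:
            # The RIP line names the function that faulted.
            summary["rip"] = symbol
        elif not unreliable:
            # Entries marked "?" are skipped as possibly stale stack data.
            summary["frames"].append(symbol)
    return summary


# A few lines copied from the first trace in this thread.
SAMPLE = """\
[375129.790004] [<ffffffff813a1729>] napi_complete+0x19/0x30
[375129.790004] [<ffffffff812f9fbe>] ixgbe_poll+0x4ee/0x940
[375129.790004] [<ffffffff81074ac8>] ? sched_clock_cpu+0x88/0xb0
[375129.790004] RIP [<ffffffff8139567f>] skb_segment+0x5df/0x980
[375129.790004] CR2: 000000000000006c
"""

if __name__ == "__main__":
    print(summarize_oops(SAMPLE))
```

Running the same function over both traces makes it easy to check that the RIP symbol and CR2 value match between the two crashes, as they appear to in the logs above.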