Hey Nikolay,

Sorry to hear about your random crashes; it is strange that they only occur on certain hardware. I didn't notice the driver (ixgbe) in any of the stack dumps. Are you just inferring that the driver is a likely candidate because the skb head looks to be messed up?
A couple of things come to mind that you might want to test. Since the failure seems to follow the hardware, you could try to isolate which part of the hardware is responsible: swap out the NICs, maybe even move the network cables. It sounds like you have already changed the memory. Likewise, compare the lspci -vv output of two systems, one failing and one not. Another idea would be to verify that all your systems have the same BIOS. Maybe your MMIO isn't set up correctly? It might also be interesting to upgrade the kernel to the latest stable to see whether the problem is affected by the network stack. Depending on the results, you could isolate the kernel change by using a SourceForge version of ixgbe that runs on both your old kernel and the new one.

Thanks,
-Don Skidmore <donald.c.skidm...@intel.com>

________________________________________
From: Nikolay Borisov [ker...@kyup.com]
Sent: Monday, November 16, 2015 5:56 AM
To: e1000-devel@lists.sourceforge.net
Cc: SiteGround Operations
Subject: [E1000-devel] Random crashes with the IXGBE driver.
Hello list,

I have multiple servers with the following intel NICs:

81:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
    Subsystem: Super Micro Computer Inc Device 0611
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 149
    Region 0: Memory at fb680000 (64-bit, prefetchable) [size=512K]
    Region 2: I/O ports at f020 [size=32]
    Region 4: Memory at fbb04000 (64-bit, prefetchable) [size=16K]
    Expansion ROM at fbe80000 [disabled] [size=512K]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [50] MSI: Enable+ Count=1/1 Maskable+ 64bit+
        Address: 00000000fee20000 Data: 402c
        Masking: 00000000 Pending: 00000000
    Capabilities: [70] MSI-X: Enable- Count=64 Masked-
        Vector table: BAR=4 offset=00000000
        PBA: BAR=4 offset=00002000
    Capabilities: [a0] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 <8us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [e0] Vital Product Data
        Unknown small resource type 06, will not decode more.
    Capabilities: [100 v1] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
        UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
    Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-c2-3a-f4
    Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 1
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap: Migration-, Interrupt Message Number: 000
        IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
        IOVSta: Migration-
        Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
        VF offset: 128, stride: 2, Device ID: 10ed
        Supported Page Size: 00000553, System Page Size: 00000001
        Region 0: Memory at 00000000fba00000 (64-bit, prefetchable)
        Region 3: Memory at 00000000fb900000 (64-bit, prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Kernel driver in use: ixgbe

On 2 of those servers I'm seeing multiple crashes at random intervals with the following backtraces:

[ 1384.635313] general protection fault: 0000 [#1] SMP
[ 1384.635691] Modules linked in: tcp_diag inet_diag act_police cls_basic sch_ingress xt_pkttype xt_state veth netconsole
openvswitch gre vxlan ip_tunnel xt_owner xt_conntrack iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio dm_mirror dm_region_hash dm_log ses enclosure ixgbe i2c_i801 lpc_ich mfd_core igb i2c_algo_bit ioapic ipmi_devintf ipmi_si ipmi_msghandler ioatdma dca aacraid
[ 1384.640533] CPU: 37 PID: 15089 Comm: sshd Not tainted 3.12.49-clouder2 #2
[ 1384.640807] Hardware name: Supermicro PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014
[ 1384.641300] task: ffff8834c167e8d0 ti: ffff8834b1eb6000 task.ti: ffff8834b1eb6000
[ 1384.641578] RIP: 0010:[<ffffffff8114ce99>] [<ffffffff8114ce99>] put_page+0x9/0x40
[ 1384.641915] RSP: 0018:ffff8834b1eb7bc8 EFLAGS: 00010246
[ 1384.642186] RAX: 0000000000000000 RBX: ffff883f75d9c500 RCX: ffffffff81c913c0
[ 1384.642460] RDX: 0000000000000380 RSI: 0000000000000000 RDI: 3973657251413850
[ 1384.642733] RBP: ffff8834b1eb7bc8 R08: 000000001577a222 R09: 0000000000000000
[ 1384.643005] R10: ffff883d8d5d6e80 R11: 0000000000000000 R12: ffff883700000246
[ 1384.643276] R13: 0000000000000001 R14: ffff883d8d5d6ef0 R15: 0000000000000000
[ 1384.643548] FS: 00007ff0b7c187c0(0000) GS:ffff883fff440000(0000) knlGS:0000000000000000
[ 1384.643818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1384.644084] CR2: 000000000040cea4 CR3: 00000034dc027000 CR4: 00000000001407e0
[ 1384.644352] Stack:
[ 1384.644613] ffff8834b1eb7bf8 ffffffff81578e3c ffff8834b1eb7bf8 ffff883f75d9c500
[ 1384.645077] ffff883d8d5d6e80 ffff883d8d5d72dc ffff8834b1eb7c18 ffffffff81578ee8
[ 1384.645542] ffff8834b1eb7c28 ffff883f75d9c500 ffff8834b1eb7c38 ffffffff81578f46
[ 1384.646007] Call Trace:
[ 1384.646279] [<ffffffff81578e3c>] skb_release_data+0x7c/0x100
[ 1384.646562] [<ffffffff81578ee8>] skb_release_all+0x28/0x30
[ 1384.646836] [<ffffffff81578f46>] __kfree_skb+0x16/0xa0
[ 1384.647111] [<ffffffff815d35f0>] tcp_recvmsg+0x990/0xcc0
[ 1384.647387] [<ffffffff815fb509>] inet_recvmsg+0x89/0xa0
[ 1384.647663] [<ffffffff8156c3be>] sock_aio_read+0x13e/0x150
[ 1384.647940] [<ffffffff811a89df>] do_sync_read+0x5f/0xa0
[ 1384.648212] [<ffffffff811a8bed>] ? rw_verify_area+0x5d/0xe0
[ 1384.648483] [<ffffffff811a8ef3>] vfs_read+0x113/0x130
[ 1384.648755] [<ffffffff811a935f>] SyS_read+0x5f/0xb0
[ 1384.649027] [<ffffffff8132e25e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[ 1384.649303] [<ffffffff816496f2>] system_call_fastpath+0x16/0x1b
[ 1384.649574] Code: 42 1c e9 5b ff ff ff 4c 89 e7 e8 23 fe ff ff e9 ad fe ff ff 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 <66> f7 07 00 c0 75 17 f0 ff 4f 1c 0f 94 c0 84 c0 75 05 c9 c3 0f
[ 1384.653374] RIP [<ffffffff8114ce99>] put_page+0x9/0x40
[ 1384.653694] RSP <ffff8834b1eb7bc8>

or

[818555.444138] general protection fault: 0000 [#1] SMP
[818555.444293] Modules linked in: ixgbe tcp_diag inet_diag act_police cls_basic sch_ingress xt_LOG xt_limit xt_addrtype xt_pkttype xt_state veth netconsole openvswitch gre vxlan ip_tunnel xt_owner xt_conntrack iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio dm_mirror dm_region_hash dm_log ses enclosure i2c_i801 lpc_ich mfd_core igb i2c_algo_bit ioapic ipmi_devintf ipmi_si ipmi_msghandler ioatdma dca aacraid [last unloaded: ixgbe]
[818555.447546] CPU: 12 PID: 9179 Comm: nginx Not tainted 3.12.49-clouder4-nproc #1
[818555.447604] Hardware name: Supermicro PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014
[818555.447667] task: ffff881b1d66e8d0 ti: ffff881ecbec8000 task.ti: ffff881ecbec8000
[818555.447724] RIP: 0010:[<ffffffff8157d5ec>] [<ffffffff8157d5ec>] skb_copy_datagram_iovec+0x19c/0x2b0
[818555.447838] RSP: 0018:ffff881ecbec9b68 EFLAGS: 00010206
[818555.447892] RAX: 0000000000000000 RBX: 000000000000010a RCX: 000000000000010a
[818555.447949] RDX: 000000000000010a RSI: 0000000000000000 RDI: ffff883bd56f3500
[818555.448006] RBP: ffff881ecbec9bc8 R08: 0000000000000000 R09: 0000000000000000
[818555.448063] R10: ffff8832974df500 R11: 0000000000000000 R12: 0000000000000000
[818555.448336] R13: 0731000007300000 R14: 0000000000000000 R15: 0000000000000000
[818555.448609] FS: 00002ad13b927a80(0000) GS:ffff883fff080000(0000) knlGS:0000000000000000
[818555.448883] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[818555.449153] CR2: ffffffffff600400 CR3: 0000001ecf2ab000 CR4: 00000000001407e0
[818555.449425] Stack:
[818555.449688] ffff8832974df500 0000000000000040 ffff881ecbec9e48 ffff881ecbec9d98
[818555.450148] ffff881ecbec9e88 ffffffff00000000 000000000000000a ffff883bd56f3500
[818555.450605] 0000000000000000 ffff8832974df95c ffff8832974df570 0000000000000000
[818555.451060] Call Trace:
[818555.451327] [<ffffffff815d3cdb>] tcp_recvmsg+0x75b/0xcb0
[818555.451594] [<ffffffff815fbe19>] inet_recvmsg+0x89/0xa0
[818555.451864] [<ffffffff8156ece3>] sock_recvmsg+0xa3/0xd0
[818555.452138] [<ffffffff811f0959>] ? ep_send_events_proc+0xa9/0x170
[818555.452412] [<ffffffff8156edfe>] SYSC_recvfrom+0xee/0x170
[818555.452684] [<ffffffff8156ee8e>] SyS_recvfrom+0xe/0x10
[818555.452957] [<ffffffff81649ff2>] system_call_fastpath+0x16/0x1b
[818555.453227] Code: 0f b6 10 44 39 fa 0f 8f 23 ff ff ff 4c 8b 68 08 4d 85 ed 74 5b 44 89 f0 0f 1f 80 00 00 00 00 42 8d 14 23 39 c2 0f 8c a6 00 00 00 <45> 8b 7d 68 41 01 c7 45 89 fe 45 29 e6 45 85 f6 7e 27 41 39 de
[818555.456980] RIP [<ffffffff8157d5ec>] skb_copy_datagram_iovec+0x19c/0x2b0
[818555.457298] RSP <ffff881ecbec9b68>

or

[409292.391088] ------------[ cut here ]------------
[409292.391369] kernel BUG at mm/slub.c:3336!
[409292.391640] invalid opcode: 0000 [#1] SMP
[409292.392005] Modules linked in: tcp_diag inet_diag act_police cls_basic sch_ingress xt_LOG xt_limit xt_addrtype xt_pkttype xt_state veth netconsole openvswitch gre vxlan ip_tunnel xt_owner xt_conntrack iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio dm_mirror dm_region_hash dm_log ses enclosure ixgbe i2c_i801 lpc_ich mfd_core igb i2c_algo_bit ioapic ipmi_devintf ipmi_si ipmi_msghandler ioatdma dca aacraid
[409292.396984] CPU: 10 PID: 37087 Comm: kworker/u80:0 Not tainted 3.12.49-clouder4-nproc #1
[409292.397260] Hardware name: Supermicro PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014
[409292.397761] Workqueue: dm-thin do_worker [dm_thin_pool]
[409292.398077] task: ffff883f558c8810 ti: ffff882bf6ab2000 task.ti: ffff882bf6ab2000
[409292.398351] RIP: 0010:[<ffffffff8118dd62>] [<ffffffff8118dd62>] kfree+0x172/0x180
[409292.398682] RSP: 0018:ffff883fff003cb0 EFLAGS: 00010246
[409292.398951] RAX: 06fc000000000000 RBX: ffff883f00000106 RCX: 0000000000000001
[409292.399225] RDX: 0000000000000000 RSI: 0000003fac51189a RDI: ffffea00fc000000
[409292.399500] RBP: ffff883fff003cd0 R08: 0000000000000000 R09: 0000000000000003
[409292.399771] R10: 0000000000000003 R11: ffff883fff003e68 R12: ffff883f000003c6
[409292.400044] R13: ffffffff8157967e R14: ffff881fce14d130 R15: ffff883fd2ebb140
[409292.400316] FS: 0000000000000000(0000) GS:ffff883fff000000(0000) knlGS:0000000000000000
[409292.400590] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[409292.400862] CR2: 0000000000f15720 CR3: 0000001c86ecb000 CR4: 00000000001407e0
[409292.401137] Stack:
[409292.401402] ffff881d12fdecc0 ffff881cae745700 ffff883f000003c6 ffff881fce14d120
[409292.401866] ffff883fff003ce0 ffffffff8157967e ffff883fff003d10 ffffffff8157979c
[409292.402329] 0000000000000000 ffff881cae745700 ffff881cae745700 ffff881fce14d120
[409292.402792] Call Trace:
[409292.403057] <IRQ>
[409292.403107]
[409292.403422] [<ffffffff8157967e>] skb_free_head+0x1e/0x80
[409292.403692] [<ffffffff8157979c>] skb_release_data+0xbc/0x100
[409292.403963] [<ffffffff81579808>] skb_release_all+0x28/0x30
[409292.404233] [<ffffffff81579866>] __kfree_skb+0x16/0xa0
[409292.404507] [<ffffffff81579c31>] consume_skb+0x31/0x90
[409292.404780] [<ffffffff815850dd>] dev_kfree_skb_any+0x3d/0x50
[409292.405055] [<ffffffffa00b311c>] ixgbe_poll+0xec/0x6b0 [ixgbe]
[409292.405328] [<ffffffff8158ae1c>] net_rx_action+0x12c/0x280
[409292.405604] [<ffffffff8108ef77>] __do_softirq+0x137/0x2e0
[409292.405878] [<ffffffff8164b78c>] call_softirq+0x1c/0x30
[409292.406153] [<ffffffff8104a35d>] do_softirq+0x8d/0xc0
[409292.406426] [<ffffffff8108eb15>] irq_exit+0x95/0xa0
[409292.406695] [<ffffffff8164bcf6>] do_IRQ+0x66/0xe0
[409292.406968] [<ffffffff8164956f>] common_interrupt+0x6f/0x6f
[409292.407241] <EOI>
[409292.407290]
[409292.407603] [<ffffffff81141107>] ? mempool_free_slab+0x17/0x20
[409292.407879] [<ffffffffa014a264>] ? do_worker+0xd4/0x270 [dm_thin_pool]
[409292.408154] [<ffffffffa014a281>] ? do_worker+0xf1/0x270 [dm_thin_pool]
[409292.408432] [<ffffffff810a64d5>] process_one_work+0x195/0x550
[409292.408700] [<ffffffff810a877a>] worker_thread+0x13a/0x430
[409292.408967] [<ffffffff810a8640>] ? manage_workers+0x2c0/0x2c0
[409292.409235] [<ffffffff810ae77e>] kthread+0xce/0xe0
[409292.409500] [<ffffffff810ae6b0>] ? kthread_freezable_should_stop+0x80/0x80
[409292.409770] [<ffffffff81649f48>] ret_from_fork+0x58/0x90
[409292.410038] [<ffffffff810ae6b0>] ?
kthread_freezable_should_stop+0x80/0x80
[409292.410308] Code: 2a 48 8b 07 31 f6 f6 c4 40 74 03 8b 77 68 e8 16 b1 fb ff e9 73 ff ff ff 48 8b 47 30 48 8b 17 66 85 d2 48 0f 48 f8 e9 fe fe ff ff <0f> 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 55 41
[409292.414104] RIP [<ffffffff8118dd62>] kfree+0x172/0x180
[409292.414423] RSP <ffff883fff003cb0>

All of these are actually due to memory corruption of the skb->head pointer. Strangely enough, collating the information from the various crashes reveals the following pattern:

sk_buff.head = ffff883700000106
sk_buff.head = ffff883400000106
sk_buff.head = ffff883500000106
sk_buff.head = ffff883c00000106
sk_buff.head = ffff883800000106
sk_buff.head = ffff883f00000106
sk_buff.head = ffff883100000106

I obtained those values by manually reading the assembly that led to each crash; I have also extracted the respective SKBs. At first I thought this could be a memory problem, but replacing all the memory banks didn't change the situation. Initially I observed those crashes on the 3.12.47 kernel and later updated to 3.12.49, still with no change. I have tested the 4.1.5 ixgbe driver (out of tree, from the sf.net page) with 3.12.49, and crashes still occur. I have also reverted to using 3.23 (still an out-of-tree driver), with no change. I've yet to test with the stock driver, which I believe for kernel 3.12.49 is 3.15.1-k. At this point I'm inclined to believe that this might be due to a driver bug which causes corruption in the SKB and eventually leads to the aforementioned crashes.

Any ideas or pointers on how I may proceed to find the root cause of those crashes? I'm happy to provide any information you might consider important. What puzzles me is why, out of 7 servers with similar hardware, only 2 are exhibiting this behavior.
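The pattern in the sk_buff.head values can be made explicit by splitting each value into its upper and lower 32 bits. What follows is only a sketch in Python: the helper names are made up for illustration, and the input is just the seven values listed above. Every lower half comes out as 0x106, while the upper halves all keep the ffff88 prefix that the valid kernel pointers elsewhere in the traces also carry. One possible reading, which this does not prove, is that something stored the 32-bit value 0x106 over the low half of an otherwise valid pointer.

```python
# Corrupted sk_buff.head values exactly as reported in this message.
HEADS = [
    0xffff883700000106,
    0xffff883400000106,
    0xffff883500000106,
    0xffff883c00000106,
    0xffff883800000106,
    0xffff883f00000106,
    0xffff883100000106,
]


def split_halves(value):
    """Return (upper 32 bits, lower 32 bits) of a 64-bit value."""
    return value >> 32, value & 0xFFFFFFFF


def summarize(values):
    """Return the sorted distinct upper halves and distinct lower halves."""
    uppers = sorted({split_halves(v)[0] for v in values})
    lowers = sorted({split_halves(v)[1] for v in values})
    return uppers, lowers


if __name__ == "__main__":
    for v in HEADS:
        hi, lo = split_halves(v)
        print(f"{v:#018x}  upper={hi:#010x}  lower={lo:#010x}")
    uppers, lowers = summarize(HEADS)
    print("distinct lower halves:", [hex(x) for x in lowers])
    print("distinct upper halves:", [hex(x) for x in uppers])
```

A single distinct lower half across crashes seems hard to explain as random bit flips, which would be consistent with the memory replacement not helping.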
Regards,
Nikolay

------------------------------------------------------------------------------
Presto, an open source distributed SQL query engine for big data, initially developed by Facebook, enables you to easily query your data on Hadoop in a more interactive manner. Teradata is also now providing full enterprise support for Presto. Download a free open source copy now. http://pubads.g.doubleclick.net/gampad/clk?id=250295911&iu=/4140
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
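For the lspci -vv comparison suggested in the reply at the top of this thread, a small script can cut down the noise. This is a sketch only: it assumes the output of `lspci -vv -s 81:00.0` was saved to a text file on one failing and one healthy server (the file names below are hypothetical), and it prints a unified diff of the two. Fields such as the IRQ number, MSI address and device serial number are expected to differ between hosts; settings like MaxPayload, MaxReadReq, link speed and width, or the AER status bits are probably the more interesting ones to look at.

```python
import difflib
import sys


def diff_lspci(good_text, bad_text, good_name="good", bad_name="bad"):
    """Return unified-diff lines between two saved `lspci -vv` dumps."""
    # Strip trailing whitespace so it does not show up as a difference.
    good_lines = [line.rstrip() for line in good_text.splitlines()]
    bad_lines = [line.rstrip() for line in bad_text.splitlines()]
    return list(
        difflib.unified_diff(
            good_lines,
            bad_lines,
            fromfile=good_name,
            tofile=bad_name,
            lineterm="",
            n=1,  # one line of context keeps the owning field visible
        )
    )


if __name__ == "__main__":
    # Hypothetical usage: python3 lspci_diff.py good.txt bad.txt
    if len(sys.argv) == 3:
        with open(sys.argv[1], errors="replace") as f:
            good = f.read()
        with open(sys.argv[2], errors="replace") as f:
            bad = f.read()
        for line in diff_lspci(good, bad, sys.argv[1], sys.argv[2]):
            print(line)
    else:
        print("usage: lspci_diff.py GOOD_DUMP BAD_DUMP")
```

An empty result means the two dumps are textually identical after trailing whitespace is stripped; it says nothing about settings that lspci does not report.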