Hello list,

I have multiple servers with the following Intel NICs:

81:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
        Subsystem: Super Micro Computer Inc Device 0611
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 149
        Region 0: Memory at fb680000 (64-bit, prefetchable) [size=512K]
        Region 2: I/O ports at f020 [size=32]
        Region 4: Memory at fbb04000 (64-bit, prefetchable) [size=16K]
        Expansion ROM at fbe80000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable+ Count=1/1 Maskable+ 64bit+
                Address: 00000000fee20000 Data: 402c
                Masking: 00000000 Pending: 00000000
        Capabilities: [70] MSI-X: Enable- Count=64 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00002000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 <8us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis- Transmit
Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                        Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                        EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [e0] Vital Product Data
                Unknown small resource type 06, will not decode more.
        Capabilities: [100 v1] Advanced Error Reporting
                UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-c2-3a-f4
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
                IOVSta: Migration-
                Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 128, stride: 2, Device ID: 10ed
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 00000000fba00000 (64-bit, prefetchable)
                Region 3: Memory at 00000000fb900000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Kernel driver in use: ixgbe

On 2 of those servers I'm seeing multiple crashes at random intervals with the following backtraces:

[ 1384.635313] general protection fault: 0000 [#1] SMP
[ 1384.635691] Modules linked in: tcp_diag inet_diag act_police cls_basic sch_ingress xt_pkttype xt_state veth netconsole openvswitch gre vxlan ip_tunnel xt_owner xt_conntrack iptable_mangle xt_nat
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio dm_mirror dm_region_hash dm_log ses enclosure ixgbe i2c_i801 lpc_ich mfd_core igb i2c_algo_bit ioapic ipmi_devintf ipmi_si ipmi_msghandler ioatdma dca aacraid
[ 1384.640533] CPU: 37 PID: 15089 Comm: sshd Not tainted 3.12.49-clouder2 #2
[ 1384.640807] Hardware name: Supermicro PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014
[ 1384.641300] task: ffff8834c167e8d0 ti: ffff8834b1eb6000 task.ti: ffff8834b1eb6000
[ 1384.641578] RIP: 0010:[<ffffffff8114ce99>] [<ffffffff8114ce99>] put_page+0x9/0x40
[ 1384.641915] RSP: 0018:ffff8834b1eb7bc8 EFLAGS: 00010246
[ 1384.642186] RAX: 0000000000000000 RBX: ffff883f75d9c500 RCX: ffffffff81c913c0
[ 1384.642460] RDX: 0000000000000380 RSI: 0000000000000000 RDI: 3973657251413850
[ 1384.642733] RBP: ffff8834b1eb7bc8 R08: 000000001577a222 R09: 0000000000000000
[ 1384.643005] R10: ffff883d8d5d6e80 R11: 0000000000000000 R12: ffff883700000246
[ 1384.643276] R13: 0000000000000001 R14: ffff883d8d5d6ef0 R15: 0000000000000000
[ 1384.643548] FS: 00007ff0b7c187c0(0000) GS:ffff883fff440000(0000) knlGS:0000000000000000
[ 1384.643818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1384.644084] CR2: 000000000040cea4 CR3: 00000034dc027000 CR4: 00000000001407e0
[ 1384.644352] Stack:
[ 1384.644613] ffff8834b1eb7bf8 ffffffff81578e3c ffff8834b1eb7bf8 ffff883f75d9c500
[ 1384.645077] ffff883d8d5d6e80 ffff883d8d5d72dc ffff8834b1eb7c18 ffffffff81578ee8
[ 1384.645542] ffff8834b1eb7c28 ffff883f75d9c500 ffff8834b1eb7c38 ffffffff81578f46
[ 1384.646007] Call Trace:
[ 1384.646279] [<ffffffff81578e3c>] skb_release_data+0x7c/0x100
[ 1384.646562] [<ffffffff81578ee8>] skb_release_all+0x28/0x30
[ 1384.646836] [<ffffffff81578f46>] __kfree_skb+0x16/0xa0
[ 1384.647111] [<ffffffff815d35f0>]
tcp_recvmsg+0x990/0xcc0
[ 1384.647387] [<ffffffff815fb509>] inet_recvmsg+0x89/0xa0
[ 1384.647663] [<ffffffff8156c3be>] sock_aio_read+0x13e/0x150
[ 1384.647940] [<ffffffff811a89df>] do_sync_read+0x5f/0xa0
[ 1384.648212] [<ffffffff811a8bed>] ? rw_verify_area+0x5d/0xe0
[ 1384.648483] [<ffffffff811a8ef3>] vfs_read+0x113/0x130
[ 1384.648755] [<ffffffff811a935f>] SyS_read+0x5f/0xb0
[ 1384.649027] [<ffffffff8132e25e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[ 1384.649303] [<ffffffff816496f2>] system_call_fastpath+0x16/0x1b
[ 1384.649574] Code: 42 1c e9 5b ff ff ff 4c 89 e7 e8 23 fe ff ff e9 ad fe ff ff 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 <66> f7 07 00 c0 75 17 f0 ff 4f 1c 0f 94 c0 84 c0 75 05 c9 c3 0f
[ 1384.653374] RIP [<ffffffff8114ce99>] put_page+0x9/0x40
[ 1384.653694] RSP <ffff8834b1eb7bc8>

or

[818555.444138] general protection fault: 0000 [#1] SMP
[818555.444293] Modules linked in: ixgbe tcp_diag inet_diag act_police cls_basic sch_ingress xt_LOG xt_limit xt_addrtype xt_pkttype xt_state veth netconsole openvswitch gre vxlan ip_tunnel xt_owner xt_conntrack iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio dm_mirror dm_region_hash dm_log ses enclosure i2c_i801 lpc_ich mfd_core igb i2c_algo_bit ioapic ipmi_devintf ipmi_si ipmi_msghandler ioatdma dca aacraid [last unloaded: ixgbe]
[818555.447546] CPU: 12 PID: 9179 Comm: nginx Not tainted 3.12.49-clouder4-nproc #1
[818555.447604] Hardware name: Supermicro PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014
[818555.447667] task: ffff881b1d66e8d0 ti: ffff881ecbec8000 task.ti: ffff881ecbec8000
[818555.447724] RIP: 0010:[<ffffffff8157d5ec>] [<ffffffff8157d5ec>] skb_copy_datagram_iovec+0x19c/0x2b0
[818555.447838] RSP: 0018:ffff881ecbec9b68 EFLAGS: 00010206
[818555.447892] RAX: 0000000000000000 RBX: 000000000000010a RCX: 000000000000010a
[818555.447949] RDX: 000000000000010a RSI: 0000000000000000 RDI: ffff883bd56f3500
[818555.448006] RBP: ffff881ecbec9bc8 R08: 0000000000000000 R09: 0000000000000000
[818555.448063] R10: ffff8832974df500 R11: 0000000000000000 R12: 0000000000000000
[818555.448336] R13: 0731000007300000 R14: 0000000000000000 R15: 0000000000000000
[818555.448609] FS: 00002ad13b927a80(0000) GS:ffff883fff080000(0000) knlGS:0000000000000000
[818555.448883] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[818555.449153] CR2: ffffffffff600400 CR3: 0000001ecf2ab000 CR4: 00000000001407e0
[818555.449425] Stack:
[818555.449688] ffff8832974df500 0000000000000040 ffff881ecbec9e48 ffff881ecbec9d98
[818555.450148] ffff881ecbec9e88 ffffffff00000000 000000000000000a ffff883bd56f3500
[818555.450605] 0000000000000000 ffff8832974df95c ffff8832974df570 0000000000000000
[818555.451060] Call Trace:
[818555.451327] [<ffffffff815d3cdb>] tcp_recvmsg+0x75b/0xcb0
[818555.451594] [<ffffffff815fbe19>] inet_recvmsg+0x89/0xa0
[818555.451864] [<ffffffff8156ece3>] sock_recvmsg+0xa3/0xd0
[818555.452138] [<ffffffff811f0959>] ? ep_send_events_proc+0xa9/0x170
[818555.452412] [<ffffffff8156edfe>] SYSC_recvfrom+0xee/0x170
[818555.452684] [<ffffffff8156ee8e>] SyS_recvfrom+0xe/0x10
[818555.452957] [<ffffffff81649ff2>] system_call_fastpath+0x16/0x1b
[818555.453227] Code: 0f b6 10 44 39 fa 0f 8f 23 ff ff ff 4c 8b 68 08 4d 85 ed 74 5b 44 89 f0 0f 1f 80 00 00 00 00 42 8d 14 23 39 c2 0f 8c a6 00 00 00 <45> 8b 7d 68 41 01 c7 45 89 fe 45 29 e6 45 85 f6 7e 27 41 39 de
[818555.456980] RIP [<ffffffff8157d5ec>] skb_copy_datagram_iovec+0x19c/0x2b0
[818555.457298] RSP <ffff881ecbec9b68>

or

[409292.391088] ------------[ cut here ]------------
[409292.391369] kernel BUG at mm/slub.c:3336!
[409292.391640] invalid opcode: 0000 [#1] SMP
[409292.392005] Modules linked in: tcp_diag inet_diag act_police cls_basic sch_ingress xt_LOG xt_limit xt_addrtype xt_pkttype xt_state veth netconsole openvswitch gre vxlan ip_tunnel xt_owner xt_conntrack iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio dm_mirror dm_region_hash dm_log ses enclosure ixgbe i2c_i801 lpc_ich mfd_core igb i2c_algo_bit ioapic ipmi_devintf ipmi_si ipmi_msghandler ioatdma dca aacraid
[409292.396984] CPU: 10 PID: 37087 Comm: kworker/u80:0 Not tainted 3.12.49-clouder4-nproc #1
[409292.397260] Hardware name: Supermicro PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014
[409292.397761] Workqueue: dm-thin do_worker [dm_thin_pool]
[409292.398077] task: ffff883f558c8810 ti: ffff882bf6ab2000 task.ti: ffff882bf6ab2000
[409292.398351] RIP: 0010:[<ffffffff8118dd62>] [<ffffffff8118dd62>] kfree+0x172/0x180
[409292.398682] RSP: 0018:ffff883fff003cb0 EFLAGS: 00010246
[409292.398951] RAX: 06fc000000000000 RBX: ffff883f00000106 RCX: 0000000000000001
[409292.399225] RDX: 0000000000000000 RSI: 0000003fac51189a RDI: ffffea00fc000000
[409292.399500] RBP: ffff883fff003cd0 R08: 0000000000000000 R09: 0000000000000003
[409292.399771] R10: 0000000000000003 R11: ffff883fff003e68 R12: ffff883f000003c6
[409292.400044] R13: ffffffff8157967e R14: ffff881fce14d130 R15: ffff883fd2ebb140
[409292.400316] FS: 0000000000000000(0000) GS:ffff883fff000000(0000) knlGS:0000000000000000
[409292.400590] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[409292.400862] CR2: 0000000000f15720 CR3: 0000001c86ecb000 CR4: 00000000001407e0
[409292.401137] Stack:
[409292.401402] ffff881d12fdecc0 ffff881cae745700 ffff883f000003c6 ffff881fce14d120
[409292.401866] ffff883fff003ce0 ffffffff8157967e ffff883fff003d10
ffffffff8157979c
[409292.402329] 0000000000000000 ffff881cae745700 ffff881cae745700 ffff881fce14d120
[409292.402792] Call Trace:
[409292.403057] <IRQ>
[409292.403107]
[409292.403422] [<ffffffff8157967e>] skb_free_head+0x1e/0x80
[409292.403692] [<ffffffff8157979c>] skb_release_data+0xbc/0x100
[409292.403963] [<ffffffff81579808>] skb_release_all+0x28/0x30
[409292.404233] [<ffffffff81579866>] __kfree_skb+0x16/0xa0
[409292.404507] [<ffffffff81579c31>] consume_skb+0x31/0x90
[409292.404780] [<ffffffff815850dd>] dev_kfree_skb_any+0x3d/0x50
[409292.405055] [<ffffffffa00b311c>] ixgbe_poll+0xec/0x6b0 [ixgbe]
[409292.405328] [<ffffffff8158ae1c>] net_rx_action+0x12c/0x280
[409292.405604] [<ffffffff8108ef77>] __do_softirq+0x137/0x2e0
[409292.405878] [<ffffffff8164b78c>] call_softirq+0x1c/0x30
[409292.406153] [<ffffffff8104a35d>] do_softirq+0x8d/0xc0
[409292.406426] [<ffffffff8108eb15>] irq_exit+0x95/0xa0
[409292.406695] [<ffffffff8164bcf6>] do_IRQ+0x66/0xe0
[409292.406968] [<ffffffff8164956f>] common_interrupt+0x6f/0x6f
[409292.407241] <EOI>
[409292.407290]
[409292.407603] [<ffffffff81141107>] ? mempool_free_slab+0x17/0x20
[409292.407879] [<ffffffffa014a264>] ? do_worker+0xd4/0x270 [dm_thin_pool]
[409292.408154] [<ffffffffa014a281>] ? do_worker+0xf1/0x270 [dm_thin_pool]
[409292.408432] [<ffffffff810a64d5>] process_one_work+0x195/0x550
[409292.408700] [<ffffffff810a877a>] worker_thread+0x13a/0x430
[409292.408967] [<ffffffff810a8640>] ? manage_workers+0x2c0/0x2c0
[409292.409235] [<ffffffff810ae77e>] kthread+0xce/0xe0
[409292.409500] [<ffffffff810ae6b0>] ? kthread_freezable_should_stop+0x80/0x80
[409292.409770] [<ffffffff81649f48>] ret_from_fork+0x58/0x90
[409292.410038] [<ffffffff810ae6b0>] ?
kthread_freezable_should_stop+0x80/0x80
[409292.410308] Code: 2a 48 8b 07 31 f6 f6 c4 40 74 03 8b 77 68 e8 16 b1 fb ff e9 73 ff ff ff 48 8b 47 30 48 8b 17 66 85 d2 48 0f 48 f8 e9 fe fe ff ff <0f> 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 55 41
[409292.414104] RIP [<ffffffff8118dd62>] kfree+0x172/0x180
[409292.414423] RSP <ffff883fff003cb0>

All of these are actually due to memory corruption of the skb->head pointer. Strangely enough, collating the information from the various crashes reveals the following pattern:

sk_buff.head = ffff883700000106
sk_buff.head = ffff883400000106
sk_buff.head = ffff883500000106
sk_buff.head = ffff883c00000106
sk_buff.head = ffff883800000106
sk_buff.head = ffff883f00000106
sk_buff.head = ffff883100000106

I obtained these values by manually reading the assembly that led to each crash, and I have also extracted the respective SKBs.

At first I thought this could be a memory problem, but replacing all the memory banks didn't change the situation. Initially I observed those crashes on a 3.12.47 kernel and later updated to 3.12.49 - still no change. I have tested version 4.1.5 of the out-of-tree ixgbe driver (from the sf.net page) with 3.12.49 - the crashes still occur. I have also reverted to version 3.23 (still the out-of-tree driver) - no change. I've yet to test with the stock driver, which I believe for kernel 3.12.49 is 3.15.1-k.

At this point I'm inclined to believe that this might be due to a driver bug that corrupts the SKB, which eventually leads to the aforementioned crashes. Any ideas or pointers on how I may proceed to find the root cause of those crashes? I'm happy to provide any information you might consider important. What puzzles me is why, out of 7 servers with similar hardware, only 2 are exhibiting this behavior.
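
To make the pattern in the sk_buff.head values above easier to see, here is a minimal Python sketch that splits each value into its upper and lower 32 bits. It is only an illustration and was not used to produce the analysis above; the names HEADS, split_pointer and summarize are made up for this example. The output suggests that only the lower half of the pointer is overwritten, always with 0x00000106, while the upper half still looks like a plausible kernel address.

```python
# Corrupted skb->head values copied from the list above.
HEADS = [
    0xffff883700000106,
    0xffff883400000106,
    0xffff883500000106,
    0xffff883c00000106,
    0xffff883800000106,
    0xffff883f00000106,
    0xffff883100000106,
]


def split_pointer(value):
    """Return (upper 32 bits, lower 32 bits) of a 64-bit value."""
    return value >> 32, value & 0xFFFFFFFF


def summarize(heads):
    """Collect the distinct upper and lower halves seen across all values."""
    uppers = set()
    lowers = set()
    for head in heads:
        upper, lower = split_pointer(head)
        uppers.add(upper)
        lowers.add(lower)
    return sorted(uppers), sorted(lowers)


uppers, lowers = summarize(HEADS)
print("distinct upper halves:", [hex(u) for u in uppers])
print("distinct lower halves:", [hex(l) for l in lowers])
```

A single distinct lower half combined with varying upper halves would be consistent with a 32-bit store landing on the low word of the pointer, though the script by itself cannot say what performs that store.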

Regards,
Nikolay