Hello list, 

I have multiple servers with the following Intel NICs:

81:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ 
Network Connection (rev 01)
        Subsystem: Super Micro Computer Inc Device 0611
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- 
<MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 149
        Region 0: Memory at fb680000 (64-bit, prefetchable) [size=512K]
        Region 2: I/O ports at f020 [size=32]
        Region 4: Memory at fbb04000 (64-bit, prefetchable) [size=16K]
        Expansion ROM at fbe80000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable+ Count=1/1 Maskable+ 64bit+
                Address: 00000000fee20000  Data: 402c
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable- Count=64 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00002000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, 
L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ 
Unsupported+
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- 
TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 
unlimited, L1 <8us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- 
BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, 
OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, 
OBFF Disabled
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, 
EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, 
EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, 
LinkEqualizationRequest-
        Capabilities: [e0] Vital Product Data
                Unknown small resource type 06, will not decode more.
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- 
RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- 
RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- 
RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-c2-3a-f4
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
                IOVSta: Migration-
                Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function 
Dependency Link: 00
                VF offset: 128, stride: 2, Device ID: 10ed
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 00000000fba00000 (64-bit, prefetchable)
                Region 3: Memory at 00000000fb900000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Kernel driver in use: ixgbe


On 2 of those servers I'm seeing multiple crashes at random intervals with the 
following backtraces: 

[ 1384.635313] general protection fault: 0000 [#1] SMP 
[ 1384.635691] Modules linked in: tcp_diag inet_diag act_police cls_basic 
sch_ingress xt_pkttype xt_state veth netconsole openvswitch gre vxlan ip_tunnel 
xt_owner xt_conntrack iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4 
nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ib_ipoib 
rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad 
ib_core ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio dm_mirror 
dm_region_hash dm_log ses enclosure ixgbe i2c_i801 lpc_ich mfd_core igb 
i2c_algo_bit ioapic ipmi_devintf ipmi_si ipmi_msghandler ioatdma dca aacraid
[ 1384.640533] CPU: 37 PID: 15089 Comm: sshd Not tainted 3.12.49-clouder2 #2
[ 1384.640807] Hardware name: Supermicro 
PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014
[ 1384.641300] task: ffff8834c167e8d0 ti: ffff8834b1eb6000 task.ti: 
ffff8834b1eb6000
[ 1384.641578] RIP: 0010:[<ffffffff8114ce99>]  [<ffffffff8114ce99>] 
put_page+0x9/0x40
[ 1384.641915] RSP: 0018:ffff8834b1eb7bc8  EFLAGS: 00010246
[ 1384.642186] RAX: 0000000000000000 RBX: ffff883f75d9c500 RCX: ffffffff81c913c0
[ 1384.642460] RDX: 0000000000000380 RSI: 0000000000000000 RDI: 3973657251413850
[ 1384.642733] RBP: ffff8834b1eb7bc8 R08: 000000001577a222 R09: 0000000000000000
[ 1384.643005] R10: ffff883d8d5d6e80 R11: 0000000000000000 R12: ffff883700000246
[ 1384.643276] R13: 0000000000000001 R14: ffff883d8d5d6ef0 R15: 0000000000000000
[ 1384.643548] FS:  00007ff0b7c187c0(0000) GS:ffff883fff440000(0000) 
knlGS:0000000000000000
[ 1384.643818] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1384.644084] CR2: 000000000040cea4 CR3: 00000034dc027000 CR4: 00000000001407e0
[ 1384.644352] Stack:
[ 1384.644613]  ffff8834b1eb7bf8 ffffffff81578e3c ffff8834b1eb7bf8 
ffff883f75d9c500
[ 1384.645077]  ffff883d8d5d6e80 ffff883d8d5d72dc ffff8834b1eb7c18 
ffffffff81578ee8
[ 1384.645542]  ffff8834b1eb7c28 ffff883f75d9c500 ffff8834b1eb7c38 
ffffffff81578f46
[ 1384.646007] Call Trace:
[ 1384.646279]  [<ffffffff81578e3c>] skb_release_data+0x7c/0x100
[ 1384.646562]  [<ffffffff81578ee8>] skb_release_all+0x28/0x30
[ 1384.646836]  [<ffffffff81578f46>] __kfree_skb+0x16/0xa0
[ 1384.647111]  [<ffffffff815d35f0>] tcp_recvmsg+0x990/0xcc0
[ 1384.647387]  [<ffffffff815fb509>] inet_recvmsg+0x89/0xa0
[ 1384.647663]  [<ffffffff8156c3be>] sock_aio_read+0x13e/0x150
[ 1384.647940]  [<ffffffff811a89df>] do_sync_read+0x5f/0xa0
[ 1384.648212]  [<ffffffff811a8bed>] ? rw_verify_area+0x5d/0xe0
[ 1384.648483]  [<ffffffff811a8ef3>] vfs_read+0x113/0x130
[ 1384.648755]  [<ffffffff811a935f>] SyS_read+0x5f/0xb0
[ 1384.649027]  [<ffffffff8132e25e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[ 1384.649303]  [<ffffffff816496f2>] system_call_fastpath+0x16/0x1b
[ 1384.649574] Code: 42 1c e9 5b ff ff ff 4c 89 e7 e8 23 fe ff ff e9 ad fe ff 
ff 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 <66> f7 
07 00 c0 75 17 f0 ff 4f 1c 0f 94 c0 84 c0 75 05 c9 c3 0f 
[ 1384.653374] RIP  [<ffffffff8114ce99>] put_page+0x9/0x40
[ 1384.653694]  RSP <ffff8834b1eb7bc8>


or 

[818555.444138] general protection fault: 0000 [#1] SMP 
[818555.444293] Modules linked in: ixgbe tcp_diag inet_diag act_police 
cls_basic sch_ingress xt_LOG xt_limit xt_addrtype xt_pkttype xt_state veth 
netconsole openvswitch gre vxlan ip_tunnel xt_owner xt_conntrack iptable_mangle 
xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT 
nf_conntrack iptable_raw ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm 
ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core ext2 dm_thin_pool dm_bio_prison 
dm_persistent_data dm_bufio dm_mirror dm_region_hash dm_log ses enclosure 
i2c_i801 lpc_ich mfd_core igb i2c_algo_bit ioapic ipmi_devintf ipmi_si 
ipmi_msghandler ioatdma dca aacraid [last unloaded: ixgbe]
[818555.447546] CPU: 12 PID: 9179 Comm: nginx Not tainted 
3.12.49-clouder4-nproc #1
[818555.447604] Hardware name: Supermicro 
PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014
[818555.447667] task: ffff881b1d66e8d0 ti: ffff881ecbec8000 task.ti: 
ffff881ecbec8000
[818555.447724] RIP: 0010:[<ffffffff8157d5ec>]  [<ffffffff8157d5ec>] 
skb_copy_datagram_iovec+0x19c/0x2b0
[818555.447838] RSP: 0018:ffff881ecbec9b68  EFLAGS: 00010206
[818555.447892] RAX: 0000000000000000 RBX: 000000000000010a RCX: 
000000000000010a
[818555.447949] RDX: 000000000000010a RSI: 0000000000000000 RDI: 
ffff883bd56f3500
[818555.448006] RBP: ffff881ecbec9bc8 R08: 0000000000000000 R09: 
0000000000000000
[818555.448063] R10: ffff8832974df500 R11: 0000000000000000 R12: 
0000000000000000
[818555.448336] R13: 0731000007300000 R14: 0000000000000000 R15: 
0000000000000000
[818555.448609] FS:  00002ad13b927a80(0000) GS:ffff883fff080000(0000) 
knlGS:0000000000000000
[818555.448883] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[818555.449153] CR2: ffffffffff600400 CR3: 0000001ecf2ab000 CR4: 
00000000001407e0
[818555.449425] Stack:
[818555.449688]  ffff8832974df500 0000000000000040 ffff881ecbec9e48 
ffff881ecbec9d98
[818555.450148]  ffff881ecbec9e88 ffffffff00000000 000000000000000a 
ffff883bd56f3500
[818555.450605]  0000000000000000 ffff8832974df95c ffff8832974df570 
0000000000000000
[818555.451060] Call Trace:
[818555.451327]  [<ffffffff815d3cdb>] tcp_recvmsg+0x75b/0xcb0
[818555.451594]  [<ffffffff815fbe19>] inet_recvmsg+0x89/0xa0
[818555.451864]  [<ffffffff8156ece3>] sock_recvmsg+0xa3/0xd0
[818555.452138]  [<ffffffff811f0959>] ? ep_send_events_proc+0xa9/0x170
[818555.452412]  [<ffffffff8156edfe>] SYSC_recvfrom+0xee/0x170
[818555.452684]  [<ffffffff8156ee8e>] SyS_recvfrom+0xe/0x10
[818555.452957]  [<ffffffff81649ff2>] system_call_fastpath+0x16/0x1b
[818555.453227] Code: 0f b6 10 44 39 fa 0f 8f 23 ff ff ff 4c 8b 68 08 4d 85 ed 
74 5b 44 89 f0 0f 1f 80 00 00 00 00 42 8d 14 23 39 c2 0f 8c a6 00 00 00 <45> 8b 
7d 68 41 01 c7 45 89 fe 45 29 e6 45 85 f6 7e 27 41 39 de 
[818555.456980] RIP  [<ffffffff8157d5ec>] skb_copy_datagram_iovec+0x19c/0x2b0
[818555.457298]  RSP <ffff881ecbec9b68>


or

[409292.391088] ------------[ cut here ]------------
[409292.391369] kernel BUG at mm/slub.c:3336!
[409292.391640] invalid opcode: 0000 [#1] SMP 
[409292.392005] Modules linked in: tcp_diag inet_diag act_police cls_basic 
sch_ingress xt_LOG xt_limit xt_addrtype xt_pkttype xt_state veth netconsole 
openvswitch gre vxlan ip_tunnel xt_owner xt_conntrack iptable_mangle xt_nat 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT 
nf_conntrack iptable_raw ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm 
ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core ext2 dm_thin_pool dm_bio_prison 
dm_persistent_data dm_bufio dm_mirror dm_region_hash dm_log ses enclosure ixgbe 
i2c_i801 lpc_ich mfd_core igb i2c_algo_bit ioapic ipmi_devintf ipmi_si 
ipmi_msghandler ioatdma dca aacraid
[409292.396984] CPU: 10 PID: 37087 Comm: kworker/u80:0 Not tainted 
3.12.49-clouder4-nproc #1
[409292.397260] Hardware name: Supermicro 
PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0b 05/27/2014
[409292.397761] Workqueue: dm-thin do_worker [dm_thin_pool]
[409292.398077] task: ffff883f558c8810 ti: ffff882bf6ab2000 task.ti: 
ffff882bf6ab2000
[409292.398351] RIP: 0010:[<ffffffff8118dd62>]  [<ffffffff8118dd62>] 
kfree+0x172/0x180
[409292.398682] RSP: 0018:ffff883fff003cb0  EFLAGS: 00010246
[409292.398951] RAX: 06fc000000000000 RBX: ffff883f00000106 RCX: 
0000000000000001
[409292.399225] RDX: 0000000000000000 RSI: 0000003fac51189a RDI: 
ffffea00fc000000
[409292.399500] RBP: ffff883fff003cd0 R08: 0000000000000000 R09: 
0000000000000003
[409292.399771] R10: 0000000000000003 R11: ffff883fff003e68 R12: 
ffff883f000003c6
[409292.400044] R13: ffffffff8157967e R14: ffff881fce14d130 R15: 
ffff883fd2ebb140
[409292.400316] FS:  0000000000000000(0000) GS:ffff883fff000000(0000) 
knlGS:0000000000000000
[409292.400590] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[409292.400862] CR2: 0000000000f15720 CR3: 0000001c86ecb000 CR4: 
00000000001407e0
[409292.401137] Stack:
[409292.401402]  ffff881d12fdecc0 ffff881cae745700 ffff883f000003c6 
ffff881fce14d120
[409292.401866]  ffff883fff003ce0 ffffffff8157967e ffff883fff003d10 
ffffffff8157979c
[409292.402329]  0000000000000000 ffff881cae745700 ffff881cae745700 
ffff881fce14d120
[409292.402792] Call Trace:
[409292.403057]  <IRQ> 
[409292.403107] 
[409292.403422]  [<ffffffff8157967e>] skb_free_head+0x1e/0x80
[409292.403692]  [<ffffffff8157979c>] skb_release_data+0xbc/0x100
[409292.403963]  [<ffffffff81579808>] skb_release_all+0x28/0x30
[409292.404233]  [<ffffffff81579866>] __kfree_skb+0x16/0xa0
[409292.404507]  [<ffffffff81579c31>] consume_skb+0x31/0x90
[409292.404780]  [<ffffffff815850dd>] dev_kfree_skb_any+0x3d/0x50
[409292.405055]  [<ffffffffa00b311c>] ixgbe_poll+0xec/0x6b0 [ixgbe]
[409292.405328]  [<ffffffff8158ae1c>] net_rx_action+0x12c/0x280
[409292.405604]  [<ffffffff8108ef77>] __do_softirq+0x137/0x2e0
[409292.405878]  [<ffffffff8164b78c>] call_softirq+0x1c/0x30
[409292.406153]  [<ffffffff8104a35d>] do_softirq+0x8d/0xc0
[409292.406426]  [<ffffffff8108eb15>] irq_exit+0x95/0xa0
[409292.406695]  [<ffffffff8164bcf6>] do_IRQ+0x66/0xe0
[409292.406968]  [<ffffffff8164956f>] common_interrupt+0x6f/0x6f
[409292.407241]  <EOI> 
[409292.407290] 
[409292.407603]  [<ffffffff81141107>] ? mempool_free_slab+0x17/0x20
[409292.407879]  [<ffffffffa014a264>] ? do_worker+0xd4/0x270 [dm_thin_pool]
[409292.408154]  [<ffffffffa014a281>] ? do_worker+0xf1/0x270 [dm_thin_pool]
[409292.408432]  [<ffffffff810a64d5>] process_one_work+0x195/0x550
[409292.408700]  [<ffffffff810a877a>] worker_thread+0x13a/0x430
[409292.408967]  [<ffffffff810a8640>] ? manage_workers+0x2c0/0x2c0
[409292.409235]  [<ffffffff810ae77e>] kthread+0xce/0xe0
[409292.409500]  [<ffffffff810ae6b0>] ? kthread_freezable_should_stop+0x80/0x80
[409292.409770]  [<ffffffff81649f48>] ret_from_fork+0x58/0x90
[409292.410038]  [<ffffffff810ae6b0>] ? kthread_freezable_should_stop+0x80/0x80
[409292.410308] Code: 2a 48 8b 07 31 f6 f6 c4 40 74 03 8b 77 68 e8 16 b1 fb ff 
e9 73 ff ff ff 48 8b 47 30 48 8b 17 66 85 d2 48 0f 48 f8 e9 fe fe ff ff <0f> 0b 
eb fe 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 55 41 
[409292.414104] RIP  [<ffffffff8118dd62>] kfree+0x172/0x180
[409292.414423]  RSP <ffff883fff003cb0>


All of these are actually due to memory corruption of the skb->head pointer.
Strangely enough, collating the information from the various crashes reveals
the following pattern:
sk_buff.head = ffff883700000106
sk_buff.head = ffff883400000106
sk_buff.head = ffff883500000106
sk_buff.head = ffff883c00000106
sk_buff.head = ffff883800000106
sk_buff.head = ffff883f00000106
sk_buff.head = ffff883100000106
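
To make the pattern explicit, here is a minimal sketch in Python that splits
each value into its upper and lower 32 bits. The helper names are made up for
illustration, and the reading of the result is only an assumption: on a
little-endian x86-64 machine, an identical lower half on every pointer would be
consistent with a 4-byte store of 0x00000106 landing on the low half of
skb->head, but the script itself proves nothing about where such a write comes
from.

```python
# Corrupted sk_buff.head values copied from the crash dumps listed above.
HEADS = [
    0xffff883700000106,
    0xffff883400000106,
    0xffff883500000106,
    0xffff883c00000106,
    0xffff883800000106,
    0xffff883f00000106,
    0xffff883100000106,
]


def split_halves(ptr):
    """Return (upper 32 bits, lower 32 bits) of a 64-bit value."""
    return ptr >> 32, ptr & 0xFFFFFFFF


def common_low_word(ptrs):
    """Return the shared lower 32 bits if every value agrees, else None."""
    lows = {split_halves(p)[1] for p in ptrs}
    return lows.pop() if len(lows) == 1 else None


if __name__ == "__main__":
    for p in HEADS:
        hi, lo = split_halves(p)
        print(f"{p:#018x}  hi={hi:#010x}  lo={lo:#010x}")
    low = common_low_word(HEADS)
    if low is not None:
        print(f"all values share the lower 32 bits: {low:#010x}")
```

The upper halves still look like plausible kernel addresses and differ from
crash to crash, while the lower half is the same in all seven cases.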

I obtained these values by manually reading the assembly that led to each
crash, and I have also extracted the respective SKBs. At first I thought this
could be a memory problem, but replacing all of the memory banks didn't change
the situation. I initially observed the crashes on a 3.12.47 kernel and later
updated to 3.12.49, with no change. I have tested the 4.1.5 out-of-tree ixgbe
driver (from the sf.net page) with 3.12.49, and the crashes still occur. I have
also reverted to 3.23 (still an out-of-tree driver), again with no change. I've
yet to test with the stock driver, which I believe is 3.15.1-k for kernel
3.12.49. At this point I'm inclined to believe that this might be due to a
driver bug which corrupts the SKB and eventually leads to the aforementioned
crashes.

Any ideas or pointers on how I might proceed to find the root cause of these
crashes? I'm happy to provide any information you might consider important.
What puzzles me is why, out of 7 servers with similar hardware, only 2 are
exhibiting this behavior.

Regards, 
Nikolay

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel