Bad news: It happened again today:

> Jun 9 14:40:09 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x000e address=0x00000000000178c0 flags=0x0050]
> Jun 9 14:40:09 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x000e address=0x0000000000017900 flags=0x0050]
> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit Hang#012 Tx Queue <2>#012 TDH, TDT <186>, <194>#012 next_to_use <194>#012 next_to_clean <186>#012tx_buffer_info[next_to_clean]#012 time_stamp <11df79bf7>#012 jiffies <11df7aac8>
> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit Hang#012 Tx Queue <3>#012 TDH, TDT <1e4>, <2>#012 next_to_use <2>#012 next_to_clean <1e4>#012tx_buffer_info[next_to_clean]#012 time_stamp <11df79a0f>#012 jiffies <11df7aac8>
> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 1 detected on queue 3, resetting adapter
> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit Hang#012 Tx Queue <24>#012 TDH, TDT <1ec>, <2>#012 next_to_use <2>#012 next_to_clean <1ec>#012tx_buffer_info[next_to_clean]#012 time_stamp <11df79a0f>#012 jiffies <11df7aac8>
> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: initiating reset due to tx timeout
> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 1 detected on queue 24, resetting adapter
> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: initiating reset due to tx timeout
> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Reset adapter
> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 2 detected on queue 2, resetting adapter
> Jun 9 14:40:14 computer kernel: ixgbe 0000:04:00.0: master disable timed out
...
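As an aside, a small watcher along these lines could grab the "lspci -vvv"
dump automatically the moment such a fault is logged (untested sketch; the
match patterns, output path and use of journalctl are my assumptions):

  #!/bin/sh
  # Follow the kernel log and snapshot lspci -vvv for the NIC as soon as an
  # AMD-Vi page fault or ixgbe Tx hang for device 04:00.0 shows up.
  # (journalctl -kf assumes systemd; tail -F on the syslog file works as well.)
  journalctl -kf | while read -r line; do
      case "$line" in
          *IO_PAGE_FAULT*device=04:00.0*|*"ixgbe 0000:04:00.0"*"Detected Tx Unit Hang"*)
              lspci -vvv -s 04:00.0 > "/tmp/lspci-04_00_0.$(date +%s).txt"
              ;;
      esac
  done

That way a dump exists from right after the fault, before the module gets
reloaded.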
And today, no other NIC connected to the same switch saw any "glitch".

I got you an "lspci -vvv" output; interestingly, a
"pcilib: sysfs_read_vpd: read failed: Input/output error"
message is reported while lspci is emitting data on the NIC:

> 04:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
>         Subsystem: Intel Corporation Ethernet Converged Network Adapter X540-T1
>         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort+ <MAbort- >SERR- <PERR- INTx-
>         Latency: 0, Cache Line Size: 64 bytes
>         Interrupt: pin A routed to IRQ 59
>         Region 0: Memory at dce00000 (64-bit, prefetchable) [size=2M]
>         Region 4: Memory at dcdfc000 (64-bit, prefetchable) [size=16K]
>         Expansion ROM at dfd80000 [disabled] [size=512K]
>         Capabilities: [40] Power Management version 3
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
>                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>         Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
>                 Address: 0000000000000000  Data: 0000
>                 Masking: 00000000  Pending: 00000000
>         Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
>                 Vector table: BAR=4 offset=00000000
>                 PBA: BAR=4 offset=00002000
>         Capabilities: [a0] Express (v2) Endpoint, MSI 00
>                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
>                 DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
>                         RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
>                 LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <1us, L1 <8us
>                         ClockPM- Surprise- LLActRep- BwNot-
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
>                 LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
>                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>                          Compliance De-emphasis: -6dB
>                 LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
>                          EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>         Capabilities: [100 v2] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>                 AERCap: First Error Pointer: 00, Gpcilib: sysfs_read_vpd: read failed: Input/output error
> enCap+ CGenEn- ChkCap+ ChkEn-
>         Capabilities: [140 v1] Device Serial Number a0-36-9f-ff-ff-80-xx-xx
>         Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
>                 ARICap: MFVC- ACS-, Next Function: 0
>                 ARICtl: MFVC- ACS-, Function Group: 0
>         Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
>                 IOVCap: Migration-, Interrupt Message Number: 000
>                 IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
>                 IOVSta: Migration-
>                 Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
>                 VF offset: 128, stride: 2, Device ID: 1515
>                 Supported Page Size: 00000553, System Page Size: 00000001
>                 Region 0: Memory at 0000000000000000 (64-bit, non-prefetchable)
>                 Region 3: Memory at 0000000000000000 (64-bit, non-prefetchable)
>                 VF Migration: offset: 00000000, BIR: 0
>         Capabilities: [1d0 v1] Access Control Services
>                 ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
>                 ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
>         Kernel driver in use: ixgbe

This time I'll reboot the machine, and also try "iommu=pt" as suggested
in different places for use with 10G NICs (a rough sketch of that boot
parameter change is appended below the quoted thread).

Regards,

Lutz Vieweg

On 06/07/2016 05:47 PM, Alexander Duyck wrote:
> On Tue, Jun 7, 2016 at 2:35 AM, Lutz Vieweg <l...@5t9.de> wrote:
>> On 06/06/2016 11:52 PM, Alexander Duyck wrote:
>>> On Mon, Jun 6, 2016 at 2:26 PM, Lutz Vieweg <l...@5t9.de> wrote:
>>>> After updating a server with an Intel 10Gbase-T NIC from linux-4.4.1 to
>>>> linux-4.6.1 (vanilla, stable) we experienced (after ~2 days of operation)
>>>> the following bug:
>>>>
>>>> Jun 6 19:09:31 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT
>>>> device=04:00.0 domain=0x000e address=0x000000001004ecc0 flags=0x0050]
>>>> Jun 6 19:09:31 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT
>>>> device=04:00.0 domain=0x000e address=0x000000001004ed00 flags=0x0050]
>>>> Jun 6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit
>>>> Hang#012 Tx Queue <3>#012 TDH, TDT <1ce>, <1e6>#012 next_to_use <1e6>#012
>>>> next_to_clean <1ce>#012tx_buffer_info[next_to_clean]#012 time_stamp
>>>> <10f7b215d>#012 jiffies <10f7b3244>
>> ...
>>>> The ixgbe module was not able to restore the link after this, only "rmmod"
>>>> plus new initialization of the interface restored connectivity.
>>>>
>>>> Any idea what's going wrong, here?
>>
>>> There could be a number of things going on here.  Based on the offset
>>> of the fault it looks like an error on either a descriptor ring or a Tx
>>> read, since the Rx should be 2K aligned, resulting in write offsets
>>> that are no less than 128-byte aligned.
>>>
>>> One thing that might be useful would be to provide an lspci -vvv dump
>>> for the system just after the error has occurred.  It is possible that
>>> there may be additional data available in the advanced error reporting
>>> registers.
>>
>> I will record a "lspci -vvv" dump of the X540-AT2 NIC if the error
>> reoccurs. (When I run "lspci -vvv" now it does not seem to output
>> anything resembling an error reporting register - but of course the
>> ixgbe module was reloaded after the error.)
>>
>>> Also it might be useful to try and determine reproduction steps for
>>> this.  If you can narrow down what is done to trigger this error it
>>> would be easier for us to figure out what is causing it.
>>
>> That could be difficult... since the error occurred after ~2 days of
>> continuous operation.
>>
>> But I did find one more interesting observation: One other server,
>> which is connected to the same 10Gbase-T switch, using an Intel 82598EB
>> NIC, experienced a two-second link outage, 28 minutes after the
>> incident reported above:
>>
>>> Jun 6 19:37:39 computer2 kernel: ixgbe 0000:04:00.0 enp4s0: NIC Link is Down
>>> Jun 6 19:37:41 computer2 kernel: ixgbe 0000:04:00.0 enp4s0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
>>
>> Nothing else happened on that server. And three more servers that are
>> also connected to the same 10Gbase-T switch experienced no link outage
>> at all.
>> A grep through the log archives of all the servers connected to that
>> switch tells me that no other "link is down" events occurred during the
>> last year.
>>
>> So it might be that the error I reported is somehow triggered only
>> when certain network glitches occur.
>
> Right.  There are a few glitches that could occur that are completely
> out of our control such as cosmic rays and the like.  Depending on how
> much control you have over the data center it also doesn't hurt to
> make sure all systems are in a solid case and properly grounded to
> prevent any stray static charge buildup.  It's always possible it
> could be something like that but if that is the case then the same
> symptoms will likely not occur.
>
> It would probably be best to just keep an eye on this for now and if
> the issue doesn't reoccur then we are likely looking at something that
> isn't actually related to the hardware itself.
>
> - Alex
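P.S.: For reference, the "iommu=pt" change mentioned above amounts to
roughly the following on a GRUB-based system (a sketch only; the config
path and the mkconfig command vary between distributions):

  # /etc/default/grub: append iommu=pt to the existing kernel command line.
  # iommu=pt keeps the IOMMU enabled but uses an identity ("passthrough")
  # mapping for host-owned devices, so the NIC's DMA is not remapped per buffer.
  GRUB_CMDLINE_LINUX="... iommu=pt"

  # Regenerate the GRUB configuration (update-grub on Debian/Ubuntu) and reboot:
  grub2-mkconfig -o /boot/grub2/grub.cfg
  reboot

  # After the reboot, confirm the parameter is active:
  cat /proc/cmdline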