On Thu, Jun 9, 2016 at 7:48 AM, Lutz Vieweg <l...@5t9.de> wrote:
> Bad news: It happened again today:
>> Jun 9 14:40:09 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT
>> device=04:00.0 domain=0x000e address=0x00000000000178c0 flags=0x0050]
>> Jun 9 14:40:09 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT
>> device=04:00.0 domain=0x000e address=0x0000000000017900 flags=0x0050]
>> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit
>> Hang#012 Tx Queue <2>#012 TDH, TDT <186>,
>> <194>#012 next_to_use <194>#012 next_to_clean
>> <186>#012tx_buffer_info[next_to_clean]#012 time_stamp
>> <11df79bf7>#012 jiffies <11df7aac8>
>> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit
>> Hang#012 Tx Queue <3>#012 TDH, TDT <1e4>, <2>#012
>> next_to_use <2>#012 next_to_clean
>> <1e4>#012tx_buffer_info[next_to_clean]#012 time_stamp
>> <11df79a0f>#012 jiffies <11df7aac8>
>> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 1
>> detected on queue 3, resetting adapter
>> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit
>> Hang#012 Tx Queue <24>#012 TDH, TDT <1ec>, <2>#012
>> next_to_use <2>#012 next_to_clean
>> <1ec>#012tx_buffer_info[next_to_clean]#012 time_stamp
>> <11df79a0f>#012 jiffies <11df7aac8>
>> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: initiating reset
>> due to tx timeout
>> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 1
>> detected on queue 24, resetting adapter
>> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: initiating reset
>> due to tx timeout
>> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Reset adapter
>> Jun 9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 2
>> detected on queue 2, resetting adapter
>> Jun 9 14:40:14 computer kernel: ixgbe 0000:04:00.0: master disable timed out
> ...
>
> And today, no other NIC connected to the same switch saw any "glitch".
>
> I got you an "lspci -vvv" output, however, some interesting
>> "pcilib: sysfs_read_vpd: read failed: Input/output error"
> message is reported while lspci is emitting data on the NIC:
>
>> 04:00.0 Ethernet controller: Intel Corporation Ethernet Controller
>> 10-Gigabit X540-AT2 (rev 01)
>> Subsystem: Intel Corporation Ethernet Converged Network Adapter
>> X540-T1
>> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
>> Stepping- SERR+ FastB2B- DisINTx+
>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>> <TAbort+ <MAbort- >SERR- <PERR- INTx-
>> Latency: 0, Cache Line Size: 64 bytes
>> Interrupt: pin A routed to IRQ 59
>> Region 0: Memory at dce00000 (64-bit, prefetchable) [size=2M]
>> Region 4: Memory at dcdfc000 (64-bit, prefetchable) [size=16K]
>> Expansion ROM at dfd80000 [disabled] [size=512K]
>> Capabilities: [40] Power Management version 3
>> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
>> PME(D0+,D1-,D2-,D3hot+,D3cold-)
>> Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>> Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
>> Address: 0000000000000000 Data: 0000
>> Masking: 00000000 Pending: 00000000
>> Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
>> Vector table: BAR=4 offset=00000000
>> PBA: BAR=4 offset=00002000
>> Capabilities: [a0] Express (v2) Endpoint, MSI 00
>> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
>> <512ns, L1 <64us
>> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
>> DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+
>> Unsupported+
>> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
>> MaxPayload 128 bytes, MaxReadReq 512 bytes
>> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr-
>> TransPend-
>> LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit
>> Latency L0s <1us, L1 <8us
>> ClockPM- Surprise- LLActRep- BwNot-
>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>> LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+
>> DLActive- BWMgmt- ABWMgmt-
>> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-,
>> OBFF Not Supported
>> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-,
>> LTR-, OBFF Disabled
>> LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
>> Transmit Margin: Normal Operating Range,
>> EnterModifiedCompliance- ComplianceSOS-
>> Compliance De-emphasis: -6dB
>> LnkSta2: Current De-emphasis Level: -3.5dB,
>> EqualizationComplete-, EqualizationPhase1-
>> EqualizationPhase2-, EqualizationPhase3-,
>> LinkEqualizationRequest-
>> Capabilities: [100 v2] Advanced Error Reporting
>> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
>> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
>> NonFatalErr-
>> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
>> NonFatalErr+
>> AERCap: First Error Pointer: 00, Gpcilib: sysfs_read_vpd:
>> read failed: Input/output error
>> enCap+ CGenEn- ChkCap+ ChkEn-
>> Capabilities: [140 v1] Device Serial Number a0-36-9f-ff-ff-80-xx-xx
>> Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
>> ARICap: MFVC- ACS-, Next Function: 0
>> ARICtl: MFVC- ACS-, Function Group: 0
>> Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
>> IOVCap: Migration-, Interrupt Message Number: 000
>> IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
>> IOVSta: Migration-
>> Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function
>> Dependency Link: 00
>> VF offset: 128, stride: 2, Device ID: 1515
>> Supported Page Size: 00000553, System Page Size: 00000001
>> Region 0: Memory at 0000000000000000 (64-bit,
>> non-prefetchable)
>> Region 3: Memory at 0000000000000000 (64-bit,
>> non-prefetchable)
>> VF Migration: offset: 00000000, BIR: 0
>> Capabilities: [1d0 v1] Access Control Services
>> ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir-
>> UpstreamFwd- EgressCtrl- DirectTrans-
>> ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir-
>> UpstreamFwd- EgressCtrl- DirectTrans-
>> Kernel driver in use: ixgbe
>
> This time I'll reboot the machine, and also try "iommu=pt" as suggested
> in different places for use with 10G NICs.
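
After the reboot it may be worth confirming that "iommu=pt" actually ended up on the kernel command line. The following is only a rough sketch of such a check: the function name iommu_options is made up for illustration, and the set of keys it looks for (iommu, amd_iommu, intel_iommu) is my assumption about which boot parameters are of interest here, not something taken from the report above.

```python
def iommu_options(cmdline: str) -> dict:
    """Return the IOMMU-related key=value options found in a kernel
    command line string (hypothetical helper, for illustration only)."""
    # Assumed set of interesting boot parameters; extend as needed.
    wanted = ("iommu", "amd_iommu", "intel_iommu")
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Only keep tokens of the form key=value with a key we care about.
        if sep and key in wanted:
            found[key] = value
    return found
```

On the affected machine this could be called as iommu_options(open("/proc/cmdline").read()); if the result does not contain an "iommu" entry with the value "pt", the option did not take effect for the running kernel.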

That might be a good place to start. I'm adding, or at least attempting to add, the mailing list and maintainer for the IOMMU code.

You might want to check with the AMD-Vi IOMMU maintainers to see if they have any other advice. This looks like something that may have been introduced by changes to the IOMMU code: the ixgbe driver hasn't had any updates to its DMA mapping/unmapping code in some time, it was working in the 4.4 kernel series, and it still works on my system, which runs an Intel IOMMU. So I am wondering whether this is specifically related to changes in the AMD IOMMU code.

- Alex

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired