Bad news: It happened again today:
> Jun  9 14:40:09 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x000e address=0x00000000000178c0 flags=0x0050]
> Jun  9 14:40:09 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x000e address=0x0000000000017900 flags=0x0050]
> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit Hang
>   Tx Queue             <2>
>   TDH, TDT             <186>, <194>
>   next_to_use          <194>
>   next_to_clean        <186>
> tx_buffer_info[next_to_clean]
>   time_stamp           <11df79bf7>
>   jiffies              <11df7aac8>
> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit Hang
>   Tx Queue             <3>
>   TDH, TDT             <1e4>, <2>
>   next_to_use          <2>
>   next_to_clean        <1e4>
> tx_buffer_info[next_to_clean]
>   time_stamp           <11df79a0f>
>   jiffies              <11df7aac8>
> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 1 detected on queue 3, resetting adapter
> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit Hang
>   Tx Queue             <24>
>   TDH, TDT             <1ec>, <2>
>   next_to_use          <2>
>   next_to_clean        <1ec>
> tx_buffer_info[next_to_clean]
>   time_stamp           <11df79a0f>
>   jiffies              <11df7aac8>
> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: initiating reset due to tx timeout
> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 1 detected on queue 24, resetting adapter
> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: initiating reset due to tx timeout
> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: Reset adapter
> Jun  9 14:40:13 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 2 detected on queue 2, resetting adapter
> Jun  9 14:40:14 computer kernel: ixgbe 0000:04:00.0: master disable timed out
...

And today, no other NIC connected to the same switch saw any "glitch".
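
For what it's worth, applying your earlier reasoning about the fault offset to today's
addresses (assuming the offset within a 2K-aligned block is what matters): 0x178c0
falls at offset 0xc0 and 0x17900 at offset 0x100, i.e. the two faults are exactly
64 bytes apart and the first one is only 64-byte aligned, so at least that one does
not look like an Rx write either.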

I got you an "lspci -vvv" output; interestingly, a
> "pcilib: sysfs_read_vpd: read failed: Input/output error"
message was printed while lspci was emitting the data for the NIC:

> 04:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit 
> X540-AT2 (rev 01)
>         Subsystem: Intel Corporation Ethernet Converged Network Adapter 
> X540-T1
>         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR+ FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> <TAbort+ <MAbort- >SERR- <PERR- INTx-
>         Latency: 0, Cache Line Size: 64 bytes
>         Interrupt: pin A routed to IRQ 59
>         Region 0: Memory at dce00000 (64-bit, prefetchable) [size=2M]
>         Region 4: Memory at dcdfc000 (64-bit, prefetchable) [size=16K]
>         Expansion ROM at dfd80000 [disabled] [size=512K]
>         Capabilities: [40] Power Management version 3
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
> PME(D0+,D1-,D2-,D3hot+,D3cold-)
>                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>         Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
>                 Address: 0000000000000000  Data: 0000
>                 Masking: 00000000  Pending: 00000000
>         Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
>                 Vector table: BAR=4 offset=00000000
>                 PBA: BAR=4 offset=00002000
>         Capabilities: [a0] Express (v2) Endpoint, MSI 00
>                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s 
> <512ns, L1 <64us
>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
>                 DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ 
> Unsupported+
>                         RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- 
> TransPend-
>                 LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit 
> Latency L0s <1us, L1 <8us
>                         ClockPM- Surprise- LLActRep- BwNot-
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ 
> DLActive- BWMgmt- ABWMgmt-
>                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, 
> OBFF Not Supported
>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, 
> OBFF Disabled
>                 LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
>                          Transmit Margin: Normal Operating Range, 
> EnterModifiedCompliance- ComplianceSOS-
>                          Compliance De-emphasis: -6dB
>                 LnkSta2: Current De-emphasis Level: -3.5dB, 
> EqualizationComplete-, EqualizationPhase1-
>                          EqualizationPhase2-, EqualizationPhase3-, 
> LinkEqualizationRequest-
>         Capabilities: [100 v2] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- 
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- 
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- 
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
> NonFatalErr-
>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
> NonFatalErr+
>                 AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> pcilib: sysfs_read_vpd: read failed: Input/output error
>         Capabilities: [140 v1] Device Serial Number a0-36-9f-ff-ff-80-xx-xx
>         Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
>                 ARICap: MFVC- ACS-, Next Function: 0
>                 ARICtl: MFVC- ACS-, Function Group: 0
>         Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
>                 IOVCap: Migration-, Interrupt Message Number: 000
>                 IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
>                 IOVSta: Migration-
>                 Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function 
> Dependency Link: 00
>                 VF offset: 128, stride: 2, Device ID: 1515
>                 Supported Page Size: 00000553, System Page Size: 00000001
>                 Region 0: Memory at 0000000000000000 (64-bit, 
> non-prefetchable)
>                 Region 3: Memory at 0000000000000000 (64-bit, 
> non-prefetchable)
>                 VF Migration: offset: 00000000, BIR: 0
>         Capabilities: [1d0 v1] Access Control Services
>                 ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- 
> UpstreamFwd- EgressCtrl- DirectTrans-
>                 ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- 
> UpstreamFwd- EgressCtrl- DirectTrans-
>         Kernel driver in use: ixgbe
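
If it helps for next time, something along these lines could grab that dump
automatically, right after the hang is logged and before the state may get cleared
by a reset or a driver reload. (Just a sketch: it assumes journalctl is available,
needs to run as root, and the output path is only an example.)

  #!/bin/sh
  # Follow the kernel log and snapshot the NIC's PCI config space (including
  # the AER status shown above) as soon as a Tx hang or IOMMU fault appears.
  journalctl -kf |
    grep --line-buffered -E 'Detected Tx Unit Hang|IO_PAGE_FAULT' |
    while read -r line; do
      {
        date
        echo "$line"
        lspci -vvv -s 04:00.0
      } >> /var/tmp/ixgbe-hang.txt
    done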

This time I'll reboot the machine and also try "iommu=pt", which is suggested
in various places for use with 10G NICs.
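
(For reference, on a GRUB2-based setup that would look roughly like the lines below;
the "..." stands for whatever options are already configured, and the command and
config paths vary by distribution, e.g. "update-grub" on Debian-based systems.
After the reboot, "cat /proc/cmdline" shows whether the option took effect.)

  # /etc/default/grub: append iommu=pt to the existing kernel command line
  GRUB_CMDLINE_LINUX="... iommu=pt"

  # then regenerate the GRUB configuration and reboot
  grub2-mkconfig -o /boot/grub2/grub.cfg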

Regards,

Lutz Vieweg



On 06/07/2016 05:47 PM, Alexander Duyck wrote:
> On Tue, Jun 7, 2016 at 2:35 AM, Lutz Vieweg <l...@5t9.de> wrote:
>> On 06/06/2016 11:52 PM, Alexander Duyck wrote:
>>> On Mon, Jun 6, 2016 at 2:26 PM, Lutz Vieweg <l...@5t9.de> wrote:
>>>> After updating a server with an Intel 10Gbase-T NIC from linux-4.4.1 to
>>>> linux-4.6.1 (vanilla, stable) we experienced (after ~2 days of operation)
>>>> the following bug:
>>>>
>>>> Jun  6 19:09:31 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x000e address=0x000000001004ecc0 flags=0x0050]
>>>> Jun  6 19:09:31 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x000e address=0x000000001004ed00 flags=0x0050]
>>>> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit Hang
>>>>   Tx Queue             <3>
>>>>   TDH, TDT             <1ce>, <1e6>
>>>>   next_to_use          <1e6>
>>>>   next_to_clean        <1ce>
>>>> tx_buffer_info[next_to_clean]
>>>>   time_stamp           <10f7b215d>
>>>>   jiffies              <10f7b3244>
>> ...
>>>> The ixgbe module was not able to restore the link after this; only "rmmod"
>>>> plus re-initialization of the interface restored connectivity.
>>>>
>>>> Any idea what's going wrong, here?
>>
>>> There could be a number of things going on here.  Based on the offset
>>> of the fault it looks like an error on either a descriptor ring or a Tx
>>> read, since the Rx buffers should be 2K aligned, resulting in write
>>> offsets that are no less than 128-byte aligned.
>>>
>>> One thing that might be useful would be to provide an lspci -vvv dump
>>> for the system just after the error has occurred.  It is possible that
>>> there may be additional data available in the advanced error reporting
>>> registers.
>>
>> I will record a "lspci -vvv" dump of the X540-AT2 NIC if the error
>> reoccurs. (When I run "lspci -vvv" now it does not seem to output
>> anything resembling an error reporting register - but
>> of course the ixgbe module was reloaded after the error.)
>>
>>> Also it might be useful to try and determine reproduction
>>> steps for this.  If you can narrow down what is done to trigger this
>>> error it would be easier for us to figure out what is causing it.
>>
>> That could be difficult... since the error occurred after ~2 days
>> of continuous operation, and only once so far.
>>
>> But I did make one more interesting observation: one other server,
>> which is connected to the same 10Gbase-T switch using an Intel 82598EB
>> NIC, experienced a two-second link outage 28 minutes after the
>> incident reported above:
>>
>>> Jun  6 19:37:39 computer2 kernel: ixgbe 0000:04:00.0 enp4s0: NIC Link is Down
>>> Jun  6 19:37:41 computer2 kernel: ixgbe 0000:04:00.0 enp4s0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
>>
>> Nothing else happened on that server. And three more servers that are
>> also connected to the same 10Gbase-T switch experienced no link outage
>> at all.
>> A grep through the log archives of all the servers connected to that
>> switch tells me that no other "link is down" events occurred during the
>> last year.
>>
>> So it might be that the error I reported is somehow triggered only
>> when certain network glitches occur.
>
> Right.  There are a few glitches that could occur that are completely
> out of our control such as cosmic rays and the like.  Depending on how
> much control you have over the data center it also doesn't hurt to
> make sure all systems are in a solid case and properly grounded to
> prevent any stray static charge buildup.  It's always possible it
> could be something like that, but if that is the case then the same
> symptoms will likely not recur.
>
> It would probably be best to just keep an eye on this for now and if
> the issue doesn't reoccur then we are likely looking at something that
> isn't actually related to the hardware itself.
>
> - Alex
>

