Re: [E1000-devel] AMD-Vi: Event logged IO_PAGE_FAULT - ixgbe Detected Tx Unit Hang - Reset adapter - master disable timed out

Alexander Duyck Tue, 07 Jun 2016 08:49:22 -0700

On Tue, Jun 7, 2016 at 2:35 AM, Lutz Vieweg <l...@5t9.de> wrote:
> On 06/06/2016 11:52 PM, Alexander Duyck wrote:
>> On Mon, Jun 6, 2016 at 2:26 PM, Lutz Vieweg <l...@5t9.de> wrote:
>>> After updating a server with an Intel 10Gbase-T NIC from linux-4.4.1 to
>>> linux-4.6.1 (vanilla, stable) we experienced (after ~2 days of operation)
>>> the following bug:
>>>
>>> Jun  6 19:09:31 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT
>>> device=04:00.0 domain=0x000e address=0x000000001004ecc0 flags=0x0050]
>>> Jun  6 19:09:31 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT
>>> device=04:00.0 domain=0x000e address=0x000000001004ed00 flags=0x0050]
>>> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit
>>> Hang#012  Tx Queue             <3>#012  TDH, TDT             <1ce>,
>>> <1e6>#012  next_to_use          <1e6>#012  next_to_clean
>>> <1ce>#012tx_buffer_info[next_to_clean]#012  time_stamp
>>> <10f7b215d>#012  jiffies              <10f7b3244>
> ...
>>> The ixgbe module was not able to restore the link after this, only "rmmod"
>>> plus new initialization of the interface restored connectivity.
>>>
>>> Any idea what's going wrong, here?
>
>> There could be a number of things going on here.  Based on the offset
>> of the fault it looks like an error on either a descriptor ring or Tx
>> read since the Rx should be 2K aligned resulting in write offsets
>> that are no less than 128 byte aligned.
>>
>> One thing that might be useful would be to provide an lspci -vvv dump
>> for the system just after the error has occurred.  It is possible that
>> there may be additional data available in the advanced error reporting
>> registers.
>
> I will record a "lspci -vvv" dump of the X540-AT2 NIC if the error
> reoccurs. (When I run "lspci -vvv" now it does not seem to output
> anything resembling an error reporting register - but
> of course the ixgbe module was reloaded after the error.)
>
>> Also it might be useful to try and determine reproduction
>> steps for this.  If you can narrow down what is done to trigger this
>> error it would be easier for us to figure out what is causing it.
>
> That could be difficult... since the error occurred after ~2 days
> of continuous operation, and only once so far.
>
> But I did find one more interesting observation: One other server,
> which is connected to the same 10Gbase-T switch, using an Intel 82598EB
> NIC, experienced a two second link outage, 28 minutes after the
> incident reported above:
>
>> Jun  6 19:37:39 computer2 kernel: ixgbe 0000:04:00.0 enp4s0: NIC Link is Down
>> Jun  6 19:37:41 computer2 kernel: ixgbe 0000:04:00.0 enp4s0: NIC Link is Up 
>> 10 Gbps, Flow Control: RX/TX
>
> Nothing else happened on that server. And three more servers that are
> also connected to the same 10Gbase-T switch experienced no link outage
> at all.
> A grep through the log archives of all the servers connected to that
> switch tells me that no other "link is down" events occurred during the
> last year.
>
> So it might be that the error I reported is somehow triggered only
> when certain network glitches occur.


Right.  There are a few glitches that could occur that are completely
out of our control such as cosmic rays and the like.  Depending on how
much control you have over the data center it also doesn't hurt to
make sure all systems are in a solid case and properly grounded to
prevent any stray static charge buildup.  It's always possible it
could be something like that, but if that is the case then the same
symptoms will likely not recur.

It would probably be best to just keep an eye on this for now and if
the issue doesn't reoccur then we are likely looking at something that
isn't actually related to the hardware itself.
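
For anyone who wants to check the alignment argument quoted above
against the logged fault addresses, here is a minimal sketch in Python.
It is only an illustration: the two log lines are the ones from the
report (rejoined onto single lines), while the regular expression and
the helper names fault_addresses() and alignment() are made up for this
example and are not part of any driver or tool.

```python
import re

# The two fault lines from the report above, rejoined onto single lines.
LOG = """\
AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x000e address=0x000000001004ecc0 flags=0x0050]
AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x000e address=0x000000001004ed00 flags=0x0050]
"""

# Hypothetical pattern for this log format; adjust if your kernel prints it differently.
FAULT_RE = re.compile(
    r"IO_PAGE_FAULT\s+device=(\S+)\s+domain=(0x[0-9a-fA-F]+)\s+address=(0x[0-9a-fA-F]+)"
)


def fault_addresses(text):
    """Return the faulting DMA addresses found in text, as integers."""
    return [int(m.group(3), 16) for m in FAULT_RE.finditer(text)]


def alignment(addr):
    """Largest power of two that divides addr (addr must be non-zero)."""
    return addr & -addr


for addr in fault_addresses(LOG):
    # 2048 is the 2K Rx buffer alignment mentioned in the quoted reply.
    print(f"{addr:#x}: aligned to {alignment(addr)} bytes, "
          f"offset {addr % 2048} within a 2K-aligned region")
```

The first address comes out as only 64-byte aligned, which is below the
128-byte alignment expected for Rx writes in the reasoning above.  That
is consistent with the descriptor ring / Tx read guess, but it does not
prove anything by itself.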

- Alex
