Re: [E1000-devel] AMD-Vi: Event logged IO_PAGE_FAULT - ixgbe Detected Tx Unit Hang - Reset adapter - master disable timed out

Lutz Vieweg Tue, 07 Jun 2016 02:38:26 -0700

On 06/06/2016 11:52 PM, Alexander Duyck wrote:
> On Mon, Jun 6, 2016 at 2:26 PM, Lutz Vieweg <[email protected]> wrote:
>> After updating a server with an Intel 10Gbase-T NIC from linux-4.4.1 to
>> linux-4.6.1 (vanilla, stable) we experienced (after ~2 days of operation)
>> the following bug:
>>
>> Jun  6 19:09:31 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT
>> device=04:00.0 domain=0x000e address=0x000000001004ecc0 flags=0x0050]
>> Jun  6 19:09:31 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT
>> device=04:00.0 domain=0x000e address=0x000000001004ed00 flags=0x0050]
>> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit
>> Hang#012  Tx Queue             <3>#012  TDH, TDT             <1ce>,
>> <1e6>#012  next_to_use          <1e6>#012  next_to_clean
>> <1ce>#012tx_buffer_info[next_to_clean]#012  time_stamp
>> <10f7b215d>#012  jiffies              <10f7b3244>
...
>> The ixgbe module was not able to restore the link after this, only "rmmod"
>> plus new initialization of the interface restored connectivity.
>>
>> Any idea what's going wrong, here?
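
For reference, the hexadecimal values in the "Detected Tx Unit Hang" message quoted above can be decoded with a few lines of Python. This is only a sketch: the numbers are copied from the log, while the HZ value used to convert jiffies to seconds is an assumption (it depends on the kernel configuration and is not stated in this thread).

```python
# Values copied from the "Detected Tx Unit Hang" log message (all hex).
tdh = 0x1ce                # TDH: Tx descriptor head, advanced by the NIC
tdt = 0x1e6                # TDT: Tx descriptor tail, advanced by the driver
time_stamp = 0x10f7b215d   # tx_buffer_info[next_to_clean] time_stamp
jiffies = 0x10f7b3244      # jiffies at the time of the report

# No ring wrap-around needs handling here because tdt > tdh.
pending = tdt - tdh
stalled_jiffies = jiffies - time_stamp

print(f"descriptors between head and tail: {pending}")
print(f"jiffies since time_stamp: {stalled_jiffies}")

# ASSUMPTION: HZ=1000. The real value depends on CONFIG_HZ of the kernel.
ASSUMED_HZ = 1000
print(f"approx. seconds stalled (if HZ={ASSUMED_HZ}): "
      f"{stalled_jiffies / ASSUMED_HZ:.1f}")
```

In the log, next_to_use equals TDT and next_to_clean equals TDH, so the driver's view of the ring matches the hardware registers; the queue simply stopped making progress.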


> There could be a number of things going on here.  Based on the offset
> of the fault it looks like an error on either a descriptor ring or Tx
> read since the Rx should be 2K aligned resulting in a write offsets
> that are no less than 128 byte aligned.
>
> One thing that might be useful would be to provide an lspci -vvv dump
> for the system just after the error has occurred.  It is possible that
> there may be additional data available in the advanced error reporting
> registers.
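
The alignment argument quoted above can be checked against the two fault addresses from the log. The following is a minimal sketch; the helper name `alignment` is made up for illustration, and the 128-byte threshold is taken from the quoted explanation, not from the ixgbe source.

```python
def alignment(addr: int) -> int:
    """Return the largest power of two that divides addr (addr != 0)."""
    return addr & -addr

# Addresses copied from the two IO_PAGE_FAULT log lines.
fault_addresses = [0x000000001004ecc0, 0x000000001004ed00]

for addr in fault_addresses:
    print(f"{addr:#x}: offset within 4K page = {addr & 0xfff:#x}, "
          f"aligned to {alignment(addr)} bytes")

# Distance between the two faulting addresses.
print(f"distance: {fault_addresses[1] - fault_addresses[0]} bytes")
```

The first address is only 64-byte aligned, which fits the statement that it is unlikely to be an Rx write if those are at least 128-byte aligned. The second address is 256-byte aligned and, taken alone, would not rule that out.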

I will record a "lspci -vvv" dump of the X540-AT2 NIC if the error
reoccurs. (When I run "lspci -vvv" now it does not seem to output
anything resembling an error reporting register - but
of course the ixgbe module was reloaded after the error.)

> Also it might be useful to try and determine reproduction
> steps for this.  If you can narrow down what is done to trigger this
> error it would be easier for us to figure out what is causing it.

That could be difficult, since the error occurred after ~2 days
of continuous operation, and only once so far.

But I did make one more interesting observation: One other server,
which is connected to the same 10Gbase-T switch, using an Intel 82598EB
NIC, experienced a two-second link outage, 28 minutes after the
incident reported above:

> Jun  6 19:37:39 computer2 kernel: ixgbe 0000:04:00.0 enp4s0: NIC Link is Down
> Jun  6 19:37:41 computer2 kernel: ixgbe 0000:04:00.0 enp4s0: NIC Link is Up 10 Gbps, Flow Control: RX/TX

Nothing else happened on that server. And three more servers that are
also connected to the same 10Gbase-T switch experienced no link outage
at all.
A grep through the log archives of all the servers connected to that
switch tells me that no other "link is down" events occurred during the
last year.
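
The timing relation between the two incidents can be reproduced from the log lines themselves. This is a rough sketch, not the search actually run on the archives: the sample lines are copied from this mail (with the wrapped lines joined), the regular expression is written for this syslog layout only, and the year is an assumption because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Sample lines copied from this mail; wrapped lines have been joined.
SAMPLE_LOG = """\
Jun  6 19:09:31 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT device=04:00.0 domain=0x000e address=0x000000001004ecc0 flags=0x0050]
Jun  6 19:37:39 computer2 kernel: ixgbe 0000:04:00.0 enp4s0: NIC Link is Down
Jun  6 19:37:41 computer2 kernel: ixgbe 0000:04:00.0 enp4s0: NIC Link is Up 10 Gbps, Flow Control: RX/TX
"""

LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) kernel: (.*)$")
ASSUMED_YEAR = 2016  # ASSUMPTION: syslog timestamps omit the year


def parse(line):
    """Return (timestamp, host, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    stamp = datetime.strptime(f"{ASSUMED_YEAR} {m.group(1)}",
                              "%Y %b %d %H:%M:%S")
    return stamp, m.group(2), m.group(3)


events = [e for e in map(parse, SAMPLE_LOG.splitlines()) if e is not None]

# First occurrence of each event type in the sample.
fault = next(t for t, _, msg in events if "IO_PAGE_FAULT" in msg)
down = next(t for t, _, msg in events if "NIC Link is Down" in msg)
up = next(t for t, _, msg in events if "NIC Link is Up" in msg)

print(f"link down after first page fault: {down - fault}")
print(f"link outage duration: {up - down}")
```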

So it might be that the error I reported is somehow triggered only
when certain network glitches occur.

Regards,

Lutz Vieweg


_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired
