On 06/06/2016 11:52 PM, Alexander Duyck wrote:
> On Mon, Jun 6, 2016 at 2:26 PM, Lutz Vieweg <l...@5t9.de> wrote:
>> After updating a server with an Intel 10Gbase-T NIC from linux-4.4.1 to
>> linux-4.6.1 (vanilla, stable) we experienced (after ~2 days of operation)
>> the following bug:
>>
>> Jun 6 19:09:31 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT
>> device=04:00.0 domain=0x000e address=0x000000001004ecc0 flags=0x0050]
>> Jun 6 19:09:31 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT
>> device=04:00.0 domain=0x000e address=0x000000001004ed00 flags=0x0050]
>> Jun 6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit
>> Hang#012 Tx Queue <3>#012 TDH, TDT <1ce>,
>> <1e6>#012 next_to_use <1e6>#012 next_to_clean
>> <1ce>#012tx_buffer_info[next_to_clean]#012 time_stamp
>> <10f7b215d>#012 jiffies <10f7b3244>
...
>> The ixgbe module was not able to restore the link after this, only "rmmod"
>> plus new initialization of the interface restored connectivity.
>>
>> Any idea what's going wrong, here?
>
> There could be a number of things going on here. Based on the offset
> of the fault it looks like an error on either a descriptor ring or Tx
> read since the Rx should be 2K aligned resulting in write offsets
> that are no less than 128 byte aligned.
>
> One thing that might be useful would be to provide an lspci -vvv dump
> for the system just after the error has occurred. It is possible that
> there may be additional data available in the advanced error reporting
> registers.

I will record a "lspci -vvv" dump of the X540-AT2 NIC if the error recurs.

(When I run "lspci -vvv" now, the output does not seem to contain anything
resembling an error reporting register, but of course the ixgbe module
was reloaded after the error.)

> Also it might be useful to try and determine reproduction
> steps for this. If you can narrow down what is done to trigger this
> error it would be easier for us to figure out what is causing it.

That could be difficult, since the error occurred after ~2 days of
continuous operation, and only once so far.

But I did make one more interesting observation:
One other server, which is connected to the same 10Gbase-T switch and
uses an Intel 82598EB NIC, experienced a two-second link outage
28 minutes after the incident reported above:

> Jun 6 19:37:39 computer2 kernel: ixgbe 0000:04:00.0 enp4s0: NIC Link is Down
> Jun 6 19:37:41 computer2 kernel: ixgbe 0000:04:00.0 enp4s0: NIC Link is Up
> 10 Gbps, Flow Control: RX/TX

Nothing else happened on that server. Three more servers that are
also connected to the same 10Gbase-T switch experienced no link outage
at all.

A grep through the log archives of all the servers connected to that
switch tells me that no other "link is down" events occurred during the
last year.

So it might be that the error I reported is triggered only when
certain network glitches occur.

Regards,

Lutz Vieweg
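P.S.: The alignment argument quoted above can be checked directly against
the numbers in the log. What follows is only a minimal sketch, not a
diagnosis: the helper names (largest_alignment, is_aligned) are made up
for illustration, the Tx arithmetic assumes that TDH/TDT are the hardware
head and software tail indices and that the ring did not wrap between
them, and the jiffies difference is left unconverted because the CONFIG_HZ
of this kernel is not stated anywhere in the thread.

```python
# Values copied from the kernel log lines quoted in this thread.
FAULT_ADDRESSES = [0x000000001004ecc0, 0x000000001004ed00]
TDH, TDT = 0x1ce, 0x1e6                      # "TDH, TDT <1ce>, <1e6>"
TIME_STAMP, JIFFIES = 0x10f7b215d, 0x10f7b3244


def largest_alignment(addr: int) -> int:
    """Largest power of two that divides addr (addr must be nonzero)."""
    return addr & -addr


def is_aligned(addr: int, boundary: int) -> bool:
    """True if addr is a multiple of boundary."""
    return addr % boundary == 0


for addr in FAULT_ADDRESSES:
    print(f"{addr:#018x}: largest alignment {largest_alignment(addr)} bytes, "
          f"128-byte aligned: {is_aligned(addr, 128)}")

# Assumption: no ring wraparound between head and tail.
pending = TDT - TDH
print(f"descriptors between TDH and TDT: {pending}")

# Age of the oldest pending Tx buffer, in jiffies (HZ unknown, so no seconds).
stalled_jiffies = JIFFIES - TIME_STAMP
print(f"jiffies since time_stamp: {stalled_jiffies}")
```

If the first faulting address is indeed only 64-byte aligned, that would be
consistent with the suggestion that the fault was not an Rx buffer write,
but it does not by itself tell a descriptor ring access from a Tx read.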
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired