On Mon, Jun 6, 2016 at 2:26 PM, Lutz Vieweg <l...@5t9.de> wrote:
> After updating a server with an Intel 10Gbase-T NIC from linux-4.4.1 to
> linux-4.6.1 (vanilla, stable) we experienced (after ~2 days of operation)
> the following bug:
>
> Jun  6 19:09:31 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT
> device=04:00.0 domain=0x000e address=0x000000001004ecc0 flags=0x0050]
> Jun  6 19:09:31 computer kernel: AMD-Vi: Event logged [IO_PAGE_FAULT
> device=04:00.0 domain=0x000e address=0x000000001004ed00 flags=0x0050]
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit
> Hang#012  Tx Queue             <3>#012  TDH, TDT             <1ce>,
> <1e6>#012  next_to_use          <1e6>#012  next_to_clean
> <1ce>#012tx_buffer_info[next_to_clean]#012  time_stamp
> <10f7b215d>#012  jiffies              <10f7b3244>
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit
> Hang#012  Tx Queue             <1>#012  TDH, TDT             <fc>, <108>#012
>  next_to_use          <108>#012  next_to_clean
> <fc>#012tx_buffer_info[next_to_clean]#012  time_stamp
> <10f7b28c5>#012  jiffies              <10f7b3244>
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit
> Hang#012  Tx Queue             <0>#012  TDH, TDT             <16b>,
> <16f>#012  next_to_use          <16f>#012  next_to_clean
> <16b>#012tx_buffer_info[next_to_clean]#012  time_stamp
> <10f7b21d0>#012  jiffies              <10f7b3244>
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit
> Hang#012  Tx Queue             <4>#012  TDH, TDT             <69>, <8b>#012
>  next_to_use          <8b>#012  next_to_clean
> <69>#012tx_buffer_info[next_to_clean]#012  time_stamp
> <10f7b215d>#012  jiffies              <10f7b3244>
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 1
> detected on queue 1, resetting adapter
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 1
> detected on queue 0, resetting adapter
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit
> Hang#012  Tx Queue             <10>#012  TDH, TDT             <1c3>,
> <1c9>#012  next_to_use          <1c9>#012  next_to_clean
> <1c3>#012tx_buffer_info[next_to_clean]#012  time_stamp
> <10f7b215d>#012  jiffies              <10f7b3244>
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 1
> detected on queue 4, resetting adapter
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: initiating reset
> due to tx timeout
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: initiating reset
> due to tx timeout
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 1
> detected on queue 10, resetting adapter
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: initiating reset
> due to tx timeout
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: initiating reset
> due to tx timeout
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: Reset adapter
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 2
> detected on queue 3, resetting adapter
> Jun  6 19:09:35 computer kernel: ixgbe 0000:04:00.0: master disable timed out
> Jun  6 19:09:36 computer kernel: br0: port 1(enp4s0) entered disabled state
> Jun  6 19:09:42 computer kernel: ixgbe 0000:04:00.0 enp4s0: NIC Link is Up
> 10 Gbps, Flow Control: RX/TX
> Jun  6 19:09:42 computer kernel: br0: port 1(enp4s0) entered blocking state
> Jun  6 19:09:42 computer kernel: br0: port 1(enp4s0) entered forwarding state
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit
> Hang#012  Tx Queue             <12>#012  TDH, TDT             <0>, <2>#012
> next_to_use          <2>#012  next_to_clean
> <0>#012tx_buffer_info[next_to_clean]#012  time_stamp
> <10f7b4c20>#012  jiffies              <10f7b544c>
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 2
> detected on queue 12, resetting adapter
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: initiating reset
> due to tx timeout
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: Reset adapter
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 0 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 1 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 2 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 3 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 4 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 5 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 6 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 7 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 8 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 9 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 10 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 11 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 12 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 13 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 14 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 15 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 16 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 17 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 18 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 19 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 20 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 21 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 22 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 23 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 24 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 25 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 26 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 27 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 28 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 29 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 30 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 31 not cleared within the polling period
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0: master disable timed out
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 0 not cleared within the polling period
> ...
> Jun  6 19:09:44 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 31 not cleared within the polling period
> Jun  6 19:09:45 computer kernel: br0: port 1(enp4s0) entered disabled state
> Jun  6 19:09:50 computer kernel: ixgbe 0000:04:00.0 enp4s0: NIC Link is Up
> 10 Gbps, Flow Control: RX/TX
> Jun  6 19:09:50 computer kernel: br0: port 1(enp4s0) entered blocking state
> Jun  6 19:09:50 computer kernel: br0: port 1(enp4s0) entered forwarding state
> Jun  6 19:09:53 computer kernel: ixgbe 0000:04:00.0 enp4s0: Detected Tx Unit
> Hang#012  Tx Queue             <24>#012  TDH, TDT             <0>, <5>#012
> next_to_use          <5>#012  next_to_clean
> <0>#012tx_buffer_info[next_to_clean]#012  time_stamp
> <10f7b6e20>#012  jiffies              <10f7b767c>
> Jun  6 19:09:53 computer kernel: ixgbe 0000:04:00.0 enp4s0: tx hang 3
> detected on queue 24, resetting adapter
> Jun  6 19:09:53 computer kernel: ixgbe 0000:04:00.0 enp4s0: initiating reset
> due to tx timeout
> Jun  6 19:09:53 computer kernel: ixgbe 0000:04:00.0 enp4s0: Reset adapter
> Jun  6 19:09:53 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 0 not cleared within the polling period
> ...
> Jun  6 19:09:53 computer kernel: ixgbe 0000:04:00.0 enp4s0: RXDCTL.ENABLE on
> Rx queue 31 not cleared within the polling period
> Jun  6 19:09:53 computer kernel: ixgbe 0000:04:00.0: master disable timed out
>
>
> The ixgbe module was not able to restore the link after this, only "rmmod"
> plus new initialization of the interface restored connectivity.
>
> Any idea what's going wrong, here?
>
> Regards,
>
> Lutz Vieweg

There could be a number of things going on here.  Based on the offset
of the faulting addresses, it looks like an error on either a descriptor
ring access or a Tx read: the Rx buffers should be 2K aligned, so Rx
writes should land at offsets that are at least 128-byte aligned.
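
To make the alignment argument concrete, here is a rough sketch that
checks the two fault addresses from the log.  The 128-byte threshold is
simply the assumption stated above, and the helper names are made up for
illustration; nothing here is taken from the driver itself.

```python
# Fault addresses copied from the two IO_PAGE_FAULT lines in the quoted log.
FAULT_ADDRESSES = [0x1004ECC0, 0x1004ED00]

# Assumption from the reasoning above: Rx writes land on offsets that are
# at least 128-byte aligned.
RX_MIN_ALIGNMENT = 128


def natural_alignment(addr: int) -> int:
    """Return the largest power of two that divides addr (addr must be > 0)."""
    return addr & -addr


def could_be_rx_write(addr: int) -> bool:
    """True if addr is aligned to at least RX_MIN_ALIGNMENT bytes."""
    return addr % RX_MIN_ALIGNMENT == 0


for addr in FAULT_ADDRESSES:
    print(f"{addr:#x}: alignment={natural_alignment(addr)} "
          f"rx_candidate={could_be_rx_write(addr)}")
```

Under that assumption the first address (only 64-byte aligned) does not
look like an Rx write, and the two faults are 0x40 bytes apart.  This is
a hint rather than proof of what the device was accessing.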

One thing that would be useful is an lspci -vvv dump for the system
taken just after the error has occurred.  There may be additional data
available in the advanced error reporting registers.  It would also
help to determine reproduction steps for this.  If you can narrow down
what triggers the error, it will be easier for us to figure out what is
causing it.
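
If it helps, something along these lines could be used to grab the dump
right after the fault shows up.  This is only a sketch: the function
names and the output file name are placeholders, the device address is
the one from the log above, and it assumes lspci is run with enough
privileges to read the extended capability registers.

```python
import subprocess
from pathlib import Path


def build_lspci_cmd(device: str) -> list[str]:
    """Build the lspci command line for a verbose dump of one device."""
    # -vvv: most verbose output; -s: restrict the dump to the given slot.
    return ["lspci", "-vvv", "-s", device]


def capture_lspci(device: str, out_path: str) -> None:
    """Run lspci for the device and save its output to out_path."""
    result = subprocess.run(
        build_lspci_cmd(device),
        capture_output=True,
        text=True,
        check=True,  # raise if lspci exits with an error
    )
    Path(out_path).write_text(result.stdout)


# Example call (hypothetical output file name):
# capture_lspci("04:00.0", "lspci-after-fault.txt")
```

Dropping the -s filter would dump every device, which is closer to the
full system dump asked for above.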

Thanks.

- Alex

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired
