e1000-devel list,

Problem:
With kernel 4.19.29 and igb 5.4.0-k on Intel E5-2618Lv4 and E5-2648Lv4 servers:

# ethtool -i eth1
driver: igb
version: 5.4.0-k
firmware-version: 1.63, 0x800009fa
expansion-rom-version:
bus-info: 0000:01:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

[    9.417249] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
[    9.424543] igb: Copyright (c) 2007-2014 Intel Corporation.
[    9.487321] igb 0000:01:00.0: added PHC on eth0
[    9.492188] igb 0000:01:00.0: Intel(R) Gigabit Ethernet Network Connection
[    9.499396] igb 0000:01:00.0: eth0: (PCIe:5.0Gb/s:Width x4) 00:50:cc:1d:bd:c6
[    9.506939] igb 0000:01:00.0: eth0: PBA No: 106300-000
[    9.512405] igb 0000:01:00.0: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
[    9.597066] igb 0000:01:00.1: added PHC on eth1
[    9.601934] igb 0000:01:00.1: Intel(R) Gigabit Ethernet Network Connection
[    9.609148] igb 0000:01:00.1: eth1: (PCIe:5.0Gb/s:Width x4) 00:50:cc:1d:bd:c7
[    9.616694] igb 0000:01:00.1: eth1: PBA No: 106300-000
[    9.622162] igb 0000:01:00.1: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)
[    9.684412] igb 0000:01:00.2: added PHC on eth2
[    9.689504] igb 0000:01:00.2: Intel(R) Gigabit Ethernet Network Connection
[    9.696713] igb 0000:01:00.2: eth2: (PCIe:5.0Gb/s:Width x4) 00:50:cc:1d:bd:c8
[    9.704256] igb 0000:01:00.2: eth2: PBA No: 106300-000
[    9.709720] igb 0000:01:00.2: Using MSI-X interrupts. 8 rx queue(s), 8 tx queue(s)

PCI config space for e.g. eth0:

01:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
        Subsystem: Seagate Technology PLC Device 8005
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 128 bytes
        Interrupt: pin A routed to IRQ 16
        NUMA node: 0
        Region 0: Memory at 97840000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at 2040 [size=32]
        Region 3: Memory at 97868000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00002000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <4us, L1 <32us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [140 v1] Device Serial Number 00-50-cc-ff-ff-1d-bd-c6
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 1
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
                IOVSta: Migration-
                Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 384, stride: 4, Device ID: 1520
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 0000183fffea0000 (64-bit, prefetchable)
                Region 3: Memory at 0000183fffe80000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [1a0 v1] Transaction Processing Hints
                Device specific mode supported
                Steering table in TPH capability structure
        Capabilities: [1c0 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [1d0 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Kernel driver in use: igb

Using the default 8 MSI-X vectors (combined TX/RX queues), after 2 to 30 days of runtime we suddenly see unhandled interrupts, back-to-back TX timeout / Reset adapter sequences, or both at the same time on igb, e.g.:

2019-12-01T17:57:33.797+0000 controller- user.emerg kernel: [919635.664612] do_IRQ: 13.53 No irq handler for vector
2019-12-01T17:57:35.845+0000 controller- user.emerg kernel: [919637.712587] do_IRQ: 13.53 No irq handler for vector
2019-12-01T17:57:37.829+0000 controller- user.emerg kernel: [919639.696569] do_IRQ: 13.53 No irq handler for vector
2019-12-01T17:57:39.237+0000 controller- user.err kernel: [919641.103011] igb 0000:01:00.1 eth1: Reset adapter
2019-12-01T17:57:39.237+0000 controller- user.emerg kernel: [919641.103021] do_IRQ: 13.53 No irq handler for vector
2019-12-01T17:57:39.266+0000 controller- user.err kernel: [919641.132142] igb 0000:01:00.2 eth2: Reset adapter
2019-12-01T17:57:39.268+0000 controller- user.info kernel: [919641.139341] igb 0000:01:00.2 eth2: igb: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
2019-12-01T17:57:39.346+0000 controller- user.emerg kernel: [919641.212614] do_IRQ: 13.53 No irq handler for vector
2019-12-01T17:57:39.363+0000 controller- user.info kernel: [919641.234249] igb 0000:01:00.2 eth2: igb: eth2 NIC Link is Down
2019-12-01T17:57:41.796+0000 controller- user.emerg kernel: [919643.663562] do_IRQ: 13.53 No irq handler for vector
2019-12-01T17:57:42.790+0000 controller- user.info kernel: [919644.661800] igb 0000:01:00.2 eth2: igb: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
2019-12-01T17:57:43.841+0000 controller- user.info kernel: [919645.712796] igb 0000:01:00.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
2019-12-01T17:57:43.849+0000 controller- user.emerg kernel: [919645.714693] do_IRQ: 13.53 No irq handler for vector
2019-12-01T17:57:45.831+0000 controller- user.emerg kernel: [919647.698208] do_IRQ: 13.53 No irq handler for vector

Additional observations:

1. The unhandled MSI-X interrupt message above happens in most cases every 2 seconds, and it appears to be coming from igb_watchdog_task(), i.e. from this commit:

   commit 7a6ea550f2f7592742ac765e5a3b4b5d1461e0bd
   Author: Alexander Duyck <alexander.h.du...@intel.com>
   Date:   Tue Aug 26 04:25:03 2008 -0700

       igb: force all queues to interrupt once every 2 seconds

2. We have found that when we ifconfig eth1 down / ifconfig eth1 up, the unhandled interrupt problem stops, but the back-to-back TX timeout / Reset adapter sequences do not stop.

3. We have found that rmmod igb / insmod igb makes both problems go away temporarily.

4. We have found that for both the unhandled interrupt issue and the TX timeout issue, if we look at the TX/RX MSI-X interrupt counts for one igb interface in /proc/interrupts, one of them has stopped incrementing and never increments again (not every 2 seconds anymore, and not even once after several days), e.g.:

   # date ; cat /proc/interrupts | grep eth1 | grep "51:"
   Thu Dec 19 17:41:04 GMT 2019
   51: 0 19992 22218 40124 32935 28772 35734 48672 24207 49188 3171172 65812 36653 17586 46483 23624 40544 39936 37085 11671 PCI-MSI 526341-edge eth1-TxRx-4
   # date ; cat /proc/interrupts | grep eth1 | grep "51:"
   Thu Dec 19 17:41:12 GMT 2019
   51: 0 19992 22218 40124 32935 28772 35734 48672 24207 49188 3171172 65812 36653 17586 46483 23624 40544 39936 37085 11671 PCI-MSI 526341-edge eth1-TxRx-4

   The other TX/RX vector counts keep incrementing on the igb interfaces. (A rough monitoring sketch is included below my signature.)

5. We do not see either problem with our older kernel/igb driver, i.e. kernel 3.10.0-327.18.2.el7 (CentOS) with igb driver 5.2.15-k.

6. We have several other devices using multi-vector MSI-X on this 4.19.29 kernel/hardware without any issues.

7. We think PCIe ASPM has been disabled by the kernel at boot time:

   [    2.981858] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it

8. We have not performed any runtime PM operations:

   # cat /sys/class/net/eth1/power/runtime_suspended_time
   0

   (The commands we use to double-check observations 7 and 8 are also below my signature.)

What could be causing these issues?

Thank you for your reply!

-Adam Radford
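
P.S. In case it helps anyone trying to catch this in the act, here is a rough sketch of the loop we leave running to spot the moment a vector stops incrementing (observation 4). The vector name "eth1-TxRx-4" and the 10-second window are just example values for this system; the expectation behind it is the watchdog commit from observation 1, which forces every queue vector to fire at least once every 2 seconds, so a healthy vector should never sit unchanged for 10 seconds:

#!/bin/sh
# Warn when the per-CPU interrupt counts for one igb MSI-X vector stop
# changing. A healthy vector should change at least every 2 seconds due
# to the igb watchdog's forced interrupt, so 10s of no change is suspect.
prev=""
while true; do
    cur=$(grep "eth1-TxRx-4" /proc/interrupts)
    if [ -n "$prev" ] && [ "$cur" = "$prev" ]; then
        echo "$(date): eth1-TxRx-4 counts unchanged for 10s: $cur"
    fi
    prev="$cur"
    sleep 10
done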
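
Also for reference, these are the commands we use to double-check observations 7 and 8. The BDF 0000:01:00.1 and interface eth1 are specific to this system; substitute your own:

# ASPM capability and negotiated state for the I350 function (LnkCap/LnkCtl)
lspci -s 01:00.1 -vvv | grep -E "LnkCap:|LnkCtl:"
# Runtime PM state for the PCI function and the netdev
cat /sys/bus/pci/devices/0000:01:00.1/power/control
cat /sys/bus/pci/devices/0000:01:00.1/power/runtime_status
cat /sys/class/net/eth1/power/runtime_suspended_time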