Hello Mathias, Roger, Gabriel
I've been chasing strange MSI / legacy IRQ behavior from xHCI for a
couple days and wanted to report a few things that may be effected by
Gabriel's recent "xhci: Cleanup only when releasing primary hcd" patch
(more on this at the bottom).
After 8c24d6d7b09d "usb: xhci: stop everything on the first call to
xhci_stop" (v4.3+), there exists a device removal window that can later
leak interrupts and other device resources ordinarily cleaned up by
xhci_stop.
The quick summary is that if the xhci_handshake call from xhci_halt
fails, it will set xhci->xhc_state |= XHCI_STATE_HALTED. Later when the
device is removed, xhci_stop keys off the xhc_state and will exit early
if XHCI_STATE_HALTED is set.
This manifested itself on my Stratus machine while running repeated
device hotplug removals at various timing intervals during
initialization (for example 3, 6 or 12 seconds after hotplug add). I
didn't grab a backtrace of xhci_halt, so I'm not exactly sure why it was
called after xhci_run, but note that it is not only called from
xhci_stop, but also xhci_handle_command_timeout and others.
The result was that a later device initialization couldn't allocate its
MSI interrupts -- after much instrumentation I finally traced it back to
the previous initialization / removal failing to free the MSI irqs it
requested.
Here's the relevent snippet from added trace logging.
(ftmod_ioboard_event and ioboard_bringdown are functions from an
out-of-tree PCI hotplug driver for Stratus hardware platforms. These
routines will call into PCI API to add or remove a set of PCI devices so
I left them here to help mark device lifetimes.)
Last successful MSI allocation
==============================
java-17858 [002] .... 3005.763238: ftmod_ioboard_event: adding bus
03
java-17858 [009] .... 3009.780359: usb_add_hcd:
request_irq(irq=10, desc(ffff88085f41bba8), hcd(ffff8808518e53d8)) return=0
java-17858 [004] .... 3009.808979: usb_add_hcd:
request_irq(irq=11, desc(ffff88085f41bef8), hcd(ffff880482d992a8)) return=0
java-17858 [008] .... 3009.828263: xhci_halt:
xhci(ffff88083f5a8258)
java-17858 [008] .... 3009.828269: xhci_quiesce:
xhci(ffff88083f5a8258) halted=1
java-17858 [008] .... 3009.828273: xhci_quiesce:
xhci(ffff88083f5a8258) cmd=0
java-17858 [008] .... 3009.829614: xhci_setup_msix:
xhci=ffff88083f5a8258
java-17858 [008] .... 3009.829627: xhci_setup_msix:
xhci=ffff88083f5a8258 Failed to enable MSI-X, ret=-22
java-17858 [008] .... 3009.829635: xhci_run: xhci=ffff88083f5a8258
java-17858 [008] .... 3009.830069: xhci_run: request_irq(119, ...)
MSI desc=ffff88083ac32e68 ret=0
java-17858 [007] .... 3010.009977: xhci_halt:
xhci(ffff88083f5a8258)
xhci_halt is called, perhaps from
xhci_handle_command_timeout?
java-17858 [007] .... 3010.009985: xhci_quiesce:
xhci(ffff88083f5a8258) halted=1
java-17858 [007] .... 3010.009986: xhci_quiesce:
xhci(ffff88083f5a8258) cmd=ffffffff
cmd = ~0, so the hardware is gone! subsequent
call to xhci_handshake will detect this and return
-ENODEV, xhci_halt will then set
xhci->xhc_state |= XHCI_STATE_HALTED
java-17858 [007] d... 3010.010381: xhci_irq:
hcd(ffff88083f5a8008)->irq=0 (depth=0, count=300, unhandled=0), status==~0
xhci_irq notices the missing hardware, too.
java-17858 [007] .... 3010.014476: xhci_stop:
xhci(ffff88083f5a8258)->xhc_state=2
xhci_stop is called and exits early because
(xhci->xhc_state & XHCI_STATE_HALTED)
Notice no entries for xhci_free_irq!
java-17858 [007] .... 3010.029101: ftmod_ioboard_event: adding bus
79
kworker/u97:3-974 [009] .... 3013.264132: ioboard_bringdown: board 0
kworker/u97:3-974 [009] .... 3013.264145: ioboard_bringdown: sbus=0,
removing bus 03
kworker/u97:3-974 [005] .... 3016.551113: ioboard_bringdown: sbus=1,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551119: ioboard_bringdown: sbus=2,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551119: ioboard_bringdown: sbus=3,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551119: ioboard_bringdown: sbus=4,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551120: ioboard_bringdown: sbus=5,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551120: ioboard_bringdown: sbus=6,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551120: ioboard_bringdown: sbus=7,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551120: ioboard_bringdown: sbus=8,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551120: ioboard_bringdown: sbus=9,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551124: ioboard_bringdown: sbus=10,
pdev = (null)
kworker/u97:3-974 [005] .... 3016.551125: ioboard_bringdown: sbus=11,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551126: ioboard_bringdown: sbus=12,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551126: ioboard_bringdown: sbus=13,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551126: ioboard_bringdown: sbus=14,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551127: ioboard_bringdown: sbus=15,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551127: ioboard_bringdown: sbus=16,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551127: ioboard_bringdown: sbus=17,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551127: ioboard_bringdown: sbus=18,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551127: ioboard_bringdown: sbus=19,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551128: ioboard_bringdown: sbus=20,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551128: ioboard_bringdown: sbus=21,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551128: ioboard_bringdown: sbus=22,
bs->TopLevelBridge[sbus] = 0
kworker/u97:3-974 [005] .... 3016.551128: ioboard_bringdown: sbus=23,
bs->TopLevelBridge[sbus] = 0
First failed MSI allocation
===========================
java-17858 [011] .... 3084.238393: ftmod_ioboard_event: adding bus
03
java-17858 [005] .... 3088.071479: usb_add_hcd:
request_irq(irq=10, desc(ffff88085f41bba8), hcd(ffff8804f5cc2548)) return=0
java-17858 [005] .... 3088.096260: usb_add_hcd:
request_irq(irq=11, desc(ffff88085f41bef8), hcd(ffff8804f5cc6678)) return=0
java-17858 [000] .N.. 3088.121585: xhci_halt:
xhci(ffff880857730258)
java-17858 [000] .N.. 3088.121593: xhci_quiesce:
xhci(ffff880857730258) halted=1
java-17858 [000] .N.. 3088.121597: xhci_quiesce:
xhci(ffff880857730258) cmd=0
java-17858 [000] .... 3088.123387: xhci_setup_msix:
xhci=ffff880857730258
java-17858 [000] .... 3088.123400: xhci_setup_msix:
xhci=ffff880857730258 Failed to enable MSI-X, ret=-22
java-17858 [000] .... 3088.123408: xhci_run: xhci=ffff880857730258
java-17858 [000] .... 3088.123486: msi_domain_alloc_irqs: virq =
-17
#define EEXIST 17 /* File exists */
java-17858 [000] .... 3088.123495: pci_enable_msi_range:
pci_msi_setup_msi_irqs ret=-28
java-17858 [000] .... 3088.123497: xhci_run: xhci=ffff880857730258
failed to allocate MSI entry, ret=-28
#define ENOSPC 28 /* No space left on device */
java-17858 [000] .... 3088.123548: xhci_run: request_irq(10, ...)
legacy desc=ffff88085f41bba8 ret=0
<idle>-0 [000] d.H. 3088.388279: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=1, unhandled=0), status=0
(!STS_EINT)
<idle>-0 [000] d.H. 3088.396057: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=2, unhandled=0), status=0
(!STS_EINT)
<idle>-0 [000] d.H. 3088.396424: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=3, unhandled=0), status=0
(!STS_EINT)
ksoftirqd/0-3 [000] d.H. 3088.491807: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=4, unhandled=0), status=0
(!STS_EINT)
kworker/0:1H-984 [000] d.H. 3088.495645: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=5, unhandled=0), status=0
(!STS_EINT)
kworker/0:1H-984 [000] d.H. 3088.496136: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=6, unhandled=0), status=0
(!STS_EINT)
<idle>-0 [000] dNH. 3088.508039: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=8, unhandled=0), status=0
(!STS_EINT)
kworker/0:1H-984 [000] d.H. 3088.511258: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=9, unhandled=0), status=0
(!STS_EINT)
kworker/0:1H-984 [000] d.H. 3088.515058: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=10, unhandled=0), status=0
(!STS_EINT)
ksoftirqd/0-3 [000] d.h. 3088.528961: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=11, unhandled=0), status=0
(!STS_EINT)
ksoftirqd/0-3 [000] d.H. 3088.529620: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=12, unhandled=0), status=0
(!STS_EINT)
ksoftirqd/0-3 [000] d.H. 3088.532033: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=13, unhandled=0), status=0
(!STS_EINT)
kworker/0:2-28078 [000] d.H. 3088.538757: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=14, unhandled=0), status=0
(!STS_EINT)
kworker/0:2-28078 [000] d.H. 3088.539227: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=15, unhandled=0), status=0
(!STS_EINT)
kworker/0:2-28078 [000] dNH. 3088.542054: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=16, unhandled=0), status=0
(!STS_EINT)
kworker/0:1H-984 [000] d.h. 3088.548702: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=17, unhandled=0), status=0
(!STS_EINT)
kworker/0:1H-984 [000] d.h. 3088.549322: xhci_irq:
hcd(ffff880857730008)->irq=10 (depth=0, count=18, unhandled=0), status=0
(!STS_EINT)
This continues until the device is removed later.
I'm currently testing with reverting 8c24d6d7b09d "usb: xhci: stop
everything on the first call to xhci_stop" to see how that fares.
In the meantime I was browsing recent linux-scsi archives and noticed
Gabriel's [PATCH v2] xhci: Cleanup only when releasing primary hcd:
http://marc.info/?l=linux-usb&m=146134802313366&w=2
which looks like it modifies xhci_stop() to continue on to
xhci_cleanup_msix even if XHCI_STATE_HALTED (for primary host
controller's anyway).
I think this may avoid the issue that I've been seeing and was wondering
if it would be safe to test this patch on top of 4.6.0-rc3? (An older
RC because my repro is very timing dependent, so moving the kernel
forward / backwards or adding excessive trace logging can make it hard
to debug.)
Regards,
-- Joe
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html